VIEW
Communicated by Zach Mainen
Mechanisms Shaping Fast Excitatory Postsynaptic Currents in the Central Nervous System

Mladen I. Glavinović
[email protected]
Departments of Anaesthesia Research and Physiology, McGill University, Montreal, PQ H3G 1Y6, Canada
How different factors contribute to determining the time course of the basic element of fast glutamate-mediated excitatory postsynaptic currents (mEPSCs) in the central nervous system has been a focus of interest of neurobiologists for some years. In spite of intensive investigations, these mechanisms are not well understood. In this review, basic hypotheses are summarized, and a new hypothesis is proposed, which holds that desensitization of AMPA receptors plays a major role in shaping the time course of fast mEPSCs. According to the new hypothesis, desensitization shortens the time course of mEPSCs largely by reducing the buffering of glutamate molecules by AMPA receptors. The hypothesis accounts for numerous findings on fast mEPSCs and is expected to be equally fruitful as a framework for further experimental and theoretical investigations.

1 Factors Shaping the Kinetics of Fast mEPSCs: Historical Background

Fusion of synaptic vesicles with the presynaptic membrane of excitatory synapses in the central nervous system leads to the release of glutamate. Glutamate is believed to reach high concentrations before decaying rapidly (Clements, 1996). The kinetics of glutamate release, diffusion, binding by the receptors, and uptake by the transporters, as well as the synaptic geometry, are all expected to influence its spatiotemporal concentration profile in the synaptic cleft, which together with the receptor properties (kinetics, density, and spatial distribution) determines the amplitude and the time course of the basic element of synaptic transmission: the miniature excitatory postsynaptic current (mEPSC). The fast component of the excitatory postsynaptic current (EPSC) in neurons of the central nervous system results from activation of AMPA (α-amino-3-hydroxy-5-methyl-4-isoxazolepropionate)-type glutamate receptors. How different factors contribute to shaping the time course of mEPSCs has been a focus of interest of neurobiologists for some years.
Neural Computation 14, 1–19 (2001) © 2001 Massachusetts Institute of Technology

In spite of intensive investigations, these mechanisms are not well understood, though determining them is critical to issues of synaptic specificity and the induction of synaptic plasticity. In this review, basic hypotheses of
mechanisms shaping the time course of fast glutamate-mediated excitatory postsynaptic currents in the central nervous system are summarized, and a new hypothesis is proposed.

Three hypotheses have traditionally been put forward to explain the rate of decay of the synaptic current (Jonas & Spruston, 1994). The first follows the line of argument used to explain the time course of the postsynaptic current at the neuromuscular junction (Magleby & Stevens, 1972). The decay of glutamate concentration in the synaptic cleft is assumed to be very rapid, so the decay time of the postsynaptic current is determined by channel closure, similar to deactivation (Hestrin, Sah, & Nicoll, 1990; Tang, Shi, Katchman, & Lynch, 1991). The second hypothesis argues that the decay of glutamate concentration is slow and that the postsynaptic current is terminated by desensitization of AMPA receptor channels (Trussell, Thio, Zorumski, & Fischbach, 1988; Trussell & Fischbach, 1989; Isaacson & Nicoll, 1991; Vyklicky, Patneau, & Mayer, 1991; Hestrin, 1992; Trussell, Zhang, & Raman, 1993; Livsey, Costa, & Vicini, 1993; Raman & Trussell, 1995). This is plausible, since AMPA receptor channels desensitize very rapidly (they are the fastest-desensitizing ligand-gated ion channels). According to the third hypothesis, the postsynaptic current is determined in a complex manner by the time course of deactivation and desensitization, as well as by the glutamate concentration (whose decay has been argued to lie between the two extremes postulated in the first and second hypotheses; Clements, Lester, Tong, Jahr, & Westbrook, 1992; Barbour, Keller, Llano, & Marty, 1994; Tong & Jahr, 1994a).

2 Deactivation and Desensitization of Currents Evoked by Glutamate, Released Either Synaptically or by Local Application

Deactivation is by convention defined as the decrease of current following the end of a very brief pulse of glutamate applied to excised outside-out membrane patches.
Since it reflects the random closure of AMPA channels in the absence of glutamate, deactivation should determine the rate of decay of mEPSCs if the glutamate transient in the synaptic cleft is very fast. Intrinsic desensitization is defined as the decrease in current observed during a prolonged application of glutamate. The time constants of deactivation and desensitization of AMPA receptors have been studied in membrane patches from a variety of central neurons to evaluate these hypotheses. In inhibitory interneurons of the neocortex (Jonas, Racca, Sakmann, Seeburg, & Monyer, 1994) and hippocampus (Livsey et al., 1993) and in cochlear nucleus neurons (Raman & Trussell, 1992; Trussell et al., 1993), the gating of AMPA channels is rapid; it is slower in hippocampal granule cells and pyramidal neurons and in neocortical pyramidal neurons (Colquhoun, Jonas, & Sakmann, 1992; Hestrin, 1992, 1993; Jonas et al., 1994). In all these cells,
the time course of postsynaptic currents is well correlated with the time course of currents induced by short pulses of glutamate (Raman & Trussell, 1992; Silver, Traynelis, & Cull-Candy, 1992; Stern, Edwards, & Sakmann, 1992; Jonas, Major, & Sakmann, 1993). Because intrinsic desensitization is typically two- to four-fold slower than the decay of well-clamped fast EPSCs, many authors have argued that at these synapses desensitization is too slow to affect the time course of mEPSCs or even EPSCs (Colquhoun et al., 1992; Hestrin, 1992; Stern et al., 1992; Trussell et al., 1993; Jonas et al., 1993; Partin, Patneau, Winters, Mayer, & Buonanno, 1993; Jonas et al., 1994; Jonas & Spruston, 1994; Raman & Trussell, 1995; Edmonds, Gibb, & Colquhoun, 1995; Häusser & Roth, 1997).

3 Need for Reevaluation

For several reasons, these notions need to be reevaluated. First, the apparent time course of intrinsic desensitization is not necessarily a good indicator of the rate of entry into desensitized states, which may develop much faster; thus, a glutamate pulse much shorter than the time course of desensitization leads to a diminished response to a subsequent short test pulse of glutamate (Raman & Trussell, 1995). Second, the decay time of mEPSCs becomes markedly amplitude dependent when AMPA receptor desensitization is blocked by aniracetam or cyclothiazide (Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). This finding suggests that (1) desensitization of synaptic currents depends on the number of glutamate molecules released and therefore is concentration dependent (unlike intrinsic desensitization), and (2) in the absence of desensitization, another process renders the decay phase of mEPSCs concentration dependent. Alternative explanations may be given for the amplitude dependence of mEPSCs in the absence of desensitization. Larger mEPSCs may be of multivesicular origin, and the positive correlation may arise at least partly from asynchronous vesicular release.
The recent report that spontaneous release becomes less multivesicular with maturation (Wall & Usowicz, 1998) argues against such an explanation (both studies were done in hippocampal slices from adult rats; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). Another explanation would be that the deactivation rate is concentration dependent. High glutamate concentrations may preferentially activate large-conductance channels (Rosenmund, Stern-Bach, & Stevens, 1998) that remain open for longer times (Stern-Bach, Russo, Neuman, & Rosenmund, 1998). Though plausible, this seems an unlikely explanation for dentate granule cells or CA1 pyramidal cells of the hippocampus; in excised patches, the decay times of currents produced by short (1 ms) pulses of glutamate are essentially concentration independent (over the range 0.2–1.0 mM). In CA3 pyramidal cells of the hippocampus, however, such a mechanism may be more important (Colquhoun et al., 1992).
4 Current Observations

Our recordings in hippocampal slices showed that the removal of desensitization by specific pharmacological agents prolonged mEPSCs and greatly enhanced the amplitude dependence of their decay times. To investigate this phenomenon further, we made a detailed simulation of the underlying mechanisms, using well-established Monte Carlo methods to examine glutamate release into the synaptic cleft and its interaction with AMPA receptors, as well as the interaction of short and long pulses of glutamate with AMPA receptors in excised patches. More specifically, we aimed to clarify what factors shape the amplitude and time course of mEPSCs and what determines the relationship among the time courses of intrinsic desensitization, deactivation, and mEPSCs, and the time course of the occupancy of desensitized states (for synaptic and patch currents; Glavinović & Rabie, 1998). The Monte Carlo method follows individual glutamate molecules as they diffuse randomly within the synaptic cleft and interact with the postsynaptic receptors (see Figure 1A; Bartol, Land, Salpeter, & Salpeter, 1991; Wahl, Pouzat, & Stratford, 1996; Kruk, Horn, & Faber, 1997; Glavinović & Rabie, 1998; Glavinović, 1999). Our model assumes a 200 × 200 nm synaptic contact area and a 15-nm-wide cleft, bounded by a three-dimensional infinite space. Further assumptions are that the movement of a glutamate molecule depends only on its present position (and not on its history) and that the probability of each receptor's remaining in its present state or changing to another state depends only on its present state. Both the movement and the gating kinetics are thus Markovian processes. In the kinetic scheme for the receptors (see Figure 1B; Jonas et al., 1993), two glutamate molecules must bind to the receptor before the channel can open. An AMPA receptor can therefore be unbound (U), singly bound (SB), doubly bound (DB), open (O), or in one of three desensitized states (D1, D2, or D3).
The interaction between glutamate molecules and receptors is defined by the three binding rates in the kinetic scheme (k+1, k+2, and k+3) and by associating with each receptor state a surface area and a probability of binding, given that a glutamate molecule hits this receptor surface. There is no agreement concerning the level of occupancy of AMPA receptors at the peak of the current. Although AMPA receptors may saturate at some synapses, several lines of experimental evidence indicate that this is unlikely to occur in the cerebellum and hippocampus: noise analysis at single-site synapses in the cerebellum (Silver, Colquhoun, Cull-Candy, & Edmonds, 1996), the high coefficient of variation of single-site mEPSCs in hippocampal culture (Forti, Bossi, Bergamaschi, Villa, & Malgaroli, 1997), and the effects of local agonist applications at single-site synapses in hippocampal cultures (Liu, Choi, & Tsien, 1999). We therefore simulated mEPSCs generated by the release of glutamate molecules varying in number over a very wide range (150–20,000), covering all degrees of saturation.
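The random-walk component of such a simulation can be illustrated with a minimal sketch. The cleft geometry matches the text (200 × 200 nm contact area, instantaneous point source at the center), but the diffusion coefficient, time step, and the treatment of the lateral edge as purely absorbing are illustrative assumptions of this sketch, not parameters taken from the original simulations.

```python
import random

# Illustrative parameters (assumptions of this sketch, not fitted values):
D = 7.6e-10          # diffusion coefficient of glutamate, m^2/s (free solution)
HALF_WIDTH = 100e-9  # half-width of the 200 x 200 nm synaptic contact, m
DT = 1e-9            # time step, s

def fraction_remaining(n_molecules, n_steps, seed=0):
    """Random-walk (Monte Carlo) estimate of the fraction of glutamate
    molecules still inside the cleft after n_steps * DT seconds.  A molecule
    that crosses the lateral edge is treated as lost to the surrounding
    space (absorbing boundary) and is not tracked further."""
    rng = random.Random(seed)
    sigma = (2.0 * D * DT) ** 0.5  # r.m.s. displacement per axis per step
    remaining = 0
    for _ in range(n_molecules):
        x = y = 0.0                # instantaneous point source at the center
        for _ in range(n_steps):
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
            if abs(x) > HALF_WIDTH or abs(y) > HALF_WIDTH:
                break              # escaped laterally out of the cleft
        else:
            remaining += 1
    return remaining / n_molecules

# Most molecules escape within a few microseconds, consistent with the
# fast initial decay time constant discussed later in the review.
early = fraction_remaining(200, 500)   # after 0.5 microseconds
late = fraction_remaining(200, 5000)   # after 5 microseconds
```

Tracking receptor binding on top of this walk, with Markovian state transitions for each receptor, gives the full scheme described above.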
Figure 1: Schematic models of glutamate release into the synaptic cleft and the kinetics of gating of AMPA receptors. (A) Diagram (approximately to scale) showing how glutamate molecules, after release from an instantaneous point source, diffuse in the synaptic cleft and interact with postsynaptic receptors. (B) Kinetic scheme of gating of AMPA receptor channels by glutamate, used for the Monte Carlo simulations. U, SB, DB, and O indicate the unbound, singly bound, doubly bound, and open state, respectively. D1, D2, and D3 are three desensitized states. The rate constants, taken from Jonas et al. (1993), were adjusted for the temperature of the simulations (37°C), assuming a Q10 of 4.0 (Glavinović & Rabie, 1998). They are: k+1 = 3.67 × 10^7 M^-1 s^-1, k-1 = 3.41 × 10^4 s^-1, k+2 = 22.7 × 10^7 M^-1 s^-1, k-2 = 2.61 × 10^4 s^-1, k+3 = 1.02 × 10^7 M^-1 s^-1, k-3 = 366 s^-1 for glutamate binding; α = 3.39 × 10^4 s^-1 and β = 7200 s^-1 for channel opening; and α1 = 2.31 × 10^4 s^-1, β1 = 314 s^-1, α2 = 1376 s^-1, β2 = 5.82 s^-1, α3 = 142 s^-1, β3 = 32 s^-1, α4 = 134.4 s^-1, β4 = 1523 s^-1 for the desensitization pathway.
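The Q10 temperature adjustment mentioned in the caption is a simple scaling. The sketch below assumes a measurement temperature near room temperature (about 22°C, an assumption; the caption does not state the original recording temperature):

```python
def q10_adjust(rate, t_measured, t_target, q10=4.0):
    """Scale a kinetic rate constant from the temperature at which it was
    measured to the simulation temperature, assuming the rate increases
    q10-fold for every 10 degree (Celsius) rise."""
    return rate * q10 ** ((t_target - t_measured) / 10.0)

# A 22 -> 37 degree adjustment with Q10 = 4.0 multiplies every
# rate constant by 4.0 ** 1.5 = 8.0.
factor = q10_adjust(1.0, 22.0, 37.0)
```

The same Q10 is applied to all rates here; as noted later in the review, that uniformity is itself an assumption of the simulations.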
There was good agreement between the predictions of the simulations and the experimental observations: both the rise and decay times of mEPSCs are prolonged when desensitization is reduced or abolished, but the amplitude changes only marginally (Isaacson & Walmsley, 1996; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). The decay phase of mEPSCs, which is not (or is only modestly) amplitude dependent in the presence of desensitization, becomes markedly so when desensitization is absent (see Figure 2A; Isaacson & Walmsley, 1996; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). Intrinsic desensitization is concentration independent (Colquhoun et al., 1992) and two to three times slower than the decay of deactivation or the decay of "perfectly clamped" mEPSCs (see Figures 2B and 3A; Colquhoun et al., 1992; Hestrin, 1992; Stern et al., 1992; Trussell et al., 1993; Jonas et al., 1993, 1994; Partin et al., 1993; Jonas & Spruston, 1994; Raman & Trussell, 1995; Edmonds et al., 1995; Häusser & Roth, 1997; Glavinović & Rabie, 1998). Deactivation and perfectly clamped mEPSCs have very similar decay times (see Figures 2B and 3A; Raman & Trussell, 1992; Silver et al., 1992; Stern et al., 1992; Jonas et al., 1993; Glavinović & Rabie, 1998).

Figure 2: Under control conditions, the decay times of fast unitary excitatory postsynaptic currents (mEPSCs) are only marginally amplitude dependent, but they become markedly so when desensitization is suppressed. (A) Amplitude dependence of the decay time of mEPSCs, recorded from CA1 pyramidal cells in rat hippocampal slices, in control conditions and after suppression of AMPA receptor desensitization by 100 μM cyclothiazide (cyclo; from Atassi & Glavinović, 1999). (B) Amplitude dependence of the decay time of simulated fast mEPSCs, when desensitization is either present (control) or absent (No Des), with either single (Single Bind) or repeated (Repeat Bind) glutamate binding to receptors permitted. The amplitude was normalized to 1.0 when all AMPA channels are open. Over a wide range, the amplitude dependence of simulated mEPSCs is evident only if there is repeated binding; this indicates that desensitization shortens mEPSCs and markedly reduces the amplitude dependence of their decay times by eliminating repeated glutamate binding ("buffering") by AMPA receptors. The slower decay of observed mEPSCs (A versus B) can be ascribed to the lower temperature of the recordings (32°C versus 37°C for the simulations), without excluding likely different kinetics in CA1 versus CA3 neurons (the rate constants listed in Figure 1B were obtained from CA3 neurons; Jonas et al., 1993).

5 Revised Desensitization Hypothesis

Although it is clear that desensitization plays a major role in shaping the time course of postsynaptic currents, the picture that emerges from these simulations differs considerably from the classical desensitization hypothesis (the second hypothesis). According to our revised hypothesis, desensitization shapes the time course of mEPSCs largely by limiting the repeated binding of glutamate molecules to AMPA receptors (i.e., by reducing glutamate buffering by AMPA receptors) and, to a lesser extent, through the termination of burst openings of AMPA channels. This conclusion is indicated by two sets of observations on simulated mEPSCs. First, if only single binding of glutamate is permitted, their decay times are very similar (brief and virtually amplitude independent) in both the presence and absence of desensitization (see Figure 2B); second, if multiple glutamate binding is permitted, the decay times are long and strongly amplitude dependent only in the absence of desensitization. The short duration and the similar time course of mEPSCs and deactivation in the presence of desensitization can thus be explained by the absence of repeated binding of individual glutamate molecules.
In excised patches, this happens because only a brief pulse of glutamate is applied; during mEPSCs, repeated binding is curtailed by desensitization. The decay times are thus very similar not because desensitization is lacking but because of its importance.

The occupancy of the single bound activatable state is especially relevant if one wishes to understand what determines the time course of mEPSCs. As the simulations indicate, suppression of desensitization leads to a higher occupancy of the single bound state (see Figure 4A), as glutamate molecules both unbind from it and subsequently rebind more frequently. The high rate of glutamate binding to the singly bound state (6.2 times greater than binding to the unbound state; see Figure 1B; see also Jonas et al., 1993; Wahl et al., 1996; Glavinović & Rabie, 1998) further enhances the frequency of repeated binding. The proximity of this state to the open state (unlike that of the unbound state or the desensitized state D1; see Figure 1B) results in prolongation of mEPSCs and greater synaptic efficacy (i.e., a higher probability of opening of AMPA channels). Higher occupancy of the single bound state raises the occupancy of the double-bound state (DB; see Figure 1B) and thus enhances the likelihood of repeated opening of AMPA channels.

Figure 3: Although the decay of intrinsic desensitization is slower than the decay of deactivation, the desensitization of AMPA receptors during an mEPSC is very pronounced and depends strongly on how many molecules of glutamate are released (Monte Carlo simulations). (A) The decay time due to intrinsic desensitization (defined as a decrease in responsiveness to a constant prolonged pulse of glutamate) is two to three times longer than the decay time due to deactivation, although most AMPA receptors enter desensitized states during an mEPSC. (B) The fraction of AMPA channels in the unbound state (filled and empty circles) is higher in the absence of desensitization. Desensitization of AMPA receptors is very pronounced and depends strongly on the number of molecules released (all desensitized states taken together; filled squares). The occupancy of all states (unbound and desensitized) was estimated 1 ms after instantaneous release (from Glavinović & Rabie, 1998).

Figure 4: In the absence of desensitization, both the fraction of bound receptors and the late, but not the early, phase of the cleft concentration of glutamate are higher, regardless of how many glutamate molecules are released (Monte Carlo simulations). (A, C) The fraction of singly (circles) and doubly (triangles) bound AMPA receptors (A) and the synaptic cleft concentration (C) as a function of the number of glutamate molecules released, in the presence and absence of desensitization. The occupancies of all receptor states and the concentration of glutamate were estimated 1 ms after instantaneous release (from Glavinović & Rabie, 1998). (B) The fast time constant of the cleft concentration decay is, however, independent of the number of molecules released, the presence of AMPA receptors, and their ability to desensitize. Smooth lines are the best multiexponential fits to the concentration traces.

Simulations also reveal that during and following an mEPSC, the occupancies of all states, including the desensitized states, depend on the number of molecules released (i.e., they are concentration dependent; see Figures 3 and 4). This is not surprising, given that three out of seven kinetic rates are concentration dependent (see Figure 1B). The postulated concentration dependence of desensitization is not in contradiction, either experimentally (Jonas et al., 1993) or theoretically (Glavinović & Rabie, 1998), with the well-documented concentration independence of desensitization in excised patches (intrinsic desensitization; see Figure 3A). Desensitization that develops during a single mEPSC is concentration dependent mainly because it abolishes glutamate buffering, a concentration-dependent process that is not present during desensitization in excised patches. Finally, our simulations remove one of the main objections to the idea that desensitization plays an important role in shaping the time course of mEPSCs: the supposed requirement that the time course of glutamate concentration in the synaptic cleft be very slow (Edmonds et al., 1995). The decay of glutamate concentration is actually predicted to be faster in the presence of desensitization of AMPA channels, because of the reduced buffering of glutamate by AMPA receptors (see Figures 4B and 5). A word of caution is, however, necessary. The comparative importance of desensitization as an inactivation pathway, as opposed to its role in diminishing glutamate buffering, may vary from synapse to synapse and may depend on external conditions. Desensitization is more likely to act as an inactivation pathway if the synaptic cleft is wide and the synaptic contact area small.
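The buffering argument can be caricatured with a toy calculation (my construction, not taken from the simulations): treat each unbinding-rebinding cycle as a Bernoulli trial, so the number of bindings per molecule is geometrically distributed, and let desensitization remove the receptor as a rebinding target with some probability per binding. The probabilities used below are illustrative.

```python
def mean_bindings(p_rebind, p_desensitize=0.0):
    """Expected number of times one glutamate molecule binds, in a toy
    geometric model: after each binding the molecule rebinds with
    probability p_rebind, unless the receptor has meanwhile desensitized
    (probability p_desensitize per binding), which removes it as a target."""
    p_again = p_rebind * (1.0 - p_desensitize)
    return 1.0 / (1.0 - p_again)

# Without desensitization, rebinding ("buffering") multiplies the number
# of binding events; with it, most molecules bind only once or twice.
no_des = mean_bindings(0.8)          # 1 / (1 - 0.8)        = 5.0
with_des = mean_bindings(0.8, 0.75)  # 1 / (1 - 0.8 * 0.25) = 1.25
```

Because p_rebind rises with the number of molecules released (more receptors remain occupied and activatable), the toy model also hints at why buffering, and hence the mEPSC decay, becomes concentration dependent when desensitization is removed.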
The kinetics of gating of AMPA channels and their density are also significant factors. The theoretical simulations (Wahl et al., 1996; Glavinović & Rabie, 1998; Glavinović, 1999) assumed that all rates have the same Q10, but that is not necessarily the case. If the association rates have a lower Q10 (Jones, Sahara, Dzubay, & Westbrook, 1998), desensitization would have a greater importance as an inactivation pathway at higher temperatures than we evaluated. In the simulations, all glutamate molecules that reached the edge of the synapse diffused freely into the surrounding infinite space (Glavinović & Rabie, 1998; Glavinović, 1999). If, as seems likely, there are cellular barriers to such free diffusion (e.g., neuroglia), they would slow the clearance of glutamate from the synaptic cleft and enhance the role of desensitization in diminishing glutamate buffering by AMPA receptors, bearing in mind that significant removal of glutamate by cellular uptake probably accelerates the clearance of glutamate (see below); this needs to be included explicitly in further simulations. In any case, the experimental findings strongly support the idea that repeated binding is important in shaping the time course of mEPSCs at low (22–27°C) and especially at high (32°C) temperatures (Atassi & Glavinović, 1999).

Figure 5: The diagram compares the time course of glutamate concentration in the synaptic cleft and the unitary excitatory postsynaptic current (below), in the presence and absence of desensitization. (Top) Glutamate concentration appears as a large but short pulse that decays almost instantaneously and is followed by a small, prolonged tail. According to our model, the effect of glutamate is not only more evanescent in the presence of AMPA receptor desensitization, but its concentration in the synaptic cleft also decays faster. Both effects are closely linked to a higher occupancy of the single bound receptor state, which produces the greater and more prolonged postsynaptic effect (downward arrows) but also elevates its concentration in the synaptic cleft, owing to greater buffering by the AMPA receptors (curved upward arrows).

6 Fast-Application Protocol on Excised Membrane Patch as Surrogate Synapse

Since it is difficult to examine the steps that are rate limiting at an intact synapse, fast glutamate applications to excised outside-out membrane patches are often used as a surrogate synapse (Colquhoun et al., 1992; Raman & Trussell, 1992; Hestrin, 1992, 1993; Trussell et al., 1993; Livsey et al., 1993; Jonas et al., 1994). This protocol is attractive because it permits the investigator to control not only postsynaptic but also presynaptic (i.e., release) parameters. One can thus systematically examine how the glutamate concentration and the pulse length influence the time course of the postsynaptic currents. For at least three reasons, however, the surrogate synapse often gives a misleading picture of synaptic events (Glavinović & Rabie, 1998). First, the time course of transmitter in the synaptic cleft is never a square pulse; it has a more complicated shape, and the concentration is spatially nonuniform (Ogston, 1955; Holmes, 1995; Wahl et al., 1996; Kleinle et al., 1996; Glavinović & Rabie, 1998; Glavinović, 1999). Second, in such a system, the binding and unbinding of transmitter from the receptors do not affect the concentration of transmitter that activates the responses. In the synaptic cleft, accumulations (and depletions) of transmitter can occur because of such binding and unbinding, and they can exert a profound influence on the time course of the postsynaptic current (Mennerick & Zorumski, 1994, 1995; Glavinović & Rabie, 1998). They are likely to be especially relevant if the occupancy of the AMPA receptors in the single bound state is more than negligible. Third, the rate of onset of intrinsic desensitization differs from the rate of entry into the desensitized states. Unlike the time course of intrinsic desensitization, the occupancy of the desensitized states during a synaptic current is clearly concentration dependent.

7 Pharmacological Inhibition of Desensitization of AMPA Receptor Channels

A variety of agents (such as aniracetam and cyclothiazide) are known to inhibit the desensitization of AMPA receptor channels (Ito, Tanabe, Kohda, & Sugiyama, 1990; Yamada & Tang, 1993).
They produce a slower decay of spontaneous and evoked postsynaptic potentials (Larson, Le, Hall, & Lynch, 1994) and currents (Isaacson & Nicoll, 1991; Tang et al., 1991; Trussell et al., 1993; Partin et al., 1993; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999), which is often taken as an argument that desensitization plays an important role in shaping the time course of postsynaptic currents. This interpretation is not universally accepted, on the grounds that these agents also slow the kinetics of deactivation (Jonas & Spruston, 1994; Edmonds et al., 1995). Such criticisms may not be valid, because deactivation should be slower when desensitization is suppressed even though there is no slowing of the rate of closing of AMPA channels (Glavinović & Rabie, 1998). A strong correlation between the time constants of deactivation and of desensitization, observed in a variety of central synapses (Partin, Fleck, & Mayer, 1996), suggests that deactivation is slower largely (and probably entirely) owing to reduced desensitization.

8 Time Course of Glutamate Concentration in the Synaptic Cleft

Albeit of considerable interest, the time course of glutamate concentration is not known with precision. Given that diffusion is fundamentally a multiexponential process (Carslaw & Jaeger, 1959), the time course of transmitter concentration in the synaptic cleft should decay multiexponentially (Ogston, 1955; Eccles & Jaeger, 1958; Wathey, Nass, & Lester, 1979; Silver et al., 1996; Clements, 1996). At least two exponentials are expected even when considering a two-dimensional extracellular space (Destexhe & Sejnowski, 1995). The time constant of the fast initial concentration decay (t_f) is expected to be governed by lateral diffusion and determined primarily by the diffusion coefficient (D) of the transmitter and by the cleft radius (r): t_f = r^2/4D (Bartol & Sejnowski, 1993; Destexhe & Sejnowski, 1995; Clements, 1996; Silver et al., 1996). This has been confirmed by recent Monte Carlo simulations (Bartol et al., 1991; Wahl et al., 1996; Glavinović & Rabie, 1998). Fitting four exponentials to the time course of the concentration of cleft glutamate, simulated assuming an instantaneous point source of glutamate (see Figure 4B), yields a very fast initial time constant (2–6 μs) that is independent of the number of glutamate molecules released (300–10,000), of the presence or absence of 196 receptors on the 200 × 200 nm postsynaptic membrane, and of their ability to desensitize. Though this value is clearly smaller than values reported before (50–200 μs; Eccles & Jaeger, 1958; Burger et al., 1989; Faber, Young, Legendre, & Korn, 1992; Bartol & Sejnowski, 1993), the agreement between them is good when the differences in cleft radius are taken into account (0.25–0.5 μm as opposed to 0.1 μm). The cleft diameters of excitatory synaptic terminals on CA1 pyramidal neurons are variable, ranging from 0.1 to 1.0 μm (Palay & Chan-Palay, 1974; Bekkers, Richerson, & Stevens, 1990). Within 50 μs, the transmitter concentration is essentially spatially uniform throughout the synapse (Wahl et al., 1996; Clements, 1996; Glavinović, 1999).
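The relation t_f = r^2/4D can be checked directly. Assuming free aqueous diffusion of glutamate (D of roughly 7.6 × 10^-10 m^2/s, an illustrative value; the helper name is mine), this reproduces both the 2–6 μs range for a 0.1 μm cleft radius and the 50–200 μs range for the larger radii assumed in the earlier studies:

```python
def fast_time_constant(radius_m, diff_coeff=7.6e-10):
    """Fast initial decay time constant of cleft transmitter concentration,
    t_f = r**2 / (4 * D), set by lateral diffusion out of the cleft.
    diff_coeff is an assumed free-solution value in m^2/s."""
    return radius_m ** 2 / (4.0 * diff_coeff)

tf_small = fast_time_constant(0.1e-6)  # r = 0.1 um -> about 3.3 us
tf_large = fast_time_constant(0.5e-6)  # r = 0.5 um -> about 82 us
```

The quadratic dependence on r is why a modest difference in assumed cleft radius reconciles the apparently discrepant estimates.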
The amplitudes and time constants of the slower components, however, depend not only on the diffusion constant and the cleft geometry but also on the receptor density and kinetics (Wahl et al., 1996; Glavinović & Rabie, 1998; see Figure 4B). The longest time constant ranges from 30 μs to 1.2 ms, values comparable to earlier estimates for the slow component (Clements, 1996). Its amplitude is 1.5 to 15.0% of that of the fastest component. The effective diffusion constant of glutamate in the synaptic cleft influences all components of the time course of glutamate concentration: if the rate of diffusion is lower than in simple aqueous solutions (Longsworth, 1953), the glutamate concentration will change more slowly (Vogt, Luscher, & Streit, 1995; Kleinle et al., 1996). In contrast to the neuromuscular junction, a high-turnover enzyme for transmitter degradation is absent from the synaptic cleft of glutamatergic synapses. This would tend to make the time course of glutamate in the cleft relatively long. However, a significant compensating factor is the presence of transmitter-selective transporters (not found at cholinergic synapses), which affect the time course of glutamate (Tong & Jahr, 1994b; Takahashi, Sarantis, & Attwell, 1996), especially if, as evidence suggests, the transporters are at a high density, surpassing that of AMPA receptor channels by
more than an order of magnitude (Arriza et al., 1994; Otis, Kavanaugh, & Jahr, 1997; Diamond & Jahr, 1997). The rate of binding of glutamate to the transporters appears to be at least as rapid as binding to the AMPA receptors; the rate of glutamate unbinding is low, and the turnover rate is very low. Therefore, on the timescale of the fast excitatory postsynaptic current, the binding of glutamate to the transporters is essentially irreversible, and it will speed up the clearance of glutamate from the cleft. The time course of release of the vesicular content will also influence the time course of glutamate concentration in the synaptic cleft. The short rise time of well-clamped, fast, spontaneous EPSCs (Stern et al., 1992; Trussell et al., 1993; Jonas et al., 1993; Spruston, Jonas, & Sakmann, 1995) would argue for a rapid onset of the release of the vesicular content and a rapid subsequent rise of glutamate concentration in the synaptic cleft. However, the time course of release of vesicular content appears to be highly variable, judging by the marked variability in the rate of rise of mEPSCs, even when all are equally well clamped (Bekkers & Stevens, 1996; Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999). This is not surprising, because the intravesicular concentrations of glutamate and the vesicular sizes differ considerably. The estimates of the intravesicular glutamate concentration range from 60 to 210 mM (Burger et al., 1989; Riveros, Fiedler, Lagos, Muñoz, & Orrego, 1986; Shupliakov, Brodin, Cullheim, Ottersen, & Storm-Mathisen, 1992) and those of the vesicle diameters from 25 to 45 nm (Palay & Chan-Palay, 1974; Bekkers et al., 1990). Assuming identical fusion pore geometry and time course of opening, larger vesicles are expected to release their contents more slowly than smaller vesicles (Glavinović, 1999).
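The cited concentration and diameter ranges bracket the number of glutamate molecules per vesicle, which can be computed directly (a back-of-the-envelope sketch; the helper name is mine):

```python
import math

AVOGADRO = 6.022e23  # molecules per mole

def molecules_in_vesicle(diameter_nm, conc_mM):
    """Number of transmitter molecules in a spherical vesicle of the given
    diameter (nm), filled at the given intravesicular concentration (mM)."""
    radius_m = diameter_nm * 1e-9 / 2.0
    volume_L = (4.0 / 3.0) * math.pi * radius_m ** 3 * 1000.0  # m^3 -> liters
    return conc_mM * 1e-3 * volume_L * AVOGADRO

low = molecules_in_vesicle(25, 60)    # smallest vesicle, lowest concentration
high = molecules_in_vesicle(45, 210)  # largest vesicle, highest concentration
```

The resulting span (roughly 300 to 6000 molecules) sits comfortably inside the 150–20,000 range explored in the simulations of section 4.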
The amplitude dependence of rise times of mEPSCs recorded in the presence, and especially in the absence, of desensitization (Ghamari-Langroudi & Glavinović, 1998; Atassi & Glavinović, 1999) supports such a conclusion. Any variability of the geometry of the fusion pore and of its time course of opening would provide another important modulating influence on the rate of release of vesicular content. Finally, according to several recent reports on neuroendocrine secretory cells, agents are not released by simple diffusion but first have to dissociate from the gel matrix in which they are stored at high concentration (Helle, 1990; Yoo & Lewis, 1993; Schroeder, Jankowski, Senyshyn, Holz, & Wightman, 1994; Walker, Glavinović, & Trifaro, 1996). By slowing the release process, a similar mechanism of storing glutamate in vesicles would provide an additional variable in the process of release of vesicular content.

9 Conclusion

According to the revised hypothesis presented in this review, desensitization of AMPA receptors plays a major role in shaping the time course of fast excitatory postsynaptic currents. This hypothesis differs from the classical desensitization hypothesis in several important respects and proposes that: (1) the glutamate concentration decays faster in the presence of desensitization of AMPA channels (see Figure 5), (2) desensitization shapes the time course of mEPSCs to a large extent through its effect on the buffering of glutamate molecules by AMPA receptors, (3) the occupancy of the single bound activatable state is an important determinant of both the level of glutamate buffering and the efficacy of glutamate in opening AMPA channels, (4) during an mEPSC, the occupancy of all desensitized states (taken individually or all together) and of all other states, as well as the extent of glutamate buffering, are concentration dependent, and (5) desensitization is concentration dependent largely because it abolishes the glutamate buffering (a concentration-dependent process). Thus, the revised hypothesis accounts for numerous findings on fast mEPSCs. It is expected to be equally fruitful as a framework for further experimental and theoretical investigations.

Acknowledgments

K. Krnjevic read the manuscript and made valuable comments. This work was supported by an Operating Grant of the Medical Research Council of Canada.

References

Arriza, J. L., Fairman, W. A., Wadiche, J. L., Murdoch, G. H., Kavanaugh, M. P., & Amara, S. G. (1994). Functional comparisons of three glutamate transporter subtypes cloned from human motor cortex. Journal of Neuroscience, 14, 5559–5569.
Atassi, B., & Glavinović, M. I. (1999). Effect of cyclothiazide on spontaneous miniature excitatory postsynaptic currents in rat hippocampal pyramidal cells. Pflügers Archiv, 437, 471–478.
Barbour, B., Keller, B. U., Llano, I., & Marty, A. (1994). Prolonged presence of glutamate during excitatory synaptic transmission to cerebellar Purkinje cells. Neuron, 12, 1331–1343.
Bartol, T. M., Jr., Land, B. R., Salpeter, E. E., & Salpeter, M. M. (1991). Monte Carlo simulation of miniature endplate current generation in the vertebrate neuromuscular junction. Biophysical Journal, 59, 1290–1307.
Bartol, T. M., Jr., & Sejnowski, T. J. (1993).
Model of the quantal activation of NMDA receptors at a hippocampal synaptic spine. Society for Neuroscience Abstracts, 19, 1515.
Bekkers, J. M., Richerson, G. B., & Stevens, C. F. (1990). Origin of variability in quantal size in cultured hippocampal neurons and hippocampal slices. Proceedings of the National Academy of Sciences U.S.A., 87, 5359–5362.
Bekkers, J. M., & Stevens, C. F. (1996). Cable properties of cultured hippocampal neurons determined from sucrose-evoked miniature EPSCs. Journal of Neurophysiology, 75, 1250–1255.
Burger, P. M., Mehl, E., Cameron, P. L., Maycox, P. R., Baumert, M., Lottspeich, F., de Camilli, P., & Jahn, R. (1989). Synaptic vesicles immunoisolated from rat cerebral cortex contain high levels of glutamate. Neuron, 3, 715–720.
Carslaw, H. S., & Jaeger, J. C. (1959). Conduction of heat in solids (2nd ed.). Oxford: Clarendon Press.
Clements, J. D. (1996). Transmitter time course in the synaptic cleft: Its role in central synaptic function. Trends in Neurosciences, 19, 163–171.
Clements, J. D., Lester, R. A. J., Tong, G., Jahr, C. E., & Westbrook, G. L. (1992). The time course of glutamate in the synaptic cleft. Science, 258, 1498–1501.
Colquhoun, D., Jonas, P., & Sakmann, B. (1992). Action of brief pulses of glutamate on AMPA/kainate receptors in patches from different neurones of rat hippocampal slices. Journal of Physiology London, 458, 261–287.
Destexhe, A., & Sejnowski, T. J. (1995). G protein activation kinetics and spillover of γ-aminobutyric acid may account for differences between inhibitory responses in the hippocampus and thalamus. Proceedings of the National Academy of Sciences U.S.A., 92, 9515–9519.
Diamond, J. S., & Jahr, C. E. (1997). Transporters buffer synaptically released glutamate on a submillisecond time scale. Journal of Neuroscience, 17, 4672–4687.
Eccles, J. C., & Jaeger, J. C. (1958). The relationship between the mode of operation and the dimensions of the junctional regions at synapses and motor end-organs. Proceedings of the Royal Society B, 148, 38–56.
Edmonds, B., Gibb, A. J., & Colquhoun, D. (1995). Mechanisms of activation of glutamate receptors and the time course of excitatory synaptic currents. Annual Review of Physiology, 57, 495–519.
Faber, D. S., Young, W. S., Legendre, P., & Korn, H. (1992). Intrinsic quantal variability due to stochastic properties of receptor-transmitter interactions. Science, 258, 1494–1498.
Forti, L., Bossi, M., Bergamaschi, A., Villa, A., & Malgaroli, A. (1997). Loose-patch recordings of single quanta at individual hippocampal synapses. Nature, 388, 874–878.
Ghamari-Langroudi, M., & Glavinović, M. I. (1998). Changes of spontaneous miniature excitatory postsynaptic currents in rat hippocampal pyramidal cells induced by aniracetam.
Pflügers Archiv, 435, 185–192.
Glavinović, M. I. (1999). Monte Carlo simulation of vesicular release, spatiotemporal distribution of glutamate in synaptic cleft, and generation of postsynaptic currents. Pflügers Archiv, 437, 462–470.
Glavinović, M. I., & Rabie, H. R. (1998). Monte Carlo simulation of spontaneous miniature excitatory postsynaptic currents in rat hippocampal synapse in the presence and in the absence of desensitization. Pflügers Archiv, 435, 193–202.
Hausser, M., & Roth, A. (1997). Dendritic and somatic glutamate receptor channels in rat cerebellar Purkinje cells. Journal of Physiology London, 501, 77–95.
Helle, K. B. (1990). Chromogranins: Universal proteins in secretory organelles from Paramecium to man. Neurochemistry International, 17, 165–175.
Hestrin, S. (1992). Activation and desensitization of glutamate-activated channels mediating fast excitatory synaptic currents in the visual cortex. Neuron, 9, 991–999.
Hestrin, S. (1993). Different glutamate receptor channels mediate fast excitatory synaptic currents in inhibitory and excitatory cortical neurons. Neuron, 11, 1083–1091.
Hestrin, S., Sah, P., & Nicoll, R. A. (1990). Mechanisms generating the time course of dual component excitatory synaptic currents recorded in hippocampal slices. Neuron, 5, 247–253.
Holmes, W. R. (1995). Modeling the effect of glutamate diffusion and uptake on NMDA and non-NMDA receptor saturation. Biophysical Journal, 69, 1734–1747.
Isaacson, J. S., & Nicoll, R. A. (1991). Aniracetam reduces glutamate receptor desensitization and slows the decay of fast excitatory synaptic currents in the hippocampus. Proceedings of the National Academy of Sciences U.S.A., 88, 10936–10940.
Isaacson, J. S., & Walmsley, B. (1996). Amplitude and time course of spontaneous and evoked excitatory postsynaptic currents in bushy cells of the anteroventral cochlear nucleus. Journal of Neurophysiology, 76, 1566–1571.
Ito, I., Tanabe, S., Kohda, A., & Sugiyama, H. (1990). Allosteric potentiation of quisqualate receptors by a nootropic drug aniracetam. Journal of Physiology London, 424, 533–544.
Jonas, P., Major, G., & Sakmann, B. (1993). Quantal components of unitary EPSCs at the mossy fibre synapse on CA3 pyramidal cells of rat hippocampus. Journal of Physiology London, 472, 615–663.
Jonas, P., Racca, C., Sakmann, B., Seeburg, P. H., & Monyer, H. (1994). Differences in Ca2+ permeability of AMPA-type glutamate receptor channels in neocortical neurons caused by differential GluR-B subunit expression. Neuron, 12, 1281–1289.
Jonas, P., & Spruston, N. (1994). Mechanisms shaping glutamate-mediated excitatory postsynaptic currents in the CNS. Current Opinion in Neurobiology, 4, 366–372.
Jones, M. V., Sahara, Y., Dzubay, J. A., & Westbrook, G. L. (1998). Defining affinity with the GABAA receptor. Journal of Neuroscience, 18, 8590–8604.
Kleinle, J., Vogt, K., Luscher, H. R., Senn, W., Wyler, K., & Streit, J. (1996). Transmitter concentration profiles in the synaptic cleft: An analytical model of release and diffusion. Biophysical Journal, 71, 2413–2426.
Kruk, P. J., Horn, H., & Faber, D. S. (1997).
The effects of geometrical parameters on synaptic transmission: A Monte Carlo simulation study. Biophysical Journal, 73, 2874–2890.
Larson, J., Le, T. T., Hall, R. A., & Lynch, G. (1994). Effects of cyclothiazide on synaptic responses in slices of adult and neonatal rat hippocampus. Neuroreport, 5, 389–392.
Liu, G., Choi, S., & Tsien, R. W. (1999). Variability of neurotransmitter concentration and nonsaturation of postsynaptic AMPA receptors at synapses in hippocampal cultures and slices. Neuron, 22, 395–409.
Livsey, C. T., Costa, E., & Vicini, S. (1993). Glutamate-activated currents in outside-out patches from spiny versus aspiny hilar neurons of rat hippocampal slices. Journal of Neuroscience, 13, 5324–5333.
Longsworth, L. G. (1953). Diffusion measurements at 25°C, of aqueous solutions of amino acids, peptides and sugars. Journal of the American Chemical Society, 75, 5705–5709.
Magleby, K. L., & Stevens, C. F. (1972). A quantitative description of end-plate currents. Journal of Physiology London, 223, 173–197.
Mennerick, S., & Zorumski, C. F. (1994). Glial contributions to excitatory neurotransmission in cultured hippocampal cells. Nature, 368, 59–62.
Mennerick, S., & Zorumski, C. F. (1995). Presynaptic influence on the time course of fast excitatory synaptic currents in cultured hippocampal cells. Journal of Neuroscience, 15, 3178–3192.
Ogston, A. G. (1955). Removal of acetylcholine from limited volume by diffusion. Journal of Physiology London, 128, 222–223.
Otis, T. S., Kavanaugh, M. P., & Jahr, C. E. (1997). Postsynaptic glutamate transport at the climbing fiber–Purkinje cell synapse. Science, 277, 1515–1518.
Palay, S. L., & Chan-Palay, V. (1974). Cerebellar cortex: Cytology and organization. Berlin: Springer.
Partin, K. M., Fleck, M. W., & Mayer, M. L. (1996). AMPA receptor flip/flop mutants affecting deactivation, desensitization, and modulation by cyclothiazide, aniracetam, and thiocyanate. Journal of Neuroscience, 16, 6634–6647.
Partin, K. M., Patneau, D. K., Winters, C. A., Mayer, M. L., & Buonanno, A. (1993). Selective modulation of desensitization at AMPA versus kainate receptors by cyclothiazide and concanavalin A. Neuron, 11, 1069–1082.
Raman, I. M., & Trussell, L. O. (1992). The kinetics of the response to glutamate and kainate in neurons of the avian cochlear nucleus. Neuron, 9, 173–186.
Raman, I. M., & Trussell, L. O. (1995). The mechanism of α-amino-3-hydroxy-5-methyl-4-isoxazolepropionate receptor desensitization after removal of glutamate. Biophysical Journal, 68, 137–146.
Riveros, N., Fiedler, J., Lagos, N., Muñoz, C., & Orrego, F. (1986). Glutamate in rat brain cortex synaptic vesicles: Influence of the vesicle isolation procedure. Brain Research, 386, 405–408.
Rosenmund, C., Stern-Bach, Y., & Stevens, C. F. (1998). The tetrameric structure of a glutamate receptor channel. Science, 280, 1596–1599.
Schroeder, T.
J., Jankowski, J. A., Senyshyn, J., Holz, R. W., & Wightman, R. M. (1994). Zones of exocytotic release on bovine adrenal medullary cells in culture. Journal of Biological Chemistry, 269, 17215–17220.
Shupliakov, O., Brodin, L., Cullheim, S., Ottersen, O. P., & Storm-Mathisen, J. (1992). Immunogold quantification of glutamate in two types of excitatory synapse with different firing patterns. Journal of Neuroscience, 12, 3789–3803.
Silver, R. A., Colquhoun, D., Cull-Candy, S. G., & Edmonds, B. (1996). Deactivation and desensitization of non-NMDA receptors in patches and the time course of EPSCs in rat cerebellar granule cells. Journal of Physiology London, 493, 167–173.
Silver, R. A., Traynelis, S. F., & Cull-Candy, S. G. (1992). Rapid-time-course miniature and evoked excitatory currents at cerebellar synapses in situ. Nature, 355, 163–166.
Spruston, N., Jonas, P., & Sakmann, B. (1995). Dendritic glutamate receptor channels in rat hippocampal CA3 and CA1 pyramidal neurons. Journal of Physiology London, 482, 325–352.
Stern, P., Edwards, F. A., & Sakmann, B. (1992). Fast and slow components of unitary EPSCs on stellate cells elicited by focal stimulation in slices of rat visual cortex. Journal of Physiology London, 449, 247–278.
Stern-Bach, Y., Russo, S., Neuman, M., & Rosenmund, C. (1998). A point mutation in the glutamate binding site blocks desensitization of AMPA receptors. Neuron, 21, 907–918.
Takahashi, M., Sarantis, M., & Attwell, D. (1996). Postsynaptic glutamate uptake in rat cerebellar Purkinje cells. Journal of Physiology London, 497, 523–530.
Tang, C.-M., Shi, Q.-Y., Katchman, A., & Lynch, G. (1991). Modulation of the time course of fast EPSCs and glutamate channel kinetics by aniracetam. Science, 254, 288–290.
Tong, G., & Jahr, C. E. (1994a). Multivesicular release from excitatory synapses of cultured hippocampal neurons. Neuron, 12, 51–59.
Tong, G., & Jahr, C. E. (1994b). Block of glutamate transporters potentiates postsynaptic excitation. Neuron, 13, 1195–1203.
Trussell, L. O., & Fischbach, G. D. (1989). Glutamate receptor desensitization and its role in synaptic transmission. Neuron, 3, 209–218.
Trussell, L. O., Thio, L. L., Zorumski, C. F., & Fischbach, G. D. (1988). Rapid desensitization of glutamate receptors in vertebrate central neurons. Proceedings of the National Academy of Sciences U.S.A., 85, 2834–2838.
Trussell, L. O., Zhang, S., & Raman, I. M. (1993). Desensitization of AMPA receptors upon multiquantal neurotransmitter release. Neuron, 10, 1185–1196.
Vogt, K., Luscher, H. R., & Streit, J. (1995). Analysis of synaptic transmission at single identified boutons on rat spinal neurons in culture. Pflügers Archiv, 430, 1022–1028.
Vyklicky, L., Patneau, D. K., & Mayer, M. L. (1991). Modulation of excitatory synaptic transmission by drugs that reduce desensitization at AMPA/kainate receptors. Neuron, 7, 971–984.
Wahl, L. M., Pouzat, C., & Stratford, K. J. (1996). Monte Carlo simulation of fast excitatory synaptic transmission at a hippocampal synapse.
Journal of Neurophysiology, 75, 597–608.
Wall, M. J., & Usowicz, M. M. (1998). Development of the quantal properties of evoked and spontaneous synaptic currents at a brain synapse. Nature Neuroscience, 1, 675–682.
Walker, A., Glavinović, M. I., & Trifaro, J.-M. (1996). Time course of release of content of single vesicles in bovine chromaffin cells. Pflügers Archiv, 431, 729–735.
Wathey, J. C., Nass, M. M., & Lester, H. A. (1979). Numerical reconstruction of the quantal event at nicotinic synapses. Biophysical Journal, 27, 145–164.
Yamada, K. A., & Tang, C.-M. (1993). Benzothiadiazides inhibit rapid glutamate receptor desensitization and enhance glutamatergic synaptic currents. Journal of Neuroscience, 13, 3904–3915.
Yoo, S. H., & Lewis, M. S. (1993). Dimerization and tetramerization properties of the C-terminal chromogranin A: A thermodynamic analysis. Biochemistry, 32, 8816–8219.

Received December 12, 2000; accepted March 27, 2001.
NOTE
Communicated by Leo Breiman
Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure

Marco Saerens
[email protected]
IRIDIA Laboratory, cp 194/6, Université Libre de Bruxelles, B-1050 Brussels, Belgium, and SmalS-MvM, Research Section, Brussels, Belgium

Patrice Latinne

[email protected]
IRIDIA Laboratory, cp 194/6, Université Libre de Bruxelles, B-1050 Brussels, Belgium

Christine Decaestecker

[email protected]
Laboratory of Histopathology, cp 620, Université Libre de Bruxelles, B-1070 Brussels, Belgium
It sometimes happens (for instance, in case control studies) that a classifier is trained on a data set that does not reflect the true a priori probabilities of the target classes on real-world data. This may have a negative effect on the classification accuracy obtained on the real-world data set, especially when the classifier's decisions are based on the a posteriori probabilities of class membership. Indeed, in this case, the trained classifier provides estimates of the a posteriori probabilities that are not valid for this real-world data set (they rely on the a priori probabilities of the training set). Applying the classifier as is (without correcting its outputs with respect to these new conditions) on this new data set may thus be suboptimal. In this note, we present a simple iterative procedure for adjusting the outputs of the trained classifier with respect to these new a priori probabilities, without having to refit the model, even when these probabilities are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. This iterative algorithm is a straightforward instance of the expectation-maximization (EM) algorithm and is shown to maximize the likelihood of the new data. Thereafter, we discuss a statistical test that can be applied to decide if the a priori class probabilities have changed from the training set to the real-world data. The procedure is illustrated on different classification problems involving a multilayer neural network, and comparisons with a standard procedure for a priori probability estimation are provided. Our original method, based on the EM algorithm, is shown to be superior to the standard one for a priori probability estimation. Experimental results also indicate that the classifier with adjusted outputs always performs better than the original one in terms of classification accuracy, when the a priori probability conditions differ from the training set to the real-world data. The gain in classification accuracy can be significant.

Neural Computation 14, 21–41 (2001)
© 2001 Massachusetts Institute of Technology

1 Introduction

In supervised classification tasks, sometimes the a priori probabilities of the classes from a training set do not reflect the "true" a priori probabilities of real-world data, on which the trained classifier has to be applied. For instance, this happens when the sample used for training is stratified by the value of the discrete response variable (i.e., the class membership). Consider, for example, an experimental setting—a case control study—where we select 50% of individuals suffering from a disease (the cases) and 50% of individuals who do not suffer from this disease (the controls), and suppose that we make a set of measurements on these individuals. The resulting observations are used to train a model that classifies the data into the two target classes: disease and no disease. In this case, the a priori probabilities of the two classes in the training set are 0.5 each. Once we apply the trained model in a real-world situation (new cases), we have no idea of the true a priori probability of disease (also labeled "disease prevalence" in biostatistics). It has to be estimated from the new data. Moreover, the outputs of the model have to be adjusted accordingly. In other words, the classification model is trained on a data set with a priori probabilities that differ from the real-world conditions. In this situation, knowledge of the "true" a priori probabilities of the real-world data would be an asset for the following important reasons: Optimal Bayesian decision making is based on the a posteriori probabilities of the classes conditional on the observation (we have to select the class label that has maximum estimated a posteriori probability). Now, following Bayes' rule, these a posteriori probabilities depend in a nonlinear way on the a priori probabilities.
Therefore, a change of the a priori probabilities (as is the case for the real-world data versus the training set) may have an important impact on the a posteriori probabilities of membership, which themselves affect the classification rate. In other words, even if we use an optimal Bayesian model, if the a priori probabilities of the classes change, the model will no longer be optimal in these new conditions. But knowing the new a priori probabilities of the classes would allow us to correct (by Bayes' rule) the output of the model in order to recover the optimal decision. Many classification methods, including neural network classifiers, provide estimates of the a posteriori probabilities of the classes. From the previous point, this means that applying such a classifier as is on new data having different a priori probabilities from the training set can result in a loss of classification accuracy, in comparison with an equivalent classifier that relies on the "true" a priori probabilities of the new data set. This is the primary motivation of this article: to introduce a procedure allowing the correction of the estimated a posteriori probabilities, that is, the classifier's outputs, in accordance with the new a priori probabilities of the real-world data, in order to make more accurate predictions, even if these a priori probabilities of the new data set are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. The experimental section, section 4, will confirm that a significant increase in classification accuracy can be obtained when correcting the outputs of the classifier with respect to new a priori probability conditions. For the sake of completeness, notice also that there exists another approach, the min-max criterion, which avoids the estimation of the a priori probabilities on the new data. Basically, the min-max criterion says that one should use the Bayes decision rule that corresponds to the least favorable a priori probability distribution (see, e.g., Melsa & Cohn, 1978, or Hand, 1981). In brief, we present a simple iterative procedure that estimates the new a priori probabilities of a new data set and adjusts the outputs of the classifier, which is supposed to approximate the a posteriori probabilities, without having to refit the model (section 2). This algorithm is a simple instance of the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977; McLachlan & Krishnan, 1997), which aims to maximize the likelihood of the new observed data. We also discuss a simple statistical test (a likelihood ratio test) that can be applied in order to decide if the a priori probabilities have changed from the training set to the new data set (section 3).
We illustrate the procedure on artificial and real classification tasks and analyze its robustness with respect to imperfect estimation of the a posteriori probabilities provided by the classifier (section 4). Comparisons with a standard procedure used for a priori probability estimation (also in section 4) and a discussion of related work (section 5) are also provided.

2 Correcting a Posteriori Probability Estimates with Respect to New a Priori Probabilities

2.1 Data Classification. One of the most common uses of data is classification. Suppose that we want to forecast the unknown discrete value of a dependent (or response) variable \omega based on a measurement vector—or observation vector—x. This discrete dependent variable takes its value in \Omega = (\omega_1, . . . , \omega_n), the n class labels. A training example is therefore a realization of a random feature vector, x, measured on an individual and allocated to one of the n classes \omega_i \in \Omega. A training set is a collection of such training examples recorded for
the purpose of model building (training) and forecasting based on that model. The a priori probability of belonging to class \omega_i in the training set will be denoted p_t(\omega_i) (in the sequel, subscript t will be used for estimates carried out on the basis of the training set). In the case control example, p_t(\omega_1) = p_t(disease) = 0.5 and p_t(\omega_2) = p_t(no disease) = 0.5. For the purpose of training, we suppose that, for each class \omega_i, observations on N_{ti} individuals belonging to the class (with \sum_{i=1}^{n} N_{ti} = N_t, the total number of training examples) have been independently recorded according to the within-class probability density, p(x | \omega_i). Indeed, case control studies involve direct sampling from the within-class probability densities, p(x | \omega_i). In a case control study with two classes (as reported in section 1), this means that we made independent measurements on N_{t1} individuals who contracted the disease (the cases), according to p(x | disease), and on N_{t2} individuals who did not (the controls), according to p(x | no disease). The a priori probabilities of the classes in the training set are therefore estimated by their frequencies, \hat{p}_t(\omega_i) = N_{ti} / N_t. Let us suppose that we trained a classification model (the classifier), and denote by \hat{p}_t(\omega_i | x) the estimated a posteriori probability of belonging to class \omega_i provided by the classifier, given that the feature vector x has been observed, in the conditions of the training set. The classification model (whose parameters are estimated on the basis of the training set, as indicated by subscript t) could be an artificial neural network, a logistic regression, or any other model that provides as output estimates of the a posteriori probabilities of the classes given the observation.
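The frequency estimate \hat{p}_t(\omega_i) = N_{ti} / N_t amounts to simple counting; a minimal sketch for the stratified case control setting described above (the label counts are hypothetical):

```python
from collections import Counter

# Hypothetical stratified training labels: 50 cases, 50 controls
labels = ["disease"] * 50 + ["no disease"] * 50

counts = Counter(labels)        # N_ti for each class
n_total = sum(counts.values())  # N_t, the total number of training examples
train_priors = {c: n / n_total for c, n in counts.items()}
print(train_priors)  # {'disease': 0.5, 'no disease': 0.5}
```

By construction, these frequencies reflect the stratified sampling design rather than the real-world prevalence, which is exactly the mismatch the article sets out to correct.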
This is, for instance, the case if we use the least-squares error or the Kullback-Leibler divergence as a criterion for training and if the minimum of the criterion is reached (see, e.g., Richard & Lippmann, 1991, or Saerens, 2000, for a recent discussion). We therefore suppose that the model has n outputs, g_i(x) (i = 1, . . . , n), providing estimated posterior probabilities of membership \hat{p}_t(\omega_i | x) = g_i(x). In the experimental section (section 4), we will show that even imperfect approximations of these output probabilities allow reasonably good output corrections by the procedure presented below. Let us now suppose that the trained classification model has to be applied to another data set (new cases, e.g., real-world data to be scored) for which the class frequencies, estimating the a priori probabilities p(\omega_i) (no subscript t), are suspected to be different from \hat{p}_t(\omega_i). The a posteriori probabilities provided by the model for these new cases will have to be corrected accordingly. As detailed in the next two sections, two cases must be considered, according to whether estimates of the new a priori probabilities, \hat{p}(\omega_i), are or are not available for this new data set.

2.2 Adjusting the Outputs to New a Priori Probabilities: New a Priori Probabilities Known. In the sequel, we assume that the generation of the observations within the classes, and thus the within-class densities, p(x | \omega_i),
does not change from the training set to the new data set (only the relative proportion of measurements observed from each class has changed). This is a natural requirement; it supposes that we choose the training set examples only on the basis of the class labels \omega_i, not on the basis of x. We also assume that we have an estimate of the new a priori probabilities, \hat{p}(\omega_i). Suppose now that we are working on a new data set to be scored. Bayes' theorem provides

\hat{p}_t(x | \omega_i) = \frac{\hat{p}_t(\omega_i | x)\, \hat{p}_t(x)}{\hat{p}_t(\omega_i)},   (2.1)

where the a posteriori probabilities \hat{p}_t(\omega_i | x) are obtained by applying the trained model as is (subscript t) on some observation x of the new data set (i.e., by scoring the data). These are the estimated a posteriori probabilities in the conditions of the training set (relying on the a priori probabilities of the training set). The corrected a posteriori probabilities, \hat{p}(\omega_i | x) (relying this time on the estimated a priori probabilities of the new data set), obey the same equation, but with \hat{p}(\omega_i) as the new a priori probabilities and \hat{p}(x) as the new probability density function (no subscript t):

\hat{p}(x | \omega_i) = \frac{\hat{p}(\omega_i | x)\, \hat{p}(x)}{\hat{p}(\omega_i)}.   (2.2)

Since the within-class densities \hat{p}(x | \omega_i) do not change from training to real-world data (\hat{p}_t(x | \omega_i) = \hat{p}(x | \omega_i)), by equating equation 2.1 to equation 2.2 and defining f(x) = \hat{p}_t(x) / \hat{p}(x), we find

\hat{p}(\omega_i | x) = f(x)\, \frac{\hat{p}(\omega_i)}{\hat{p}_t(\omega_i)}\, \hat{p}_t(\omega_i | x).   (2.3)

Since \sum_{i=1}^{n} \hat{p}(\omega_i | x) = 1, we easily obtain

f(x) = \left[ \sum_{j=1}^{n} \frac{\hat{p}(\omega_j)}{\hat{p}_t(\omega_j)}\, \hat{p}_t(\omega_j | x) \right]^{-1},

and consequently

\hat{p}(\omega_i | x) = \frac{\dfrac{\hat{p}(\omega_i)}{\hat{p}_t(\omega_i)}\, \hat{p}_t(\omega_i | x)}{\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}(\omega_j)}{\hat{p}_t(\omega_j)}\, \hat{p}_t(\omega_j | x)}.   (2.4)

This well-known formula can be used to compute the corrected a posteriori probabilities, \hat{p}(\omega_i | x), in terms of the outputs provided by the trained
model, g_i(x) = \hat{p}_t(\omega_i | x), and the new priors, \hat{p}(\omega_i). We observe that the new a posteriori probabilities \hat{p}(\omega_i | x) are simply the a posteriori probabilities in the conditions of the training set, \hat{p}_t(\omega_i | x), weighted by the ratio of the new priors to the old priors, \hat{p}(\omega_i) / \hat{p}_t(\omega_i). The denominator of equation 2.4 ensures that the corrected a posteriori probabilities sum to one. However, in many real-world cases, we do not know the real-world a priori probabilities p(\omega_i), since we do not know the class labels for these new data. This is the subject of the next section.

2.3 Adjusting the Outputs to New a Priori Probabilities: New a Priori Probabilities Unknown. When the new a priori probabilities are not known in advance, we cannot use equation 2.4, and the p(\omega_i) probabilities have to be estimated from the new data set. In this section, we first present an already known standard procedure used for new a priori probability estimation (the only one available in the literature, to our knowledge); then we introduce our original method based on the EM algorithm.

2.3.1 Method 1: Confusion Matrix. The standard procedure used for a priori probability estimation is based on the computation of the confusion matrix, \hat{p}(d_i | \omega_j), an estimate of the probability of taking the decision d_i of classifying an observation into class \omega_i while in fact it belongs to class \omega_j (see, e.g., McLachlan, 1992, or McLachlan & Basford, 1988). In the sequel, this method will be referred to as the confusion matrix method. Here is its rationale. First, the confusion matrix \hat{p}_t(d_i | \omega_j) is estimated on the training set from cross-tabulated classification frequencies provided by the classifier. Once this confusion matrix has been computed on the training set, it is used to infer the a priori probabilities on a new data set by solving the following system of n linear equations,

\hat{p}(d_i) = \sum_{j=1}^{n} \hat{p}_t(d_i | \omega_j)\, \hat{p}(\omega_j), \quad i = 1, . . . , n,   (2.5)

with respect to the \hat{p}(\omega_j), where \hat{p}(d_i) is simply the marginal probability of classifying an observation into class \omega_i, estimated by the class label frequency after application of the classifier on the new data set. Once the \hat{p}(\omega_j) are computed from equation 2.5, we use equation 2.4 to infer the new a posteriori probabilities.

2.3.2 Method 2: EM Algorithm. We now present a new procedure for a priori and a posteriori probability adjustment, based on the EM algorithm (Dempster et al., 1977; McLachlan & Krishnan, 1997). This iterative algorithm increases the likelihood of the new data at each iteration until a local maximum is reached. Once again, let us suppose that we record a set of N new independent realizations of the stochastic variable x, X_1^N = (x_1, x_2, . . . , x_N), sampled from
Adjusting a Classifier to New a Priori Probabilities
27
$p(x)$, in a new data set to be scored by the model. The likelihood of these new observations is defined as
\[
L(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} p(x_k)
= \prod_{k=1}^{N} \left[ \sum_{i=1}^{n} p(x_k, v_i) \right]
= \prod_{k=1}^{N} \left[ \sum_{i=1}^{n} p(x_k \mid v_i)\, p(v_i) \right], \tag{2.6}
\]
where the within-class densities (that is, the probabilities of observing $x_k$ given class $v_i$) remain the same ($p(x_k \mid v_i) = p_t(x_k \mid v_i)$), since we assume that only the a priori probabilities change from the training set to the new data set. We have to determine the estimates $\hat{p}(v_i)$ that maximize the likelihood 2.6 with respect to the $p(v_i)$. While a closed-form solution to this problem cannot be found, we can obtain an iterative procedure for estimating the new $p(v_i)$ by applying the EM algorithm. As before, let us define $g_i(x_k)$ as the model's output value corresponding to class $v_i$ for the observation $x_k$ of the new data set to be scored. The model outputs provide an approximation of the a posteriori probabilities of the classes given the observation in the conditions of the training set (subscript $t$), while the a priori probabilities of the training set are estimated by the class frequencies:
\[
\hat{p}_t(v_i \mid x_k) = g_i(x_k), \tag{2.7}
\]
\[
\hat{p}_t(v_i) = \frac{N_{t_i}}{N_t}. \tag{2.8}
\]
Let us define as $\hat{p}^{(s)}(v_i)$ and $\hat{p}^{(s)}(v_i \mid x_k)$ the estimates of the new a priori and a posteriori probabilities at step $s$ of the iterative procedure. If the $\hat{p}^{(s)}(v_i)$ are initialized by the frequencies of the classes in the training set (see equation 2.8), the EM algorithm provides the following iterative steps (see the appendix) for each new observation $x_k$ and each class $v_i$:
\[
\hat{p}^{(0)}(v_i) = \hat{p}_t(v_i),
\qquad
\hat{p}^{(s)}(v_i \mid x_k) =
\frac{\dfrac{\hat{p}^{(s)}(v_i)}{\hat{p}_t(v_i)}\, \hat{p}_t(v_i \mid x_k)}
{\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}^{(s)}(v_j)}{\hat{p}_t(v_j)}\, \hat{p}_t(v_j \mid x_k)},
\qquad
\hat{p}^{(s+1)}(v_i) = \frac{1}{N} \sum_{k=1}^{N} \hat{p}^{(s)}(v_i \mid x_k), \tag{2.9}
\]
28
M. Saerens, P. Latinne, and C. Decaestecker
where $\hat{p}_t(v_i \mid x_k)$ and $\hat{p}_t(v_i)$ are given by equations 2.7 and 2.8. Notice the similarity between equations 2.4 and 2.9. At each iteration step $s$, both the a posteriori probabilities $\hat{p}^{(s)}(v_i \mid x_k)$ and the a priori probabilities $\hat{p}^{(s)}(v_i)$ are reestimated sequentially for each new observation $x_k$ and each class $v_i$. The iterative procedure proceeds until the convergence of the estimated probabilities $\hat{p}^{(s)}(v_i)$. Of course, if we have some a priori knowledge about the values of the prior probabilities, we can take these as starting values for the initialization of the $\hat{p}^{(0)}(v_i)$. Notice also that although we did not encounter this problem in our simulations, we must keep in mind that local maxima may potentially occur (the EM algorithm finds a local maximum of the likelihood function). In order to obtain good a priori probability estimates, it is necessary that the a posteriori probabilities relative to the training set be reasonably well approximated (i.e., sufficiently well estimated by the model). The robustness of the EM procedure with respect to imperfect a posteriori probability estimates will be investigated in the experimental section (section 4).

3 Testing for Different A Priori Probabilities

In this section, we show that the likelihood ratio test can be used to decide whether the a priori probabilities have significantly changed from the training set to the new data set. Before adjusting the a priori probabilities (when the trained classification model is simply applied to the new data), the likelihood of the new observations is
\[
L_t(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} \hat{p}_t(x_k)
= \prod_{k=1}^{N} \left[ \frac{\hat{p}(x_k \mid v_i)\, \hat{p}_t(v_i)}{\hat{p}_t(v_i \mid x_k)} \right], \tag{3.1}
\]
whatever the class label $v_i$, and where we used the fact that $p_t(x_k \mid v_i) = p(x_k \mid v_i)$. After the adjustment of the a priori and a posteriori probabilities, we compute the likelihood in the same way:
\[
L(x_1, x_2, \ldots, x_N) = \prod_{k=1}^{N} \hat{p}(x_k)
= \prod_{k=1}^{N} \left[ \frac{\hat{p}(x_k \mid v_i)\, \hat{p}(v_i)}{\hat{p}(v_i \mid x_k)} \right], \tag{3.2}
\]
so that the likelihood ratio is
\[
\frac{L(x_1, x_2, \ldots, x_N)}{L_t(x_1, x_2, \ldots, x_N)}
= \frac{\displaystyle\prod_{k=1}^{N} \left[ \frac{\hat{p}(x_k \mid v_i)\, \hat{p}(v_i)}{\hat{p}(v_i \mid x_k)} \right]}
       {\displaystyle\prod_{k=1}^{N} \left[ \frac{\hat{p}(x_k \mid v_i)\, \hat{p}_t(v_i)}{\hat{p}_t(v_i \mid x_k)} \right]}
= \frac{\displaystyle\prod_{k=1}^{N} \left[ \frac{\hat{p}(v_i)}{\hat{p}(v_i \mid x_k)} \right]}
       {\displaystyle\prod_{k=1}^{N} \left[ \frac{\hat{p}_t(v_i)}{\hat{p}_t(v_i \mid x_k)} \right]}, \tag{3.3}
\]
and the log-likelihood ratio is
\[
\log\!\left[ \frac{L(x_1, x_2, \ldots, x_N)}{L_t(x_1, x_2, \ldots, x_N)} \right]
= \sum_{k=1}^{N} \log\!\left[ \hat{p}_t(v_i \mid x_k) \right]
- \sum_{k=1}^{N} \log\!\left[ \hat{p}(v_i \mid x_k) \right]
+ N \log\!\left[ \hat{p}(v_i) \right]
- N \log\!\left[ \hat{p}_t(v_i) \right]. \tag{3.4}
\]
From standard statistical inference (see, e.g., Mood, Graybill, & Boes, 1974; Papoulis, 1991), $2 \log [L(x_1, x_2, \ldots, x_N) / L_t(x_1, x_2, \ldots, x_N)]$ is asymptotically distributed as a chi square with $(n - 1)$ degrees of freedom ($\chi^2_{(n-1)}$, where $n$ is the number of classes). Indeed, since $\sum_{i=1}^{n} \hat{p}(v_i) = 1$, there are only $(n - 1)$ degrees of freedom. This allows us to test whether the new a priori probabilities differ significantly from the original ones and thus to decide whether the a posteriori probabilities (i.e., the model outputs) need to be corrected. Notice also that standard errors on the estimated a priori probabilities can be obtained through the computation of the observed information matrix, as detailed in McLachlan and Krishnan (1997).

4 Experimental Evaluation

4.1 Simulations on Artificial Data. We present a simple experiment that illustrates the iterative adjustment of the a priori and a posteriori probabilities. We chose a conventional multilayer perceptron (with one hidden layer and softmax output functions, trained with the Levenberg-Marquardt algorithm) as a classification model, as well as a database labeled Ringnorm, introduced by Breiman (1998).¹ This database consists of 7400 cases described by 20 numerical features and divided into two equidistributed classes (each drawn from a multivariate normal distribution with a different variance-covariance matrix).

¹ Available online at http://www.cs.utoronto.ca/~delve/data/datasets.html.
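The EM adjustment applied in these experiments is exactly the iteration of equation 2.9. The following is a minimal plain-Python sketch (not the authors' code; the function name, list-of-lists representation, and stopping tolerance are our choices):

```python
def em_prior_adjustment(post_t, priors_t, n_iter=100, tol=1e-8):
    """Re-estimate class priors on an unlabeled data set (equation 2.9).

    post_t   -- list of length-n lists: post_t[k][i] approximates the
                training-conditions posterior p_t(v_i | x_k) for observation k.
    priors_t -- list of n training-set priors p_t(v_i).
    Returns (new_priors, adjusted_posteriors).
    """
    n = len(priors_t)
    priors = list(priors_t)                  # p^(0)(v_i) = p_t(v_i)
    post = [row[:] for row in post_t]
    for _ in range(n_iter):
        # E-step: reweight each posterior by the prior ratio, renormalize.
        for k, row in enumerate(post_t):
            w = [row[i] * priors[i] / priors_t[i] for i in range(n)]
            s = sum(w)
            post[k] = [wi / s for wi in w]
        # M-step: the new prior is the mean adjusted posterior.
        new_priors = [sum(post[k][i] for k in range(len(post))) / len(post)
                      for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new_priors, priors)) < tol
        priors = new_priors
        if converged:
            break
    return priors, post
```

Consistent with the paper's observation that a handful of iterations suffices, the fixed point is reached as soon as the estimated priors equal the average of the adjusted posteriors.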
Table 1: Results of the Estimation of Priors on the Test Sets, Averaged over 10 Runs, Ringnorm Artificial Data Set.

True      Estimated Prior by Using       Log-Likelihood Ratio Test:
Priors    EM         Confusion Matrix    Number of Times Significant
10%       14.7%      18.1%               10
20%       21.4%      24.2%               10
30%       33.0%      34.4%               10
40%       42.5%      42.7%               10
50%       49.2%      49.0%                0
60%       59.0%      57.1%               10
70%       66.8%      64.8%               10
80%       77.3%      73.9%               10
90%       85.6%      80.9%               10

Note: The neural network has been trained on a data set with a priori probabilities of (50%, 50%).
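The significance counts above come from the likelihood ratio test of section 3. A minimal sketch of the test statistic (equation 3.4); the function names and the hard-coded two-class critical value (6.635, the upper 1% point of $\chi^2_1$) are our additions, and we assume the adjusted priors and posteriors have already been computed:

```python
import math

# Upper 1% point of chi-square with 1 degree of freedom (n = 2 classes).
CHI2_99_DF1 = 6.635

def lr_statistic(post_t, post_new, priors_t, priors_new, i=0):
    """2 log [L / L_t] from equation 3.4.

    post_t[k][i]   -- training-conditions posterior p_t(v_i | x_k)
    post_new[k][i] -- adjusted posterior p(v_i | x_k)
    The identity behind equations 3.1-3.3 holds for any fixed class
    index i, so a single column of the posteriors suffices.
    """
    N = len(post_t)
    llr = (sum(math.log(row[i]) for row in post_t)
           - sum(math.log(row[i]) for row in post_new)
           + N * math.log(priors_new[i])
           - N * math.log(priors_t[i]))
    return 2.0 * llr

def prior_shift_significant(stat, critical=CHI2_99_DF1):
    """Adjust the classifier outputs only when this returns True."""
    return stat > critical
```

When the priors have not been adjusted at all, the statistic is exactly zero, so the test can never spuriously fire on an unmodified model.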
Ten replications of the following experimental design were applied. First, a training set of 500 cases of each class was extracted from the data ($p_t(v_1) = p_t(v_2) = 0.50$) and was used for training a neural network with 10 hidden units. For each training set, nine independent test sets of 1000 cases were selected according to the following a priori probability sequence: $p(v_1) = 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90$ (with $p(v_2) = 1 - p(v_1)$). Then, for each test set, the EM procedure (see equation 2.9), as well as the confusion matrix procedure (see equation 2.5), were applied in order to estimate the new a priori probabilities and adjust the a posteriori probabilities provided by the model ($\hat{p}_t(v_1 \mid x) = g(x)$). In each experiment, a maximum of five iteration steps of the EM algorithm was sufficient to ensure the convergence of the estimated probabilities. Table 1 shows the estimated a priori probabilities for $v_1$. With respect to the EM algorithm, it also shows the number of times the likelihood ratio test was significant at $p < 0.01$ over these 10 replications. Table 2 presents the classification rates (computed on the test set) before and after the probability adjustments, as well as when the true priors of the test set ($p(v_i)$, which are unknown in a real-world situation) were used to adjust the classifier's outputs (using equation 2.4). This latter result can be considered an optimal reference in this experimental context. The results reported in Table 1 show that the EM algorithm was clearly superior to the confusion matrix method for a priori probability estimation and that the a priori probabilities are reasonably well estimated. Except in the case where $p(v_i) = p_t(v_i) = 0.50$, the likelihood ratio test revealed in each replication a significant difference (at $p < 0.01$) between the training and the test set a priori probabilities ($\hat{p}_t(v_i) \neq \hat{p}(v_i)$). The a priori estimates appeared to be slightly biased toward 50%; this appears to be a bias of the neural network classifier trained on an equidistributed training set.
Table 2: Classification Rates on the Test Sets, Averaged over 10 Runs, Ringnorm Artificial Data Set.

True      Percentage of Correct Classification
Priors    No Adjustment   After EM   After Confusion Matrix   Using True Priors
10%       90.1%           93.6%      93.1%                    94.0%
20%       90.3%           91.9%      91.7%                    92.2%
30%       88.6%           89.9%      89.8%                    90.0%
40%       90.4%           90.4%      90.4%                    90.6%
50%       87.0%           86.9%      86.8%                    87.0%
60%       90.0%           90.0%      90.0%                    90.0%
70%       89.2%           89.8%      89.7%                    90.2%
80%       89.5%           90.7%      90.7%                    91.0%
90%       88.5%           91.6%      91.3%                    92.0%
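The last column of Table 2 corrects the outputs with the true priors through equation 2.4. The correction itself is just one reweighting and renormalization per observation; a minimal sketch (the function name is ours):

```python
def adjust_posteriors(post_t, priors_t, priors_new):
    """Correct classifier posteriors to new priors (equation 2.4):
    weight each p_t(v_i | x) by p(v_i) / p_t(v_i), then renormalize
    so the corrected posteriors sum to one."""
    adjusted = []
    for row in post_t:
        w = [row[i] * priors_new[i] / priors_t[i] for i in range(len(row))]
        s = sum(w)
        adjusted.append([wi / s for wi in w])
    return adjusted
```

When the model was trained on balanced classes, $p_t(v_i) = 1/n$, so the correction reduces to multiplying each output by the new prior and renormalizing.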
By looking at Table 2 (classification results), we observe that the impact of the adjustment of the outputs on classification accuracy can be significant. The effect was most beneficial when the new a priori probabilities $p(v_i)$ were far from the training set ones ($p_t(v_i) = 0.50$). Notice that in each case, the classification rates obtained after adjustment were close to those obtained by using the true a priori probabilities of the test sets. Although the EM algorithm provides better estimates of the a priori probabilities, we found no difference between the EM algorithm and the confusion matrix method in terms of classification accuracy. This could be due to the high recognition rates observed for this problem. Notice also that we observe a small degradation in classification accuracy if we adjust the a priori probabilities when it is not necessary ($p_t(v_i) = p(v_i) = 0.5$), as indicated by the likelihood ratio test.

4.2 Robustness Evaluation on Artificial Data. This section investigates the robustness of the EM-based procedure with respect to imperfect estimates of the a posteriori probability values provided by the classifier, as well as to the size of the training and the test sets (the test set alone is used to estimate the new a priori probabilities). In order to degrade the classifier outputs, we gradually decreased the size of the training set in steps. Symmetrically, in order to reduce the amount of data available to the EM and the confusion matrix algorithms, we also gradually decreased the size of the test set. For each condition, we compared the classifier outputs with those obtained with a Bayesian classifier based on the true data distribution (which is known for an artificial data set such as Ringnorm). We were thus able to quantify the error level of the classifier with respect to the true a posteriori probabilities (How far is our neural network from the Bayesian classifier?) and to evaluate the effects of a decrease in the training or test sizes on the a priori estimates provided by EM and on the classification performances.
Table 3: Averaged Results for Estimation of the Priors, Ringnorm Data Set, Averaged over 10 Runs.

Training Set   Test Set      Mean Absolute Deviation         Estimated Prior for v1 by Using
Size           Size          (1/N) Σ |b(x_k) − g(x_k)|       EM        Confusion Matrix
(#v1, #v2)     (#v1, #v2)
(500, 500)     (200, 800)    0.107                           22.0%     24.7%
               (100, 400)    0.110                           21.6%     24.5%
               (40, 160)     0.104                           20.4%     23.5%
               (20, 80)      0.122                           22.7%     26.7%
(250, 250)     (200, 800)    0.139                           22.1%     25.3%
               (100, 400)    0.140                           22.6%     25.6%
               (40, 160)     0.134                           23.1%     25.8%
               (20, 80)      0.167                           22.7%     26.0%
(100, 100)     (200, 800)    0.183                           24.1%     27.5%
               (100, 400)    0.185                           24.4%     28.2%
               (40, 160)     0.181                           23.5%     27.3%
               (20, 80)      0.180                           26.6%     29.2%
(50, 50)       (200, 800)    0.202                           24.9%     28.5%
               (100, 400)    0.199                           25.3%     29.0%
               (40, 160)     0.203                           24.3%     27.6%
               (20, 80)      0.189                           22.3%     26.0%

Note: The true priors of the test sets are (20%, 80%), i.e., p(v1) = 0.20.
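The confusion matrix baseline compared in Table 3 solves the linear system of equation 2.5. For two classes the system has a closed form, since $p(v_2) = 1 - p(v_1)$; a minimal sketch (the function name and the clipping to $[0, 1]$ are our additions, the latter because sampling noise can push the algebraic solution outside the probability simplex):

```python
def confusion_matrix_prior_2class(conf_t, decision_freq_d1):
    """Two-class confusion matrix method (equation 2.5).

    conf_t[i][j]     -- p_t(d_i | v_j), estimated on the training set.
    decision_freq_d1 -- p(d_1), the fraction of new observations that
                        the classifier assigns to class v_1.
    Solving p(d_1) = p_t(d_1|v_1) p(v_1) + p_t(d_1|v_2) (1 - p(v_1))
    for p(v_1) gives the closed form below.
    """
    p_d1_given_v1, p_d1_given_v2 = conf_t[0]
    p_v1 = (decision_freq_d1 - p_d1_given_v2) / (p_d1_given_v1 - p_d1_given_v2)
    # Clip: sampling noise can produce a solution outside [0, 1].
    return min(1.0, max(0.0, p_v1))
```

For $n > 2$ classes, the same idea requires solving the full $n \times n$ linear system.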
As for the experiment reported above, a multilayer perceptron was trained on the basis of an equidistributed training set ($p_t(v_1) = 0.5 = p_t(v_2)$). An independent and unbalanced test set (with $p(v_1) = 0.20$ and $p(v_2) = 0.80$) was selected and scored by the neural network. The experiments (10 replications in each condition) were carried out on the basis of training and test sets of decreasing sizes (1000, 500, 200, and 100 cases), as detailed in Table 3. We first compared the artificial neural network's output values ($g(x) = \hat{p}_t(v_1 \mid x)$, obtained by scoring the test sets with the trained neural network) with those provided by the Bayesian classifier ($b(x) = p_t(v_1 \mid x)$, obtained by scoring the test sets with the Bayesian classifier) on the test sets before output readjustment; that is, we measured the discrepancy between the outputs of the neural and the Bayesian classifiers before output adjustment. For this purpose, we computed the averaged absolute deviation between the output values of the neural and the Bayesian classifiers (the average of $|b(x) - g(x)|$). Then, for each test set, the EM and the confusion matrix procedures were applied to the outputs of the neural classifier in order to estimate the new a priori probabilities and the new a posteriori probabilities. The results for a priori probability estimation are detailed in Table 3.

By looking at the mean absolute deviation in Table 3, it can be seen that, as expected, decreasing the training set size results in a degradation in the estimation of the a posteriori probabilities (an increase of absolute deviation of about 0.10 between large, i.e., $N_t = 1000$, and small, i.e., $N_t = 100$, training set sizes). Of course, the prior estimates degraded accordingly, but only slightly. The EM algorithm appeared to be more robust than the confusion matrix method. Indeed, on average (over all the experiments), the EM method overestimated the prior $p(v_1)$ by 3.3%, while the confusion matrix method overestimated it by 6.6%. In contrast, decreasing the size of the test set seems to have very little effect on the results.

Figure 1 shows the classification rates (averaged over the 10 replications) of the neural network before and after the output adjustments made by the EM and the confusion matrix methods.

[Figure 1: Classification rates obtained on the Ringnorm data set. Results are reported for four different conditions: (1) without adjusting the classifier output (no adjustment); (2) adjusting the classifier output by using the confusion matrix method; (3) adjusting the classifier output by using the EM algorithm; and (4) adjusting the classifier output by using the true a priori probabilities of the new data. The results are plotted as a function of the sizes of both the training and the test sets (y-axis: average classification rate in each condition, 80% to 100%; x-axis: conditions from #Training = (50, 50), #Test = (20, 80) up to #Training = (500, 500), #Test = (200, 800)).]

Figure 1 also illustrates the degradation in classifier performance due to the decrease in the size of the training sets: a loss of about 8% between large (i.e., $N_t = 1000$) and small (i.e., $N_t = 100$) training set sizes. The classification rates obtained after the adjustments made by the confusion matrix method are very close to those obtained with the EM method. In fact, the EM method almost always (15 times out of the 16 conditions) provided better results, but the differences in accuracy between the two methods are very small (0.3% on average). As already observed in the first experiment (see Table 2), the classification rates obtained after adjustment by the EM or the confusion matrix method are very close to those obtained by using the true a priori probabilities (a difference of 0.2% on average). Finally, we clearly observe (see Figure 1) that adjusting the outputs of the classifier always increased classification accuracy significantly.

4.3 Tests on Real Data. We also tested the a priori estimation and output readjustment methods on three real medical data sets from the UCI repository (Blake, Keogh, & Merz, 1998) in order to confirm our results on more realistic data. These data sets are Pima Indian Diabetes (2 classes of 268 and 500 cases, 8 features), Breast Cancer Wisconsin (2 classes of 239 and 444 cases after omission of the 16 cases with missing values, 9 features), and Bupa Liver Disorders (2 classes of 145 and 200 cases, 6 features). A training set of 50 cases of each class was selected in each data set and used for training a multilayer neural network; the remaining cases were used for selecting an independent test set. In order to increase the difference between the class distributions in the training set (0.50, 0.50) and the test set, we omitted a number of cases from the smallest class in order to obtain a class distribution of ($p(v_1) = 0.20$, $p(v_2) = 0.80$) for the test set.
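The prior-shifted test sets described above can be constructed by subsampling one class. A minimal sketch (the helper name and the deterministic truncation are ours; in practice the omitted cases would be drawn at random):

```python
def subsample_to_prior(class1_cases, class2_cases, target_p1):
    """Build an unbalanced test set by dropping cases from class 1 so
    that it makes up a fraction target_p1 of the final set (all of
    class 2 is kept): n1 / (n1 + n2) = target_p1 implies
    n1 = target_p1 / (1 - target_p1) * n2."""
    n1 = int(round(target_p1 / (1.0 - target_p1) * len(class2_cases)))
    return class1_cases[:n1], class2_cases
```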
Ten different selections of training and test sets were carried out, and for each of them, the training phase was replicated 10 times, giving a total of 100 trained neural networks for each data set. Averaged over the 100 experiments, Table 4 details the a priori probabilities estimated by means of the EM and the confusion matrix methods, as well as the classification rates before and after the probability adjustments.

Table 4: Classification Results on Three Real Data Sets.

Data Set   True     Estimated Priors         Percentage of Correct Classification
           Priors   EM      Conf. Matrix     No Adj.   After EM   After Conf. Matrix   Using True Priors
Diabetes   20%      24.8%   31.3%            67.4%     76.3%      74.4%                78.3%
Breast     20%      18.0%   26.2%            91.3%     92.0%      92.1%                92.6%
Liver      20%      24.6%   21.5%            68.0%     75.7%      75.5%                79.1%

Note: The neural network has been trained on a learning set with a priori probabilities of (50%, 50%).

These results show that the EM prior estimates were generally better than the confusion matrix ones (except for the Liver data). Moreover, adjusting the classifier outputs on the basis of the new a priori probabilities always increased the classification rates and provided accuracy levels not too far from those obtained by using the true priors to adjust the outputs (given in the last column of Table 4). However, except for the Diabetes data, for which EM gave better results, the adjustments made on the basis of the EM and the confusion matrix methods seemed to have the same effect on the accuracy improvement.

5 Related Work

The problem of estimating the parameters of a model by including unlabeled data in addition to the labeled samples has been studied in both the machine learning and the artificial neural network communities. In this case, we speak about learning from partially labeled data (see, e.g., Shahshahani & Landgrebe, 1994; Ghahramani & Jordan, 1994; Castelli & Cover, 1995; Towell, 1996; Miller & Uyar, 1997; Nigam, McCallum, Thrun, & Mitchell, 2000). The purpose is to use both labeled and unlabeled data for learning, since unlabeled data are usually easy to collect, while labeled data are much more difficult to obtain. In this framework, the labeled part (the training set in our case) and the unlabeled part (the new data set in our case) are combined into one data set, and a partly supervised EM algorithm is used to fit the model (a classifier) by maximizing the full likelihood of the complete set of data (training set plus new data set). For instance, Nigam et al. (2000) use the EM algorithm to learn classifiers that take advantage of both labeled and unlabeled data. This procedure could easily be applied to our problem of adjusting the a posteriori probabilities provided by a classifier to new a priori conditions. Moreover, it makes fully efficient use of the available data.
However, on the downside, the model has to be completely refitted each time it is applied to a new data set. This is the opposite of the approach discussed in this article, where the model is fitted only once, on the training set. When applied to a new data set, the model is not modified; only its outputs are recomputed on the basis of the new observations. Related problems involving missing data have also been studied in applied statistics. Some good recent reference pointers are Scott and Wild (1997) and Lawless, Kalbfleisch, and Wild (1999).

6 Conclusion

We presented a simple procedure for adjusting the outputs of a classifier to new a priori class probabilities. This procedure is a simple instance of the EM algorithm. When deriving this procedure, we relied on three fundamental
assumptions:

1. The a posteriori probabilities provided by the model are reasonably well approximated (our readjustment procedure can be applied only if the classifier provides as output an estimate of the a posteriori probabilities), which means that the model's predicted probabilities of belonging to the classes are sufficiently close to the observed probabilities.

2. The training set selection (the sampling) has been performed on the basis of the discrete dependent variable (the classes), and not of the observed input variable x (the explanatory variable), so that the within-class probability densities do not change.

3. The new data set to be scored is large enough to allow the new a priori class probabilities to be estimated accurately.

If sampling also occurs on the basis of x, the usual sample survey solution to this problem is to use weighted maximum likelihood estimators with weights inversely proportional to the selection probabilities, which are supposed to be known (see, e.g., Kish & Frankel, 1974).

Experimental results show that our new procedure based on EM performs better than the standard method (based on the confusion matrix) for new a priori probability estimation. The results also show that even if the classifier's outputs provide imperfect a posteriori probability estimates, the EM procedure is able to provide reasonably good estimates of the new a priori probabilities. The classifier with adjusted outputs always performs better than the original one if the a priori conditions differ from the training set to the real-world data, and the gain in classification accuracy can be significant. The classification performances after adjustment by EM are relatively close to the results obtained by using the true priors (which are unknown in a real-world situation), even when the a posteriori probabilities are imperfectly estimated. Additionally, the quality of the estimates does not appear to depend strongly on the size of the new data set. All these results enable us to relax, to a certain extent, the first and third assumptions above.

We also observed that adjusting the outputs of the classifier when not needed (i.e., when the a priori probabilities of the training set and the real-world data do not differ) can result in a decrease in classification accuracy. We therefore showed that a likelihood ratio test can be used to decide whether the a priori probabilities have significantly changed from the training set to the new data set. The readjustment procedure should be applied only when a significant change of a priori probabilities is found.
Notice that the EM-based adjustment procedure could be useful in the context of disease prevalence estimation. In this application, the primary objective is the estimation of the class proportions in an unlabeled data set (i.e., the class a priori probabilities); classification accuracy is not important per se. Another important problem, also encountered in medicine, concerns the automatic estimation of the proportions of the different cell populations constituting, for example, a smear or a lesion (such as a tumor). Mixed tumors are composed of two or more cell populations with different lineages, as, for example, in brain glial tumors (Decaestecker et al., 1997). In this case, a classifier is trained on a sample of images of reference cells provided from tumors with a pure lineage (which did not present diagnostic difficulties) and labeled by experts. When a tumor is suspected to be mixed, the classifier is applied to a sample of cells from this tumor (a few hundred) in order to estimate the proportions of the different cell populations. The main motivation for determining the proportions of the different cell populations in these mixed tumors is that the different lineage components may significantly differ with respect to their susceptibility to aggressive progression and may thus influence patients' prognoses. In this case, the primary goal is the determination of the proportions of cell populations, corresponding to the new a priori probabilities. Another practical use of our readjustment procedure is the automatic labeling of geographical maps based on remote sensing information. Each portion of the map has to be labeled according to its nature (e.g., forest, agricultural zone, urban zone). In this case, the a priori probabilities are unknown in advance and vary considerably from one image to another, since they directly depend on the geographical area that has been observed (e.g., urban area, country area). We are now actively working on these biomedical and geographical problems.
Appendix: Derivation of the EM Algorithm

Our derivation of the iterative process (see equation 2.9) closely follows the estimation of mixing proportions of densities (see McLachlan & Krishnan, 1997). Indeed, $p(x \mid v_i)$ can be viewed as a probability density defined by equation 2.1. The EM algorithm supposes that there exists a set of unobserved data, defined as the class labels of the observations of the new data set. In order to pose the problem as an incomplete-data one, associated with the new observed data $X_1^N = (x_1, x_2, \ldots, x_N)$, we introduce as the unobservable data $Z_1^N = (z_1, z_2, \ldots, z_N)$, where each vector $z_k$ is associated with one of the $n$ mutually exclusive classes: $z_k$ represents the class label ($\in \{v_1, \ldots, v_n\}$) of the observation $x_k$. More precisely, each $z_k$ is defined as an indicator vector: if $z_{ki}$ is component $i$ of the vector $z_k$, then $z_{ki} = 1$ and $z_{kj} = 0$ for each $j \neq i$ if and only if the class label associated with observation $x_k$ is $v_i$. For instance, if the observation $x_k$ is assigned to class label $v_i$, then
\[
z_k = [\underbrace{0, \ldots, 0}_{1,\ldots,i-1},\; \underbrace{1}_{i},\; \underbrace{0, \ldots, 0}_{i+1,\ldots,n}]^{T}.
\]
Let us denote by $\pi = [p(v_1), p(v_2), \ldots, p(v_n)]^T$ the vector of a priori probabilities (the parameters) to be estimated. The likelihood of the complete data (for the new data set) is
\[
L(X_1^N, Z_1^N \mid \pi) = \prod_{k=1}^{N} \prod_{i=1}^{n} \left[ p(x_k, v_i) \right]^{z_{ki}}
= \prod_{k=1}^{N} \prod_{i=1}^{n} \left[ p(x_k \mid v_i)\, p(v_i) \right]^{z_{ki}}, \tag{A.1}
\]
where $p(x_k \mid v_i)$ is constant (it does not depend on the parameter vector $\pi$) and the $p(v_i)$ probabilities are the parameters to be estimated. The log-likelihood is
\[
l(X_1^N, Z_1^N \mid \pi) = \log\!\left[ L(X_1^N, Z_1^N \mid \pi) \right]
= \sum_{k=1}^{N} \sum_{i=1}^{n} z_{ki} \log\!\left[ p(v_i) \right]
+ \sum_{k=1}^{N} \sum_{i=1}^{n} z_{ki} \log\!\left[ p(x_k \mid v_i) \right]. \tag{A.2}
\]
Since the $Z_1^N$ data are unobservable, during the E-step we replace the log-likelihood function by its conditional expectation over $p(Z_1^N \mid X_1^N, \pi)$: $E_{Z_1^N}[l \mid X_1^N, \pi]$. Moreover, since we need to know the value of $\pi$ in order to compute $E_{Z_1^N}[l \mid X_1^N, \pi]$ (the expected log-likelihood), we use, as a current guess for $\pi$, the current value (at iteration step $s$) of the parameter vector, $\hat{\pi}^{(s)} = [\hat{p}^{(s)}(v_1), \hat{p}^{(s)}(v_2), \ldots, \hat{p}^{(s)}(v_n)]^T$:
\[
Q(\pi, \hat{\pi}^{(s)}) = E_{Z_1^N}\!\left[ l(X_1^N, Z_1^N \mid \pi) \mid X_1^N, \hat{\pi}^{(s)} \right]
= \sum_{k=1}^{N} \sum_{i=1}^{n} E_{Z_1^N}\!\left[ z_{ki} \mid x_k, \hat{\pi}^{(s)} \right] \log\!\left[ p(v_i) \right]
+ \sum_{k=1}^{N} \sum_{i=1}^{n} E_{Z_1^N}\!\left[ z_{ki} \mid x_k, \hat{\pi}^{(s)} \right] \log\!\left[ p(x_k \mid v_i) \right], \tag{A.3}
\]
where we assumed that the complete-data observations $\{(x_k, z_k),\ k = 1, \ldots, N\}$ are independent. For the expectation of the unobservable data, we obtain
\[
E_{Z_1^N}\!\left[ z_{ki} \mid x_k, \hat{\pi}^{(s)} \right]
= 0 \cdot p(z_{ki} = 0 \mid x_k, \hat{\pi}^{(s)}) + 1 \cdot p(z_{ki} = 1 \mid x_k, \hat{\pi}^{(s)})
= p(z_{ki} = 1 \mid x_k, \hat{\pi}^{(s)})
= \hat{p}^{(s)}(v_i \mid x_k)
= \frac{\dfrac{\hat{p}^{(s)}(v_i)}{\hat{p}_t(v_i)}\, \hat{p}_t(v_i \mid x_k)}
       {\displaystyle\sum_{j=1}^{n} \dfrac{\hat{p}^{(s)}(v_j)}{\hat{p}_t(v_j)}\, \hat{p}_t(v_j \mid x_k)}, \tag{A.4}
\]
where we used equation 2.4 at the last step. The expected log-likelihood is therefore
\[
Q(\pi, \hat{\pi}^{(s)}) = \sum_{k=1}^{N} \sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k) \log\!\left[ p(v_i) \right]
+ \sum_{k=1}^{N} \sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k) \log\!\left[ p(x_k \mid v_i) \right], \tag{A.5}
\]
where $\hat{p}^{(s)}(v_i \mid x_k)$ is given by equation A.4.

For the M-step, we compute the maximum of $Q(\pi, \hat{\pi}^{(s)})$ (see equation A.5) with respect to the parameter vector $\pi = [p(v_1), p(v_2), \ldots, p(v_n)]^T$. The new estimate at step $(s+1)$ will therefore be the value of the parameter vector $\pi$ that maximizes $Q(\pi, \hat{\pi}^{(s)})$. Since we have the constraint $\sum_{i=1}^{n} p(v_i) = 1$, we define the Lagrange function as
\[
\ell(\pi) = Q(\pi, \hat{\pi}^{(s)}) + \lambda \left[ 1 - \sum_{i=1}^{n} p(v_i) \right]
= \sum_{k=1}^{N} \sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k) \log\!\left[ p(v_i) \right]
+ \sum_{k=1}^{N} \sum_{i=1}^{n} \hat{p}^{(s)}(v_i \mid x_k) \log\!\left[ p(x_k \mid v_i) \right]
+ \lambda \left[ 1 - \sum_{i=1}^{n} p(v_i) \right]. \tag{A.6}
\]
By setting $\partial \ell(\pi) / \partial p(v_j) = 0$, we obtain
\[
\sum_{k=1}^{N} \hat{p}^{(s)}(v_j \mid x_k) = \lambda\, p(v_j) \tag{A.7}
\]
for $j = 1, \ldots, n$. If we sum this equation over $j$, we obtain the value of the Lagrange parameter, $\lambda = N$, so that
\[
p(v_j) = \frac{1}{N} \sum_{k=1}^{N} \hat{p}^{(s)}(v_j \mid x_k), \tag{A.8}
\]
and the next estimate of p( vi ) is therefore
( b p s C 1) (vi ) D
N 1 X ( ) b p s ( vi | x k ) , N kD 1
(A.9)
so that equations A.4 (E-step) and A.9 (M-step) are repeated until the convergence of the parameter vector ¼ . The overall procedure is summarized in equation 2.9. It can be shown that this iterative process increases the likelihood (see equation 2.6) at each step (see, e.g., Dempster et al., 1977; McLachlan & Krishnan, 1997). Acknowledgments Part of this work was supported by project RBC-BR 216/4041 from the RÂegion de Bruxelles-Capitale, and funding from the SmalS-MvM. P. L. is supported by a grant under an Action de Recherche ConcertÂee program of the Communaut e Fran¸caise de Belgique. C. D. is a research associate with the FNRS (Belgian National Scientic Research Fund). We also thank the two anonymous reviewers for their pertinent and constructive remarks. References Blake, C., Keogh, E., & Merz, C. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available online at: http://www.ics.uci.edu/ »mlearn/MLRepository.html. Breiman, L. (1998). Arcing classiers. Annals of Statistics, 26, 801–849. Castelli, V., & Cover, T. (1995). On the exponential value of labelled samples. Pattern Recognition Letters, 16, 105–111. Decaestecker, C., Lopes, M.-B., Gordower, L., Camby, I., Cras, P., Martin, J.-J., Kiss, R., VandenBerg, S., & Salmon, I. (1997). Quantitative chromatin pattern description in feulgen-stained nuclei as a diagnostic tool to characterise the oligodendroglial and astroglial components in mixed oligoatrocytomas. Journal of Neuropathology and Experimental Neurology, 56, 391–402. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38. Ghahramani, Z., & Jordan, M. (1994). Supervised learning from incomplete data via an EM algorithm. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 120–127). 
San Mateo, CA: Morgan Kaufmann. Hand, D. (1981). Discrimination and classification. New York: Wiley. Kish, L., & Frankel, M. (1974). Inference from complex samples (with discussion). Journal of the Royal Statistical Society B, 61, 1–37.
Lawless, J., Kalbfleisch, J., & Wild, C. (1999). Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society B, 61, 413–438. McLachlan, G. (1992). Discriminant analysis and statistical pattern recognition. New York: Wiley. McLachlan, G., & Basford, K. (1988). Mixture models, inference and applications to clustering. New York: Marcel Dekker. McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley. Melsa, J., & Cohn, D. (1978). Decision and estimation theory. New York: McGraw-Hill. Miller, D., & Uyar, S. (1997). A mixture of experts classifier with learning based on both labeled and unlabeled data. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 571–578). Cambridge, MA: MIT Press. Mood, A., Graybill, F., & Boes, D. (1974). Introduction to the theory of statistics (3rd ed.). New York: McGraw-Hill. Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134. Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). New York: McGraw-Hill. Richard, M., & Lippmann, R. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 2, 461–483. Saerens, M. (2000). Building cost functions minimizing to some summary statistics. IEEE Transactions on Neural Networks, 11, 1263–1271. Scott, A., & Wild, C. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika, 84, 57–71. Shahshahani, B., & Landgrebe, D. (1994). The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32, 1087–1095. Towell, G. (1996). Using unlabeled data for supervised learning. In D. Touretzky, M. Mozer, & M.
Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 647–653). Cambridge, MA: MIT Press. Received April 19, 2000; accepted March 30, 2001.
LETTER
Communicated by George Gerstein
Unitary Events in Multiple Single-Neuron Spiking Activity: I. Detection and Significance Sonja Grün
[email protected] Department of Neurophysiology, Max-Planck Institute for Brain Research, D-60528 Frankfurt/Main, Germany Markus Diesmann
[email protected] Department of Nonlinear Dynamics, Max-Planck-Institut für Strömungsforschung, D-37073 Göttingen, Germany Ad Aertsen
[email protected] Department of Neurobiology and Biophysics, Institute of Biology III, Albert-Ludwigs-University, D-79104 Freiburg, Germany It has been proposed that cortical neurons organize dynamically into functional groups (cell assemblies) by the temporal structure of their joint spiking activity. Here, we describe a novel method to detect conspicuous patterns of coincident joint spike activity among simultaneously recorded single neurons. The statistical significance of these unitary events of coincident joint spike activity is evaluated by the joint-surprise. The method is tested and calibrated on the basis of simulated, stationary spike trains of independently firing neurons, into which coincident joint spike events were inserted under controlled conditions. The sensitivity and specificity of the method are investigated for their dependence on physiological parameters (firing rate, coincidence precision, coincidence pattern complexity) and temporal resolution of the analysis. In the companion article in this issue, we describe an extension of the method, designed to deal with nonstationary firing rates. 1 Introduction
Neural Computation 14, 43–80 (2001) © 2001 Massachusetts Institute of Technology
In the classical view, firing rates play a central role in neural coding (Barlow, 1972, 1992). This approach indeed led to fundamental insights into the neuronal mechanisms of brain function (Georgopoulos, Taira, & Lukashin, 1993; Hubel & Wiesel, 1977; Newsome, Britten, & Movshon, 1989). In parallel, however, a different concept was developed, in which the temporal organization of spike discharges within functional groups of neurons—
so-called neuronal assemblies (Hebb, 1949)—contribute to neural coding (von der Malsburg, 1981; Abeles, 1982b, 1991; Gerstein, Bedenbaugh, & Aertsen, 1989; Palm, 1990; Singer, 1993). It was argued that the biophysics of synaptic integration favors coincident presynaptic events over asynchronous ones (Abeles, 1982c; Softky & Koch, 1993). Accordingly, synchronized spikes are considered a property of neuronal signals that can be detected and propagated by other neurons (Diesmann, Gewaltig, & Aertsen, 1999). In addition, these spike correlations should be dynamic, reflecting varying affiliations of the neurons, depending on stimulus and behavioral context. Thereby, synchrony of firing would be directly available to the brain as a potential neural code (Perkel & Bullock, 1968; Johannesma, Aertsen, van den Boogaard, Eggermont, & Epping, 1986). Dynamic modulations of spike correlation at various levels of precision have in fact been observed in different cortical areas: visual (Eckhorn et al., 1988; Gray & Singer, 1989; for reviews, see Engel, König, Schillen, & Singer, 1992; Aertsen & Arndt, 1993; Singer & Gray, 1995; Roelfsema, Engel, König, & Singer, 1996; Singer et al., 1997; Singer, 1999), auditory (Ahissar, Bergman, & Vaadia, 1992; Eggermont, 1992; DeCharms & Merzenich, 1996; Sakurai, 1996), somatosensory (Laubach, Wessberg, & Nicolelis, 2000; Nicolelis, Baccala, Lin, & Chapin, 1995; Steinmetz et al., 2000), motor (Murthy & Fetz, 1992; Sanes & Donoghue, 1993; Hatsopoulos, Ojakangas, Paninski, & Donoghue, 1998), and frontal (Aertsen et al., 1991; Abeles, Vaadia, Prut, Haalman, & Slovin, 1993; Abeles, Bergman, Margalit, & Vaadia, 1993; Vaadia et al., 1995; Prut et al., 1998). Little is known, however, about the functional role of temporal organization in such signals. First hints toward the importance of accurate spike patterns came from the work of Abeles and colleagues (Abeles, Vaadia, et al., 1993; Abeles, Bergman, et al., 1993; Prut et al., 1998).
They observed that multiple single-neuron recordings from the frontal cortex of awake, behaving monkeys contain an abundance of recurring precise spike patterns. These patterns had a duration of up to several hundred milliseconds, repeated with a precision of ±1 ms, and occurred in systematic relation to sensory stimuli and behavioral events. To test the hypothesis that cortical neurons coordinate their spiking activity into volleys of precise synchrony, we developed a method to detect the presence of conspicuous spike coincidences in simultaneously recorded multiple single-unit spike trains and to evaluate their statistical significance. We refer to such conspicuous coincidences as unitary events and define them as those joint spike constellations that recur more often than expected by chance (Grün, Aertsen, Abeles, Gerstein, & Palm, 1994; Grün, 1996). Briefly, the algorithm works as follows: The simultaneous observation of spiking events from N neurons is described mathematically by the joint process composed of N parallel point processes. By appropriate binning, this is transformed into an N-fold (0,1)-process, the statistics of which are described by the set of activity vectors reflecting the various (0,1)-constellations occurring across the neurons. Under the null hypothesis of independent firing,
the expected number of occurrences of any activity vector and its probability distribution can be calculated analytically on the basis of the single-neuron firing rates. The degree of deviation from independence is derived by comparing these theoretically derived values with their empirical counterparts. Those activity vectors that violate the null hypothesis of independence define potentially interesting occurrences of joint-events; their composition defines the set of neurons that are momentarily engaged in synchronous activity. To test the significance of such unitary coincident events, we developed a new statistical measure: the joint-surprise. For any particular activity vector, the joint-surprise measures the cumulative probability of finding the observed number of coincidences or an even larger one by chance. To account for nonstationarities in the discharge rates, modulations in spike rates and coincidence rates are determined on the basis of short data segments by sliding a fixed time window (typically 100 ms wide) along the data in steps of the coincidence bin width. This segmentation is applied to each trial, and the data of corresponding segments in all trials are analyzed as one quasi-stationary data set, using the appropriate rate approximation. Having first ascertained the statistical significance of brief epochs of synchronous spiking, the functional significance of such unitary coincident events is then tested by investigating the times of their occurrence and their composition in relation to sensory stimuli and behavioral events. Thus, Riehle, Grün, Diesmann, and Aertsen (1997) found that simultaneously recorded activities of neurons in monkey primary motor cortex exhibited context-dependent, rapid changes in the patterns of coincident action potentials during performance of a delayed-pointing task.
Accurate spike synchronization occurred in relation to external events (visual stimuli, hand movements), commonly accompanied by discharge rate modulations, however, without precise time locking of the spikes to these external events. Accurate spike synchronization also occurred in relation to purely internal events (stimulus expectancy), where firing-rate modulations were distinctly absent. These findings indicate that internally generated synchronization of individual spike discharges may subserve the cortical organization of cognitive motor processes. The clear correlation of the precise spike coincidences with behavioral events was interpreted as evidence for their functional relevance (Riehle et al., 1997; Fetz, 1997). Thus, unitary event analysis evoked a considerable amount of interest in the ongoing debate on spike synchronization (Shadlen & Newsome, 1998; Diesmann et al., 1999) and its detectability in experimental data (Pauluis & Baker, 2000; Roy, Steinmetz, & Niebur, 2000; Gütig, Aertsen, & Rotter, in press). It is currently used in a number of laboratories (Pauluis, 1999; Grammont & Riehle, 1999; Riehle, Grammont, Diesmann, & Grün, 2000). Here we provide for the first time a full account of the unitary event method and discuss its underlying principles in detail. In this article, we describe
the theory and statistical background of the analysis method for stationary conditions—when the firing rates of the neurons under observation do not change as a function of time. Simulated spike trains, consisting of parallel, independent Poisson processes into which we inserted particular coincident spike constellations under controlled conditions, were used to test and calibrate the method. In the companion article in this issue, we extend the method to deal with nonstationary firing rates and to illustrate its potential by analyzing multiple single-neuron recordings from frontal and motor cortical areas in awake, behaving monkeys. Preliminary descriptions of the method have been presented in abstract form (Grün, Aertsen, Abeles, & Gerstein, 1993; Grün et al., 1994; Grün & Aertsen, 1998) and in Riehle et al. (1997).

2 Detecting Unitary Events in Joint Spiking Activity

2.1 Representation of Joint Spiking Activity. By introducing a temporal resolution $\Delta$, the spike train of a single neuron $i$ recorded over a time interval of length $T$ can be represented by a binary process $v_i(t)$, that is, as a (0,1)-sequence. With $T = \lfloor T/\Delta \rfloor$ denoting the total number of time steps, we define

$$v_i(t) = \begin{cases} 1, & \text{if spike in } [t, t+\Delta) \\ 0, & \text{if no spike in } [t, t+\Delta), \end{cases} \qquad t = 0, 1\Delta, 2\Delta, \ldots, (T-1)\Delta. \tag{2.1}$$

The minimal $\Delta$ is set by the spike time resolution $h$ (in the data we analyzed, typically 1 ms). In our analysis, we used integer multiples of the data resolution, $\Delta = b h$, for the binning grid, with $b$ serving as a control parameter for the analysis precision ($\Delta$ is called the analysis bin size). Thus, each point in time is assigned to a unique bin, which we refer to as exclusive binning. A single bin, however, may contain more than one spike. Equation 2.1 guarantees that $v_i(t)$ is restricted to 1, even if the corresponding bin contains more than one spike, which we refer to as clipping. The simultaneous observation of spike events from $N$ neurons can now be represented in this binary framework. The activities of the individual neurons $i$ are described by parallel binary sequences $v_i(t)$. Alternatively, we can describe the $N$ sequences as a single sequence of a vector-valued function $\mathbf{v}(t)$, the components of which are formed by the $v_i(t)$:

$$\mathbf{v}(t) = \begin{pmatrix} v_1(t) \\ \vdots \\ v_i(t) \\ \vdots \\ v_N(t) \end{pmatrix}, \qquad i = 1, \ldots, N; \quad v_i \in \{0, 1\}. \tag{2.2}$$
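The exclusive binning with clipping described by equation 2.1 can be sketched in a few lines (a minimal sketch; the function name and example spike times are ours, not from the article):

```python
# Minimal sketch of exclusive binning with clipping (equation 2.1): each
# spike time is assigned to exactly one bin of width delta, and a bin is
# set to 1 if it contains at least one spike.

def bin_spike_train(spike_times_ms, duration_ms, delta_ms):
    """Return the clipped (0,1)-sequence v_i(t) for one neuron."""
    n_steps = int(duration_ms // delta_ms)      # T = floor(T / delta)
    v = [0] * n_steps
    for t in spike_times_ms:
        b = int(t // delta_ms)                  # exclusive binning
        if b < n_steps:
            v[b] = 1                            # clipping: at most one 1 per bin
    return v

# Two spikes falling into the same 3-ms bin are clipped to a single 1.
v = bin_spike_train([0.4, 1.2, 7.0], duration_ms=9, delta_ms=3)
print(v)  # [1, 0, 1]
```

Note that with the analysis bin size $\Delta = bh$, the spike times would first be recorded at resolution $h$ and then regrouped on the coarser grid.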
Figure 1: $N$ parallel binary processes, lasting for $T$ time steps. Each horizontal row, consisting of a sequence of 0s and 1s, represents a realization of a single process $v_i$. The 1s mark the occurrences of spike events. The joint activity across the processes at each instant in time can be expressed by a vector $\mathbf{v}(t)$, as indicated for one example. The empirical firing probability per bin $p_i$ of each single process is evaluated as the marginal probability: the number of spikes in the observation time interval, divided by the number of time steps.
This scheme is illustrated in Figure 1. At each time step, $\mathbf{v}(t)$ equals one of the $m = 2^N$ possible constellations of 0s and 1s (coincidence patterns). The $m$ possible constellations $\mathbf{v}^k$ are identified by a (for now) arbitrary index function $k$ (e.g., let $\mathbf{v}^k$ be mapped to an integer $k \in \{1, \ldots, m\}$ by interpreting $\mathbf{v}^k$ as the binary representation of an integer $(v_N^k \ldots v_1^k)_2$ and defining $k = (v_N^k \ldots v_1^k)_2 + 1$). The empirical number of occurrences of a coincidence pattern $\mathbf{v}^k$ in data recorded over an interval $T$ is called $n_k$.

2.2 The Null-Hypothesis of Independent Firing. We are interested in detecting the (sub)groups of neurons jointly involved in a cell assembly—the neurons that act in an interdependent manner. To distinguish these neurons from those that are not involved, we develop a statistical tool to test the null hypothesis ($H_0$) of independent neurons. Under this null hypothesis, the joint probability $P_k = P(\mathbf{v}^k)$ of a coincidence pattern $\mathbf{v}^k$ (a particular constellation of spikes and nonspikes across the observed neurons) equals the product of the probabilities of the individual events:

$$H_0: \quad P_k = \prod_{i=1}^{N} P\left(v_i^k\right), \qquad \text{with } P\left(v_i^k\right) = \begin{cases} P(v_i = 1), & \text{if } v_i^k = 1 \\ 1 - P(v_i = 1), & \text{if } v_i^k = 0. \end{cases} \tag{2.3}$$

Equation 2.3 assumes independence of the $N$ neuronal processes. In addition, we now make the assumption that the binary sequence describing the activity of a single neuron (equation 2.1) represents a sequence of Bernoulli
trials (Feller, 1968). The probability of a specific outcome $v_i(t)$ does not depend on the value of $v_i$ at any other point in time. A Poisson process, often used to model neuronal spike trains (see section 5), leads to a binary sequence in accordance with the above assumption. Equation 2.3 is based on precise knowledge of the single-neuron firing probabilities $p_i = P(v_i = 1)$. However, in the experimental situation, the firing probabilities are typically unknown and have to be estimated from the data. The simplest scheme is to adopt the frequency interpretation (e.g., Feller, 1968) and use the number of spike events $c_i$ in the observation interval $T$ containing $T$ time steps to calculate the probability $p_i = c_i / T$ as an estimate for the firing probability of neuron $i$ (for an alternative approach, see Gütig et al., in press). The task now is to develop a tool that enables us to judge whether the empirical number of occurrences $n_k^{\mathrm{emp}}$ of a particular coincidence pattern $\mathbf{v}^k$ deviates significantly from the expected number $n_k^{\mathrm{pred}}$.
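The estimation of the $p_i$ and the resulting pattern probabilities under $H_0$ can be sketched as follows (a minimal sketch on invented toy data; the helper names are ours):

```python
# Sketch of the null hypothesis of equation 2.3: estimate p_i = c_i / T
# from the data matrix, compute the joint probability P_k of a coincidence
# pattern v^k as a product over neurons, and compare the expected count
# n_pred = P_k * T with the empirical count n_emp.

def pattern_probability(pattern, p):
    """P(v^k) under independence: product of p_i or (1 - p_i)."""
    P_k = 1.0
    for v_ik, p_i in zip(pattern, p):
        P_k *= p_i if v_ik == 1 else 1.0 - p_i
    return P_k

# Toy binary data matrix: N = 3 neurons, T = 8 time steps.
data = [
    [1, 0, 0, 1, 0, 0, 0, 0],   # neuron 1: c_1 = 2, p_1 = 0.25
    [0, 1, 0, 1, 1, 0, 1, 0],   # neuron 2: c_2 = 4, p_2 = 0.5
    [1, 0, 0, 1, 0, 0, 0, 0],   # neuron 3: c_3 = 2, p_3 = 0.25
]
T = len(data[0])
p = [sum(row) / T for row in data]            # marginal firing probabilities

# Pattern "spike in all three neurons":
P_k = pattern_probability([1, 1, 1], p)       # 0.25 * 0.5 * 0.25 = 0.03125
n_pred = P_k * T                              # expected count: 0.25
n_emp = sum(all(row[t] == 1 for row in data) for t in range(T))
print(P_k, n_pred, n_emp)                     # 0.03125 0.25 1
```

The single empirical triple coincidence here exceeds its expectation; whether such a surplus is significant is exactly what the joint-surprise of section 2.4 quantifies.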
2.3 Describing Independently Spiking Neurons by Multiple Bernoulli Trials. Consider a set of parallel realizations of the $N$ binary processes with duration $T$. The $N$ resulting $T$-dimensional row vectors can be combined to a matrix of 0s and 1s with $N$ rows and $T$ columns (see Figure 1). According to the assumptions made in the previous section, the probability of a particular outcome in a specific matrix element does not depend on the outcome in any of the other matrix elements. Equivalently, we can describe the realization as a succession of $T$ $N$-dimensional column vectors (coincidence patterns). A process generating such $N$-dimensional events is called a multiple Bernoulli trial (Feller, 1968). Following Feller (1968), we can write the probability of finding each pattern $\mathbf{v}^k$ exactly $n_k$ times in the observation interval directly in terms of the $P_k$:

$$\psi(n_1, n_2, \ldots, n_m; P_1, P_2, \ldots, P_m; T) = \frac{T!}{\prod_{k=1}^{m} n_k!} \cdot \prod_{k=1}^{m} P_k^{n_k}. \tag{2.4}$$

This expression represents a generalization of the binomial distribution to a process with more than two possible outcomes and is called a multinomial distribution. The $P_k$ and $n_k$ are subject to the normalizing conditions

$$\sum_{k=1}^{m} P_k = 1 \tag{2.5}$$

$$\sum_{k=1}^{m} n_k = T. \tag{2.6}$$
For any particular spike constellation $\mathbf{v}^k$ (defining that particular $\mathbf{v}^k$ as "the" outcome and all the rest as "the others"), the probability distribution in equation 2.4 can be reduced to the binomial distribution. For such selected
Figure 2: (A) Three examples of Poisson distributions, for parameters $n^{\mathrm{pred}} = 5$, 15, and 50 (from left to right). (B) The black shaded area under the Poisson distribution ($n^{\mathrm{pred}} = 15$), ranging from $n^{\mathrm{emp}} = 25$ to infinity, indicates the joint-p-value $\Psi$ as the cumulative probability. For this example, the joint-p-value equals 0.0112. (C) The joint-surprise $S$ shown as a logarithmic scaling function of the joint-p-value. The dash-dotted line equals the surprise measure as defined by Palm (1981), and the solid line shows the continuous, differentiable version used here: the joint-surprise (see equation 2.10). The value of the joint-surprise corresponding to the joint-p-value in the example in B is 1.9459.
constellation $\mathbf{v}^k$, we obtain

$$\psi(n_k; P_k; T) = \frac{T!}{n_k! \, (T - n_k)!} \cdot P_k^{n_k} \cdot (1 - P_k)^{T - n_k}, \qquad k = 1, \ldots, m. \tag{2.7}$$

Since the number of time steps $T$ (or $T_b = T/b$ for a binning grid $b$) is usually large for $bh$ in the order of 1 ms and the associated probabilities $P_k$ are small, while their product $P_k \cdot T$ remains moderate, equation 2.7 can be approximated by the Poisson distribution (Feller, 1968; see Figure 2A):

$$\psi(n_k; P_k; T) = \frac{(P_k \cdot T)^{n_k}}{n_k!} \cdot \exp(-P_k \cdot T), \qquad k = 1, \ldots, m. \tag{2.8}$$

Here, $P_k \cdot T$ is the rate parameter of the Poisson distribution, defining the expected number of occurrences $n_k^{\mathrm{pred}} = P_k \cdot T$ of the joint spike constellation $\mathbf{v}^k$.
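The quality of this approximation in the relevant regime can be checked numerically (a short sketch of ours; the particular values of $T$ and $P_k$ are illustrative assumptions):

```python
# Numerical check of the approximation of the binomial distribution
# (equation 2.7) by the Poisson distribution (equation 2.8) for large T,
# small P_k, and moderate P_k * T.

from math import comb, exp, factorial

def binomial_pmf(n_k, P_k, T):
    return comb(T, n_k) * P_k**n_k * (1.0 - P_k)**(T - n_k)

def poisson_pmf(n_k, mu):
    return mu**n_k / factorial(n_k) * exp(-mu)

T, P_k = 100_000, 1.5e-4     # many time steps, small pattern probability
mu = P_k * T                 # rate parameter: n_pred = 15
for n_k in (5, 15, 25):
    b, p = binomial_pmf(n_k, P_k, T), poisson_pmf(n_k, mu)
    print(n_k, round(b, 6), round(p, 6))   # the two values agree closely
```

For these parameters the two probability mass functions differ by well under one percent at each count, which is why the Poisson form can be used throughout the significance calculation.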
2.4 Significance of Joint-Events: The Joint-Surprise. For each of the $m$ constellations $\mathbf{v}^k$ in the observation set, equation 2.7 describes the null hypothesis of independent component processes. The expected number of occurrences $n_k^{\mathrm{pred}} = P_k \cdot T$ defines the center of mass of the distribution.
Thus, for each empirical number of occurrences $n_k^{\mathrm{emp}}$, we can now compute the statistical significance of the deviation from independence. It is defined as the cumulative probability of finding the observed number of occurrences $n_k^{\mathrm{emp}}$ or an even larger one (an alternative approach of measuring deviation from independence using the framework of information theory is discussed in appendix C). We call this cumulative probability the joint-p-value $\Psi$, defined by

$$\Psi\left(n_k^{\mathrm{emp}} \mid n_k^{\mathrm{pred}}\right) = \sum_{n_k = n_k^{\mathrm{emp}}}^{\infty} \psi\left(n_k, n_k^{\mathrm{pred}}\right) = \sum_{n_k = n_k^{\mathrm{emp}}}^{\infty} \frac{\left(n_k^{\mathrm{pred}}\right)^{n_k}}{n_k!} \cdot \exp\left(-n_k^{\mathrm{pred}}\right), \qquad k = 1, \ldots, m. \tag{2.9}$$
It can efficiently be evaluated numerically using the connection to the regularized incomplete gamma function (Press, Teukolsky, Vetterling, & Flannery, 1992). Figure 2B shows $\Psi$ as the black area under the distribution. The smaller this area is, the higher is the significance of the corresponding count. For convenient visualization, the joint-p-value is logarithmically scaled to yield the joint-surprise $S$ (see Figure 2C),

$$S(\Psi) = \log_{10} \frac{1 - \Psi}{\Psi}, \tag{2.10}$$

such that

if $n_k^{\mathrm{emp}} > n_k^{\mathrm{pred}}$, then $S > 0$;
if $n_k^{\mathrm{emp}} \approx n_k^{\mathrm{pred}}$, then $S \approx 0$;
if $n_k^{\mathrm{emp}} < n_k^{\mathrm{pred}}$, then $S < 0$.
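The joint-p-value and joint-surprise can be evaluated by direct summation of the Poisson tail (a minimal sketch; the helper names are ours, and for large counts one would instead use the regularized incomplete gamma function mentioned above):

```python
# Sketch of equations 2.9 and 2.10: the joint-p-value is the upper tail of
# the Poisson distribution with rate n_pred, computed here via the
# complementary lower tail; the joint-surprise is its logarithmic transform.

from math import exp, log10

def joint_p_value(n_emp, n_pred):
    """Psi = P(N >= n_emp) for N ~ Poisson(n_pred)  (equation 2.9)."""
    if n_emp == 0:
        return 1.0
    term = cdf = exp(-n_pred)
    for n in range(1, n_emp):            # accumulate P(N = 0 .. n_emp - 1)
        term *= n_pred / n
        cdf += term
    return 1.0 - cdf

def joint_surprise(psi):
    """S = log10((1 - Psi) / Psi)  (equation 2.10)."""
    return log10((1.0 - psi) / psi)

# The example of Figures 2B and 2C: n_pred = 15, n_emp = 25.
psi = joint_p_value(25, 15.0)
print(round(psi, 4), round(joint_surprise(psi), 2))  # close to the quoted 0.0112 and 1.9459
```

With this scaling, the thresholds $S_\alpha = 1.28$ and $S_\alpha = 2$ used later correspond to significance levels $\alpha = 0.05$ and $\alpha = 0.01$, since $\log_{10}(0.95/0.05) \approx 1.28$ and $\log_{10}(0.99/0.01) \approx 2$.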
2.5 Unitary Events: Definition and Detection. On the basis of the joint-surprise as a measure of significance, we now define unitary events as those joint spike events $\mathbf{v}^k$ in a given interval $T$ that occur much more often than expected by chance. To that end, we set a threshold $S_\alpha$ on the joint-surprise measure and denote the occurrences of those $\mathbf{v}^k$ for which

$$S\left(n_k^{\mathrm{emp}} \mid n_k^{\mathrm{pred}}\right) \geq S_\alpha \tag{2.11}$$

as unitary events. The arguments of $S(n_k^{\mathrm{emp}} \mid n_k^{\mathrm{pred}})$ remind us that this test is performed separately for each coincidence pattern $\mathbf{v}^k$ in the interval $T$. The raster display (or dot display) is the standard tool used by the electrophysiologist to look for temporal structure in the "raw" spike data (see Figure 3). Since we are interested in the dynamics of assembly activation, we want to detect the joint spike constellations that possibly express assembly activity as they occur in time. To visualize occurrences of a potentially interesting coincidence pattern in relation to other instances of the same pattern, other patterns, or other events (e.g., stimuli, behavioral events), we mark all spikes in all instantiations of a significant pattern $\mathbf{v}^k$ by squares (see Figure 3, lower row). For any $S_\alpha$, there is necessarily a certain probability of detecting a $\mathbf{v}^k$ as significant in a realization of independent processes (false positive). Given a realization of dependent processes generating a surplus of $\mathbf{v}^k$, there is a certain probability not to detect the pattern as significant (false negative). To obtain maximum sensitivity while maintaining a minimum level of false
Figure 3: Dot displays in the two top panels show the simultaneous activity of six simulated neurons: independent (A) and dependent (B) firing. Firing rates are: neuron 1: 10 s⁻¹; 2: 20 s⁻¹; 3: 15 s⁻¹; 4: 30 s⁻¹; 5: 25 s⁻¹; 6: 15 s⁻¹. The spike trains in B are generated by first copying the spike trains of A. Dependencies between neurons are then introduced by injecting coincident events, consisting of neuron pairs 1,3 and 2,5 (both at a coincidence rate of 1 s⁻¹), randomly distributed in time over all the trials. Each box contains the spike activity of a single neuron over 100 trials of 1000 ms duration. Each dot represents a spike at the time of its occurrence. Trials are organized in rows. Bottom panels: Spikes belonging to statistically significant constellations (unitary events) are marked by squares. Observe the different numbers of occurrences of unitary events in A and B due to the injected coincidences. In addition, in B, some of the constellations containing the injected spikes as subpatterns are also detected as significant events.
positives (see elaboration in section 4), we set the threshold $S_\alpha$ to a level between 1.28 and 2. This corresponds to a significance level $\alpha$ between 0.05 and 0.01, a commonly used threshold level in statistical significance tests (e.g., Hays, 1994, "p-value"). The coincidence patterns $\mathbf{v}^k$ contain different numbers of spikes, ranging from 0 to $N$. We call the number of spikes in a pattern its complexity:

$$\xi(\mathbf{v}^k) = \sum_{i=1}^{N} v_i^k. \tag{2.12}$$
There are $\binom{N}{\xi}$ patterns of complexity $\xi$. Because each of the patterns is assigned a complexity $\xi \in \{0, \ldots, N\}$, we recover the total number of patterns by the binomial theorem,

$$\sum_{\xi=0}^{N} \binom{N}{\xi} = 2^N.$$
The single pattern of complexity 0 (no spike) and the $N$ patterns of complexity 1 (spike from one neuron) do not represent joint spiking activity in the natural sense. Therefore, we typically concentrate on patterns with $\xi(\mathbf{v}^k) > 1$. For a significant $\mathbf{v}^k$ with $\xi > 1$, each square in a dot display has a counterpart in at least one other box of the dot display at the same time instant. The procedure is illustrated in Figure 3A for simulated realizations of six independent processes and in Figure 3B for six dependent parallel processes. Here, all patterns with $\xi(\mathbf{v}^k) > 1$ are tested independently using equation 2.11, and all significant occurrences are marked according to the convention. However, there is no need to visualize all patterns simultaneously. In an application of the method to experimental data, it might be useful to generate separate raster displays for individual patterns or subsets of patterns. The simulation, like all further simulations, was performed as follows. Several (here $N = 6$) spike trains of 100 s duration were generated using independent homogeneous Poisson processes, each with a particular rate parameter $\lambda_i$. The single spike trains were then combined, as if they had been recorded simultaneously from as many neurons. For visualization, spike data are organized in 100 consecutive trials of 1 s duration (see Figure 3). Time resolution was set to $h = 1$ ms. In addition, into one of the data sets (see Figure 3B) we introduced statistical dependencies by injecting pairs of simultaneous spikes into the spike trains of neuron pairs 1,3 and 2,5, respectively. Both coincidences occurred at a rate of 1 s⁻¹ and were randomly distributed in time, such that on average, each trial contained one injected coincident event. In the context of this article, it is important to note that spike trains were generated by stationary processes and that the analysis was performed once, taking into account the entire data set. "Trials"
are introduced here only for visualization. An equivalent description is that each box in Figure 3 displays the activity of a neuron as a page of text (i.e., written left to right and top to bottom). The concept of a trial becomes important only when treating nonstationary data (see the companion article). When data are organized in trials, $T$ is understood to specify the duration of a trial (in time steps), and $M \cdot T$ is the duration of the full data set, with $M$ indicating the number of trials. As expected, the raster displays of the two data sets look very similar (Figure 3, top row). Since the rate of injected coincident spikes (1 s⁻¹) is low compared to the baseline firing rates, comparison of corresponding firing rates in the two data sets does not reveal any noticeable difference (not shown here). The analysis for unitary events was performed with the threshold level set at $\alpha = 0.05$. With $2^N - \binom{N}{1} - \binom{N}{0} = 57$ patterns independently tested at $\alpha = 0.05$, we expect to find 2.85 patterns marked as significant. Observe that the dependent data set in Figure 3B (bottom) exhibits many unitary events, whereas the independent data set in Figure 3A (bottom) has almost none. Moreover, the few unitary events in the independent data set consist of spike patterns of complexity 3 and 4, appearing three times and once, respectively. Their significance is due to statistical fluctuations (we will return to this dependence on pattern complexity). In the dependent data set, however, almost all significant constellations correspond to the injected coincidences (between neurons 1 and 3, and neurons 2 and 5). In addition, some higher-order constellations containing the injected spikes as subpatterns also appear as unitary events, leading to the few squares in the raster display of neuron 6. As was to be expected from the random insertion times of the injected coincidences, the unitary events appear randomly distributed over time and trials.
In real neuronal data, however, their times of occurrence may provide information concerning the dynamics of these potentially interesting constellations and their relation to stimuli or behavioral events (Riehle et al., 1997).
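The simulation procedure described above can be sketched as follows (parameters follow the text: 6 neurons, 100 s at 1-ms resolution, rates of 10 to 30 s⁻¹, coincidences injected at 1 s⁻¹; the variable names are ours, and a simple Bernoulli-per-bin process stands in for the clipped Poisson trains):

```python
# Sketch of the simulation: independent background activity plus injected
# pairwise coincidences, compared against the coincidence count predicted
# from the marginal probabilities.

import random

random.seed(0)
N, T, h = 6, 100_000, 1e-3                # 100 s of data in 1-ms time steps
rates = [10, 20, 15, 30, 25, 15]          # firing rates lambda_i in s^-1
lam_c = 1.0                               # injected coincidence rate in s^-1

# Independent background activity: spike probability lambda_i * h per bin.
v = [[1 if random.random() < lam * h else 0 for _ in range(T)] for lam in rates]

# Inject coincident spikes into neuron pair 1,3 (list indices 0 and 2).
for t in random.sample(range(T), int(lam_c * h * T)):
    v[0][t] = 1
    v[2][t] = 1

# Empirical pair-coincidence count versus the prediction from the marginal
# probabilities (which cannot separate base rate and injection rate).
n_emp = sum(v[0][t] & v[2][t] for t in range(T))
p1, p3 = sum(v[0]) / T, sum(v[2]) / T
n_pred = p1 * p3 * T
print(n_emp, round(n_pred, 1))            # clear surplus of coincidences
```

On such data the injected pairs produce an empirical coincidence count several times larger than the independence prediction, which is what the joint-surprise test flags as unitary events.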
3 Dependence of Joint-Surprise on Physiological Parameters
Having derived the joint-surprise as a measure for the statistical significance of joint spiking events, we now investigate its performance with respect to various physiologically relevant parameters: the firing rates of the neurons under consideration, the time resolution (bin size) chosen for the analysis, the rate of spike coincidences, their coincidence accuracy (allowing the biological system some degree of noise), and the number of neurons involved. To this end, we calibrate the performance of the joint-surprise by applying it to appropriately designed sets of simulated data. As before, the control data sets consist of independently generated Poisson trains of varying base rates. These are compared to different data sets, containing additionally injected coincidences of varying complexities and coincidence rates. Typically, the
simulated data consisted of $M = 100$ trials of 1000 ms each and a time resolution of $h = 1$ ms. The rates of the random Poisson trains were chosen to cover a physiologically realistic range for cortical neurons—between 10 and 100 s⁻¹.

3.1 Influence of the Firing Rate. To investigate the influence of the neurons' firing rates, we studied two parallel spike trains generated as independent Poisson processes, both with the same constant rate. We varied this rate from $\lambda = 10$ to 100 s⁻¹ in steps of 10 s⁻¹, in the presence of different constant injection rates $\lambda_c$. Expectation values for the number of coincidences in the data set, $n^{\mathrm{emp}}$, and the number of coincidences expected to occur assuming independence, $n^{\mathrm{pred}}$, are:

$$n^{\mathrm{emp}} = \left[\lambda_c h + (\lambda h)^2\right] \cdot MT$$
$$n^{\mathrm{pred}} = \left[(\lambda_c + \lambda) h\right]^2 \cdot MT. \tag{3.1}$$
The probability per time step for a coincidence in the presence of injected coincidences is the sum of the probability of seeing an injected coincidence, $\lambda_c h$, and of seeing a chance coincidence, $(\lambda h)^2$. For experimental data, we have to estimate the firing rates from the data set. The marginal probabilities (spike count divided by time interval) cannot distinguish between the base rate and the injection rate. Therefore, we obtain $\lambda_c + \lambda$ as the expectation value for the firing rate and $[(\lambda_c + \lambda)h]^2$ as the expectation value for the probability to find a coincidence assuming independence. Further corrections for the specific injection process are discussed in Grün, Diesmann, Grammont, Riehle, and Aertsen (1999). Values obtained for $n^{\mathrm{emp}}$ and $n^{\mathrm{pred}}$ in different realizations fluctuate around their expectation values. To visualize the effect of statistical fluctuations, we generated 10 data sets for each rate level. Figure 4A (top) shows that the empirical numbers of coincidences $n^{\mathrm{emp}}$ (diamonds) indeed match the number expected assuming independence $n^{\mathrm{pred}}$ (solid lines), apart from small statistical fluctuations. The number of coincidences exhibits a convex dependence on background firing rate. From equation 3.1, we know that the increase is quadratic, $(h^2 \cdot MT)$ being the coefficient of the leading power. At $\lambda = 0$, that is, with only injected spikes in both situations, the expectation values of $n^{\mathrm{emp}}$ and $n^{\mathrm{pred}}$ are $(\lambda_c h) \cdot MT$ and $(\lambda_c h)^2 \cdot MT$, respectively. Comparison of the expressions for $n^{\mathrm{emp}}$ and $n^{\mathrm{pred}}$ in equation 3.1 shows that in the regime $(\lambda_c + 2\lambda)h < 1$, the difference decreases linearly with $\lambda$. Variability in counts increases with firing rate because of the well-known property of the Poisson distribution (see equation 2.8) that the count variance equals the mean. Figure 4 (top row) demonstrates that the expected number of coincidences assuming independence, $n^{\mathrm{pred}}$, also exhibits fluctuations. These fluctuations are caused by the fact that the firing rates have to be estimated from the data.
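The expectation values in equation 3.1 can be checked with a short sketch (the function name and parameter defaults are ours; $h = 1$ ms and $MT = 100{,}000$ time steps as in the simulations):

```python
# Sketch of equation 3.1: expected empirical and predicted pair-coincidence
# counts as functions of the background rate lam and injection rate lam_c.

def expected_counts(lam, lam_c, h=1e-3, MT=100_000):
    """Return (n_emp, n_pred) from equation 3.1."""
    n_emp = (lam_c * h + (lam * h) ** 2) * MT
    n_pred = ((lam_c + lam) * h) ** 2 * MT
    return n_emp, n_pred

# With lam_c = 1 s^-1 the surplus n_emp - n_pred shrinks as lam grows,
# which is why the joint-surprise declines with background rate (Figure 4).
for lam in (10, 50, 100):
    n_emp, n_pred = expected_counts(lam, 1.0)
    print(lam, round(n_emp, 1), round(n_pred, 1))
```

For example, at $\lambda = 10$ s⁻¹ and $\lambda_c = 1$ s⁻¹ this gives $n^{\mathrm{emp}} = 110$ versus $n^{\mathrm{pred}} = 12.1$, while at $\lambda = 100$ s⁻¹ the two values (1100 versus 1020.1) are much closer in relative terms.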
The variance is given by

$$\sigma^2_{n_{pred}} = [(\lambda + \lambda_c) h]^2 \, (2\lambda + 4\lambda_c)\, h \cdot MT, \tag{3.2}$$
S. Grün, M. Diesmann, and A. Aertsen
[Figure 4 appears here. Panels A-C (λ_c = 0.0, 0.5, 1.0 s⁻¹): coincidence count n (top) and joint-surprise S (bottom) versus background rate λ (s⁻¹).]
Figure 4: Detection of coincident events under different firing-rate conditions. Simulated data from two parallel processes were analyzed for the presence of coincident events. Realizations consisting of 100 trials, each of duration 1000 ms, were generated with a time resolution of 1 ms. Rates of both processes were varied from 10 to 100 s⁻¹ in steps of 10 s⁻¹. The experiment was repeated 10 times at each rate, to visualize statistical variations. Coincident events, also generated by Poisson processes, were injected into each of the 10 data sets at one of two coincidence rates (B: 0.5 s⁻¹, C: 1 s⁻¹). The results of the control experiments without injected events are shown in A. Data were analyzed for the number of empirical occurrences (top row, diamonds) versus expected level (top row, solid lines, theoretical curve in gray), estimated from the marginal probabilities. The corresponding joint-surprise is shown in the bottom panels (diamonds, theoretical curve in gray). Results for the 10 realizations per firing rate are grouped together, giving rise to the stairway-like appearance of the plots. Horizontal lines in the bottom panels indicate the significance threshold (α = 0.01).
which, with λ_c < λ, is bounded by

$$3\,[(\lambda + \lambda_c) h]^3 \cdot MT. \tag{3.3}$$
This dependence on the third power in λh renders it much smaller than the variance of n_emp in the parameter range of interest (say, λh < 1/6, λ_c < λ). Closely related to the probability of obtaining false positives (significant outcome in the absence of excess coincidences; see section 4) is the question of how precisely n_pred can be estimated when no coincidences are injected.
Unitary Events: I. Detection and Significance
Comparing the variance of the coincidence counts when no coincidences are injected, (λh)²·MT, with equation 3.2 for λ_c = 0 suggests that above λh = 1/2, the variance of n_pred exceeds the variance of n_emp. However, at this high probability, the Poisson distribution is no longer a good approximation. Using the binomial distribution (see equation 2.7) in the argument above, it turns out that the variance of n_pred is always smaller than the variance of n_emp. The insight gained by analyzing the variances of n_pred and n_emp to understand the fluctuations of S is limited, because the two measures are not completely independent: a high spike count for one of the neurons simultaneously leads to high values for n_pred and n_emp. We present an analysis of the relation of significance level α to the percentage of false positives obtained in independent data sets in section 4. Figure 4 (bottom row) shows the joint-surprise values corresponding to the (n_pred, n_emp) pairs (top row). Without injected coincidences, the joint-surprise fluctuates around 0, independent of the rate, due to the fluctuations in n_emp and n_pred. Because we necessarily have fluctuations in n_pred, the percentage of experiments in which the coincidence count is significant may differ from the theoretical value (assuming a known n_pred) determined by S_α (see section 4). In the case of injected coincident events, the measured and expected coincidence counts deviate from each other, the more so the higher the injection rate (Figure 4B: 0.5 s⁻¹; Figure 4C: 1 s⁻¹). The joint-surprise declines hyperbolically with increasing background rate, due to the decreasing ratio (n_emp − n_pred)/n_pred. At vanishing background firing rate, the expected number of coincidences assuming independence, (λ_c h)²·MT, is practically 0, while the number of measured coincidences, (λ_c h)·MT, remains considerable. Therefore, the joint-surprise attains a large, finite value (not shown).
For the injected coincidence rate of λ_c = 0.5 s⁻¹ (Figure 4B), the joint-surprise falls below the significance level of 0.01 (horizontal line in bottom graph) at a rate of about 60 s⁻¹ (in total, 30 trials below significance level). For the injected rate of λ_c = 1.0 s⁻¹ (Figure 4C), this occurs only at a considerably higher background rate (about 100 s⁻¹). At higher firing rates, more excess coincident events are needed to escape from the statistically expected fluctuation range. Clearly, this behavior imposes a severe limit on the detectability of excess coincidences at high firing rates. Before the expectation of the joint-surprise falls below the significance threshold, the cloud of joint-surprise values obtained in the individual experiments has already reached it (in Figures 4B and 4C, at 40 s⁻¹ and 70 s⁻¹, respectively). However, there is a large regime where fluctuations in the joint-surprise are well separated from the significance threshold, and, hence, excess coincidences can reliably be detected. When injected coincidences are present, the difference between n_emp and n_pred increases linearly with T, while the width (standard deviation) of the joint-p-value Ψ increases with √T. Therefore, given enough data, excess coincidences can always be detected.
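As a concrete illustration of how a (n_emp, n_pred) pair maps to a significance value, the following sketch computes the joint-p-value Ψ as the upper tail of a Poisson distribution with mean n_pred, together with a log-odds surprise measure. The particular form S = log₁₀((1 − Ψ)/Ψ) is our assumption, modeled on Palm's surprise; the paper's exact definition is given in its section 2 and may differ in detail.

```python
import math

def joint_p_value(n_emp, n_pred):
    """Psi(n_emp | n_pred): probability of finding n_emp or more
    coincidences under a Poisson distribution with mean n_pred."""
    return 1.0 - sum(math.exp(-n_pred) * n_pred**k / math.factorial(k)
                     for k in range(n_emp))

def joint_surprise(n_emp, n_pred):
    """Log-odds of the joint-p-value (assumed form of the surprise)."""
    psi = min(max(joint_p_value(n_emp, n_pred), 1e-300), 1.0 - 1e-15)
    return math.log10((1.0 - psi) / psi)

# A count near the independence prediction is unremarkable (S near 0);
# a clear excess (e.g., 140 observed versus 44 predicted) yields a
# large positive surprise.
```

With this convention, S ≈ 0 corresponds to Ψ ≈ 0.5, and a threshold α on Ψ translates monotonically into a threshold S_α on the surprise.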
3.2 Influence of Binning. The time resolution of data acquisition in extracellular spike recordings is typically 1 ms or better. There is recent experimental evidence from cross-correlation, joint peristimulus time histogram (JPSTH), and, particularly, from spike pattern analysis, that the timing accuracy of spiking events that might be relevant for brain function can be as precise as 1-5 ms (Abeles, Bergman, et al., 1993; Riehle et al., 1997). Similar suggestions come from modeling studies (Diesmann et al., 1999). Here, we want to investigate whether, by choosing a binning grid Δ = bh (see equation 2.1) in that time range, we may be able to detect coincidences with corresponding accuracy. Therefore, we will first study the general influence of binning on the outcome of joint-surprise analysis and then address the effect of varying bin size on the detection of coincidences with a finite temporal jitter. We generated a set of simulated data as before. While the rate of the independent processes was maintained constant (20 s⁻¹), we injected additional coincident events at various rates. Two examples for coincidence rates of 0.5 s⁻¹ and 1.0 s⁻¹ are shown in Figures 5B and 5C; the control set is shown in Figure 5A. In the analysis, we gradually increased the binning grid from b = 1 to b = 10. If there were more than one spike per bin, the result was set to one (clipping). This newly generated process formed the basis of our investigation. Binning has two opposite effects on the coincidence counts: it reduces the number of time steps, T_b = T/b, while increasing the probability p_b to observe an event in a time step, compared to the original probability p = λh. The net effect of binning is therefore comparable to that of increasing the rate while reducing the number of observation time steps. Within a single analysis bin of size bh, the probability of finding exactly k of the b possible positions occupied by a spike is given by the binomial distribution.
Thus, the probability of finding one or more events, P(k ≥ 1) = 1 − P(k = 0), equals

$$p_b = \sum_{k=1}^{b} \binom{b}{k} p^k (1-p)^{b-k} = 1 - (1 - \lambda h)^b. \tag{3.4}$$
For p ≪ 1, it can be approximated by p_b = b·p. Following equation 3.1, the expectation values for the number of coincidences are now

$$n_b^{emp} = \left[ \lambda_c b h + \left( 1 - (1 - \lambda h)^b \right)^2 \right] \cdot M \cdot T_b$$

$$n_b^{pred} = \left[ 1 - \left( 1 - (\lambda_c h + \lambda h) \right)^b \right]^2 \cdot M \cdot T_b. \tag{3.5}$$
Improvements of expressions 3.5, not relevant in this context, can be made by taking into account interactions of the background spikes and the injected spikes in the binning process (Grün et al., 1999).
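Equations 3.4 and 3.5 are straightforward to evaluate; the sketch below (the helper names are ours, not the paper's) implements the clipping step and the binned occupation probability p_b, which for p ≪ 1 is close to the approximation b·p.

```python
def clip_bin(train, b):
    """Rebin a 0/1 spike train into bins of b time steps, clipping
    multiple spikes per bin to a single 1 (see text)."""
    return [1 if any(train[i:i + b]) else 0
            for i in range(0, len(train), b)]

def p_binned(p, b):
    """Occupation probability of a bin of b steps, equation 3.4."""
    return 1.0 - (1.0 - p)**b

# Clipping example: [0,1,1 | 0,0,0] -> [1, 0] for b = 3.
# For small p, p_binned(p, b) is close to b*p; for large p or b it
# saturates below 1, which is the clipping effect described above.
```

The saturation of p_b toward 1 is what bounds the binned expectation values by T/b in the discussion that follows.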
[Figure 5 appears here. Panels A-C (λ_c = 0.0, 0.5, 1.0 s⁻¹): coincidence count n (top) and joint-surprise S (bottom) versus analysis bin size b.]
Figure 5: Detection of coincident events using different analysis bin sizes bh. Simulated data from two parallel processes were analyzed for the presence of coincident events. Realizations consisting of 100 trials, each of duration 1000 ms, were generated with a time resolution of 1 ms. Rates of both processes were kept constant at 20 s⁻¹. Coincident events were injected at different rates: (A) no injected events, (B) 0.5 s⁻¹, (C) 1.0 s⁻¹. The experiment was repeated 10 times at each bin size, to visualize statistical variations. The bin width bh was varied from 1 to 10 ms. Data were analyzed for the number of coincidences (top row, diamonds) and compared to the expected number of coincidences assuming independence (top row, solid lines, theoretical curve in gray). The corresponding joint-surprise is shown in the bottom panels (diamonds, theoretical curve in gray). Further details as in Figure 4.
n_b^pred is concave with positive slope for small b, reaches a maximum, and after passing a point of inflection approaches the curve T/b from below. The latter represents an upper bound for the expectation value, reached when each bin is occupied. The initial concave increase can be observed in the simulated data for n_b^pred as well as for n_b^emp (Figure 5, top row). As in the case of increasing background rate (see Figure 4), the difference between the measured and the expected coincidence counts assuming independence decreases with increasing bin size. The effect can clearly be seen at the high coincidence rate (1 s⁻¹; Figure 5C, top), less so at the lower one (0.5 s⁻¹; Figure 5B, top). In the regime shown, binning increases the occupation probability (somewhat more strongly for n_b^pred than for n_b^emp), and clipping is not dominant yet. In Figure 5C (top), we can clearly observe the fluctuations in n_b^pred increasing with b. Here, we simply estimated n_b^pred from the binned data. However, fluctuations can be reduced by estimating firing rates at the original resolution h and using equation 3.4 to obtain the occupation probability at bin size bh. The dependence of the joint-surprise (see Figure 5, bottom row) on the bin size is similar to the above-described dependence on the rate (cf. Figure 4). In the absence of injected coincidences, S fluctuates around 0 (see Figure 5A). For injected coincidences, S decreases with increasing bin size. The lower the injection rate is, the sooner S starts to decrease and the faster it decays: for λ_c = 0.5 s⁻¹, joint-surprise values start to fall below the 0.01 significance level at b = 3 (see Figure 5B), while for λ_c = 1.0 s⁻¹, significance is maintained up to about b = 6 to 10 (see Figure 5C). Again, the decline in S is controlled by the decreasing ratio (n_b^emp − n_b^pred)/n_b^pred. The similarity between the dependences of S on spike rate and on bin size is not surprising, considering that binning has the net effect of an apparent increase in firing probability, limited by the additional effect of clipping.

3.3 Detection of Near-Coincidences. In a next step, we investigate whether it is also possible to detect noisy (i.e., imprecise) coincidences. This question arises naturally, since neurons are usually considered to exhibit some degree of "noise" or uncertainty in the timing of their action potentials. Note, however, that the degree of this temporal noise has long been questioned (e.g., Abeles, 1983) and is still under debate (e.g., Mainen & Sejnowski, 1995; Shadlen & Newsome, 1998; Diesmann et al., 1999). While keeping both the independent background rate and the injection rate constant, we increase the temporal jitter of the injected near-coincident events stepwise from 0 to 5 ms, such that in each case, the difference in spike times is uniformly distributed within the chosen jitter range.
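The jitter-injection procedure just described can be sketched as follows (function names and parameter values are illustrative, not from the paper): each injected pair consists of a spike at a random time t in one train and a partner spike at t + d in the other, with d drawn uniformly from 0..s time steps; counting coincidences on a coarser grid then recovers pairs whose members fall into a common bin.

```python
import random

def inject_jittered(steps, n_pairs, s, seed=1):
    """Two 0/1 trains containing only injected near-coincident pairs
    whose spike-time difference is uniform in 0..s time steps."""
    rng = random.Random(seed)
    a, b = [0] * steps, [0] * steps
    for _ in range(n_pairs):
        t = rng.randrange(steps - s)
        a[t] = 1
        b[t + rng.randint(0, s)] = 1
    return a, b

def binned_coincidences(a, b, width):
    """Number of analysis bins occupied in both (clipped) trains."""
    return sum(1 for i in range(0, len(a), width)
               if any(a[i:i + width]) and any(b[i:i + width]))

# With jitter s = 5 only the d = 0 pairs coincide at b = 1, while most
# pairs share a bin once the bin width roughly covers the jitter range.
```

This reproduces, in miniature, the mechanism behind Figure 6: the recovered coincidence count, and hence the joint-surprise, first grows with the bin width and then loses significance again as chance coincidences take over.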
The question is whether, by choosing an appropriate binning grid, we can improve the detection of such near-coincident events. To this end, we analyze the simulated data with varying bin sizes and for each bin size compute the joint-surprise. Figure 6 shows the results for a background rate of λ = 30 s⁻¹ and a rate of injected near-coincidences λ_c = 2 s⁻¹. Each of the curves in Figure 6A represents data with a particular temporal jitter s, analyzed with a bin size bh increasing from 1 ms to 10 ms. Values of S are averages of 100 repetitions of the simulation experiment at constant parameters. Each curve exhibits a global maximum (marked by an asterisk) at a bin size b* close to the magnitude of the jitter of the injected coincidences; b* is shown as a function of temporal jitter in Figure 6B. Indeed, the maxima occur at the bin size that in a given simulation just covers the maximal jitter (e.g., spikes with a maximal time difference of s = 1 are covered by a bin size spanning two time steps of the original time resolution: b = 2). Numerical analysis of an analytical
[Figure 6 appears here. (A) Joint-surprise S versus analysis bin size b, maxima marked by asterisks. (B) Optimal bin width b* versus temporal jitter s.]
Figure 6: Detection of near-coincidences for different degrees of coincidence precision. Two parallel spike trains were generated with background rates λ = 30 s⁻¹; the rate of the injected coincidences was λ_c = 2 s⁻¹, for T = 10⁵ time steps, h = 1 ms. The temporal jitter of coincident events was varied from s = 0 to s = 5 time steps. Each simulation was repeated 100 times, and data were analyzed for the number of observed coincidences by varying the analysis bin size from b = 1 to b = 10, and compared to the expected coincidence count assuming independence. (A) Each curve shows the resulting average joint-surprise as a function of the analysis bin size for a given temporal jitter. The top curve shows the results for s = 0, the bottom curve for s = 5, intermediate values of s in between (using the maxima as reference). Maxima of the curves are marked by an asterisk. (B) Optimal bin width b* for detecting excess coincidences as a function of temporal jitter; theoretical curve indicated by dashed line.
description of the situation (Grün et al., 1999) shows that maxima are located at b* = s + 1. The fact that in the simulation results in Figure 6B, the maxima for s = 4 and 5 occur at a larger bin size is due to fluctuations remaining in the averaged S. For bin sizes smaller than the scatter width, S increases with bin size, since for more and more near-coincidences, the constituting spikes fall into a common bin. At bin sizes larger than the coincidence accuracy, the rate at which the number of excess coincidences grows drops, and the probability that an injected coincidence is detected slowly reaches saturation. Thus, S is bound to decrease again, because the expected coincidence count assuming independence continues to grow approximately linearly. The joint-surprise curves for finite temporal jitter s approach the curve for perfect coincidences s = 0 from below. The comparison of different joint-surprise curves shows that the higher the temporal jitter (i.e., the lower the coincidence accuracy) is, the lower the joint-surprise is. Hence, for a given b, the number of near-coincidences that can be detected increases with decreasing temporal jitter.

3.4 Multiple Parallel Processes. When the number of simultaneously observed neurons N is increased, the variety of coincidence patterns grows
strongly, due to the nonlinear increase in combinatorial possibilities. Each complexity j (i.e., a spike pattern with j 1s and N − j 0s) can in principle occur in $\binom{N}{j}$ variations. On the other hand, the occurrence of higher-order constellations depends in a nonlinear fashion on the rates. The probability for a pattern with complexity j to occur among N neurons, all firing independently with probability p, is given by $p^j (1-p)^{N-j}$. For low firing probabilities (p ≪ 1), this can be approximated by $p^j$. By the combination of these two effects, constellations of high complexity are actually expected to occur rarely. For low firing probabilities, such as p = 2·10⁻², the expected count for a coincidence pattern of complexity j = 3 (assuming a total number of observation time steps T = 10⁵) is of the order of 1 or less, and even less for higher complexities (see Figure 7B, top). For higher firing probabilities, this expectation is shifted to larger values. Consider a pattern of complexity 4, with other parameters as above. The expected coincidence count now is n_pred = 0.016; the probabilities of finding 0, 1, or 2 coincidences are approximately ψ(0, n_pred) = 0.9841, ψ(1, n_pred) = 0.0157, and ψ(2, n_pred) = 0.0001. Here, the discrete nature of the Poisson distribution is fully exhibited. Almost all the mass is at a single outcome (0). The joint-p-values of outcomes 1 and 2 are Ψ(1 | n_pred) = 0.0159 and Ψ(2 | n_pred) = 0.0001, respectively. Thus, at an α-level of 0.01, the occurrence of two coincidences is already significant and would still be significant for much lower α values. If the occurrence of 2 or more coincidences than expected is significant for almost any significance level, our measure is obviously susceptible to fluctuations. The significance of the spike constellation in a particular experiment cannot be determined precisely.
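The discreteness argument is easy to reproduce numerically (the helper names below are ours): with n_pred = 0.016, nearly all probability mass sits at a count of 0, and the upper-tail probability Ψ drops far below any conventional α as soon as two coincidences are observed.

```python
import math

def poisson_pmf(k, mu):
    """psi(k, mu): Poisson probability of observing exactly k counts."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def psi_upper(n, mu):
    """Psi(n | mu): probability of n or more counts (joint-p-value)."""
    return 1.0 - sum(poisson_pmf(k, mu) for k in range(n))

def n_alpha(mu, alpha):
    """Smallest count whose upper-tail probability falls below alpha."""
    n = 0
    while psi_upper(n, mu) >= alpha:
        n += 1
    return n

mu = 0.016  # expected count for the complexity-4 pattern in the text
# poisson_pmf(0, mu) is about 0.9841, psi_upper(1, mu) about 0.0159,
# psi_upper(2, mu) about 0.0001: two coincidences are already
# significant at alpha = 0.01, and indeed far below it.
```

The same helper n_alpha also illustrates the point made in section 4.3: the tail probability at the significance boundary can be far smaller than α itself, so the realized false-positive rate is correspondingly small.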
The obvious and standard cure to this problem is to collect more data for such an experimental situation, shifting the distribution of the coincidence counts for patterns of high complexity to larger expectation values, where the discrete nature of the distribution is of less importance. As a result of the above discussion, high-j constellations, if occurring at all, are typically accompanied by high joint-surprise values (cf. Figure 7B, bottom). It is therefore not surprising that in simulations where we varied the complexity of the injected coincidence patterns from j = 2 to 6 (while keeping the number of processes (N = 6), the background rate (λ = 20 s⁻¹), and the injection rate (λ_c = 1 s⁻¹) constant), all coincidences of complexity ≥ 3 were detected with high significance (see Figure 7B, bottom). Moreover, the measured coincidence counts for j ≥ 3 are close to the expectation for the injected coincidences, λ_c h T = 100 (see Figure 7B, top). For complexity 2, the coincidence count is higher, because we get contributions from the background rate: (λh)² T = 40. This contribution vanishes rapidly for higher complexities (e.g., j = 3, (λh)³ T = 0.8). Similar results were obtained when we increased the number of independent processes N (from 2 to 12), while keeping the complexity of injected coincidences constant (j = 2, Figure 7A). Here, injection means that
[Figure 7 appears here. (A) For injected pairs (complexity 2): coincidence count n (top) and joint-surprise S (bottom) versus number of processes N. (B) For N = 6: n and S versus the complexity of the injected coincidences.]
Figure 7: Complexity of joint spike patterns. (A) The number of processes into which a pair of coincidences (complexity j = 2) was injected was varied from 2 to 12. The rate of the independent processes was λ = 20 s⁻¹, and the injection rate was 1 s⁻¹ (T = 10⁵, 10 repetitions). The diamonds in the upper graph show the number of occurrences of the coincidence patterns [110], [1100], [11000], and so on, with the number of zeros depending on the number of processes. The expected counts are shown as solid lines. The lower graph shows the joint-surprise for the corresponding pairs of measured and expected counts; the horizontal line marks the significance level of 0.01. (B) The number of processes was kept constant (N = 6), as were their rates (parameters as in A), but the complexity of the injected coincidences was varied from 2 to 6. Thus, the pattern looked for was [110000], [111000], and so on, respectively. The measured counts of the coincidence pattern are displayed as diamonds, the expected counts as solid lines. The latter values cannot be distinguished from 0 for j > 3 in this graph, because values become very small. The corresponding joint-surprise (bottom panel) was therefore very high (clipped here to an arbitrary value of 400 for visualization).
simultaneous spikes are added to j of the N parallel spike trains, without affecting the N − j remaining ones. It turns out that the joint-surprise of the j constellation (i.e., j 1s and N − j 0s) at the given firing rates is practically independent of N. There is a small decrease in the number of occurrences of this particular pattern, because with increasing N, more patterns containing the two spikes as a subpattern become available. However, this effect does not seriously affect the detectability of the injected
coincidences. The situation changes when massively parallel data are examined, and patterns of higher complexity become typical. Let the two neurons under consideration be accompanied by 98 other neurons, other parameters as above. The probability that at least 1 of the 98 neurons contributes a spike to the coincidence is $\sum_{k=1}^{98} \binom{98}{k} (\lambda h)^k (1-\lambda h)^{98-k}$, which is $1 - (1-\lambda h)^{98} \approx 0.86$. We conclude that to decide on the empirical relevance of coincidences of higher complexities (j ≥ 3), given a moderate amount of data, it is advisable to set additional criteria, for example, by requiring a minimum absolute number of occurrences (see also Abeles, Bergman, et al., 1993; Martignon, Laskey, Deco, & Vaadia, 1997; Martignon et al., 2000).

4 False Positives
Up to now, we have studied the sensitivity of our method by exploring under which conditions excess coincident events are detectable. However, while striving for high sensitivity (a low fraction of false negatives), we simultaneously need to ensure an appropriately high degree of specificity (a low fraction of false positives). Such false positives are the result of incorrectly assigning the label "excess coincidences" to an experiment where they in fact are not in excess. Thus, we have to establish conditions under which we reach a compromise between a sufficient degree of sensitivity and an acceptable degree of specificity. Therefore, we now analyze various sets of simulated data, with the combined requirement of attaining a high level (90%) of detection (only 10% false negatives), while securing a low level (10%) of false positives. As in the preceding sections, the simulations are described by biologically relevant parameters, varied over a physiologically realistic regime. 100 independent experiments were performed for each parameter value; from these, the percentage of experiments that crossed a certain threshold level on the joint-surprise was evaluated. This threshold level α was varied in equidistant steps to cover the range of joint-surprise values between −15 and +15.

4.1 Influence of the Firing Rate. In the first step, we kept the number of independent processes constant (N = 2) and varied the rate of the processes. We found that for constellations of complexity 2, the percentage of false positives is practically independent of the background rates (see Figure 8A, left). This is not surprising, because if the rates of the underlying processes were known, and therefore the expected number of coincidences assuming independence n_pred could be determined without error, α would represent the percentage of experiments passing S_α.
The above result ensures that for the parameter regime tested, determination of firing rates from the data does not cause dramatic deviations of the percentage of false positives from the theoretical level α.
By contrast, the sensitivity for detecting excess coincidences shows a clear dependence on background rates. At low rates, it is very high, but it decreases (rapidly at first, more slowly later) with increasing background rate (see Figure 8A, middle). At background rates above λ = 60 s⁻¹, the threshold for detecting the injected events has decayed to about α = 0.05. Combining these two observations in a single graph, we obtain the intersection range of the joint-surprise necessary to obtain both maximally 10% false positives and minimally 90% sensitivity (the white area in Figure 8A, right). For low α, this region is bounded by an approximately straight vertical line at α = 0.05; the lower boundary of the permissible significance measure is approximately independent of the background rate. The upper bound, however, is clearly curved: the threshold needed for reliable detection decreases with increasing background rate, reaching a level of only 0.05 at λ = 60 s⁻¹. Thus, the higher the rate is, the narrower is the bandwidth of α-values permissible to detect excess coincident events selectively and sensitively.

4.2 Influence of the Number of Parallel Processes. Next, we varied the number of independent processes (from 2 to 12) while keeping the rates constant (λ = 20 s⁻¹). For each number of processes, the fractions of false positives and false negatives were evaluated at different threshold levels. We found that the fraction of false positives increased with decreasing threshold and in the given range was independent of the number of processes involved (see Figure 8B, left). Moreover, the sensitivity for excess coincidences (shown for complexity 2 at a coincidence rate of 1 s⁻¹ in Figure 8B, middle) was independent of the number of processes as well. The intersection range of the joint-surprise, necessary to obtain maximally 10% false positives and maximally 10% false negatives, is shown in white in Figure 8B (right panel).
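The false-positive calibration of Figure 8A (left column) can be reproduced in miniature. The sketch below is illustrative (the function names and the much smaller parameter values are ours, chosen for speed): generate many pairs of independent trains, evaluate the Poisson upper-tail probability of each coincidence count against the prediction from the estimated marginals, and count how often it falls below α. Up to the discreteness of the counts, that fraction stays near, or below, α.

```python
import math, random

def psi_upper(n, mu):
    """Probability of n or more counts under a Poisson with mean mu."""
    return 1.0 - sum(math.exp(-mu) * mu**k / math.factorial(k)
                     for k in range(n))

def false_positive_fraction(n_exp, steps, p, alpha, seed=0):
    """Fraction of independent two-train experiments (no injected
    events) whose coincidence count is called significant at alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_exp):
        a = [rng.random() < p for _ in range(steps)]
        b = [rng.random() < p for _ in range(steps)]
        n_emp = sum(x and y for x, y in zip(a, b))
        n_pred = (sum(a) / steps) * (sum(b) / steps) * steps
        hits += psi_upper(n_emp, n_pred) < alpha
    return hits / n_exp
```

Because the counts are discrete, the realized fraction is often below the nominal α, consistent with the discussion of pattern complexity in section 4.3.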
Observe the wide parallel band for selective and sensitive detection, independent of the number of observed processes. If more restrictive criteria (fewer false positives and/or fewer false negatives) are adopted, the band becomes accordingly smaller (not shown here).

4.3 Influence of Pattern Complexity. Higher-order coincidences (coincidences with high complexity j) are rarely found in data with low firing rates and limited numbers of observation time steps (see also section 3.4). Figure 8C (left) illustrates that there are hardly any false positives for complexities 4 or higher, even for threshold levels of α > 0.5. Let n_α represent the smallest n for which Ψ(n_α | n_pred) < α. Because of the discrete nature of the distribution of coincidence counts ψ (see section 3.4), Ψ(n_α | n_pred) can actually be much smaller than α (see the example in section 3.4). If n_pred is exact, Ψ(n_α | n_pred) actually is the fraction of false positives expected. Therefore, the percentage of false positives can be much smaller than α. This does not contradict the fact that if such coincidences occurred, their joint-surprise would indicate high significance. False positives of lower complexity (up to
3) do show up for threshold levels α below 0.5, but their fraction decreases with pattern complexity. The situation is even more clear-cut for the case of false negatives (see Figure 8C, middle). The detection of injected coincidences is practically 100%. Thus, the intersection graph of the joint-surprise for fewer than 10% false positives and fewer than 10% false negatives (see Figure 8C, right) shows noncompliance for negative values of S_α and j < 4.

5 Discussion
We described a new method to analyze simultaneously recorded single-unit spike trains for signs that (some of) the observed neurons are engaged in a cell assembly. We adopted a widely used operational definition, defining common assembly membership on the basis of near-simultaneity of the joint spike activity of the observed neurons (Abeles, 1982a; Gerstein et al.,
1989). The simultaneous observation of spiking events from N neurons was described by the joint process of N parallel spike trains. By appropriate binning, this was transformed to an N-fold (0,1)-process, a realization of which is represented by a sequence of N-dimensional activity vectors (instances of coincidence patterns) describing the various (0,1)-constellations that occurred across the recorded neurons. Under the null hypothesis of independently firing neurons, the expected number of occurrences of any coincidence pattern and its probability distribution could be calculated analytically on the basis of the single-neuron firing rates. The degree of deviation from independence among the neurons was evaluated by comparing the theoretically expected counts with their empirical counterparts. In order to test the significance of deviations from expectation, we developed a new statistical measure: the joint-surprise. For any coincidence pattern, the joint-surprise measures the probability of finding the observed number of occurrences (or an even larger one) by chance. Those coincidence patterns that violate the null hypothesis of independence define potentially interesting occurrences of unitary joint events. The neurons that contribute a spike to the significant coincidence pattern are considered a subset of the neurons currently engaged in assembly activity. To calibrate the new method and test its performance, we applied it to simulated data sets in which different physiological and analysis parameters were varied in systematic fashion. We used independent Poisson processes to generate control data at various firing rates. The degree of interdependence of the surrogate data was controlled by injecting coincident spiking events at different rates, timing precision, and neuron composition. Specifically, we measured the sensitivity (the probability not to generate false negatives) and the specificity (the probability not to generate false positives) of the method and determined its dependence on various physiological parameters. Overall, the method proved to be both highly sensitive and highly specific in detecting the presence of even weak signs of coincident spiking. Moreover, the method is only moderately sensitive to wide-range variations of the tested parameters, largely covering the physiologically relevant regime encountered in cortical neuron recordings. Thus, unitary event analysis provides a simple measure to test for the presence of excess coincident spiking events in experimental data. Since the method takes into account the firing rates of the observed neurons, results from different experiments and/or recordings may be compared. The principal ingredient of the method is the joint-surprise. It provides a convenient measure of the probability that the number of coincident spiking events represents a chance constellation. One may stop at this point and use the resulting probabilities (e.g., by comparing them across different experimental or behavioral conditions) as a means to assess the functional relevance of synchronous spiking. Another way to proceed, explored in this article, is to adopt a common approach in statistics by imposing a threshold level on the joint-surprise function and to focus on the data where this minimum significance level (e.g., α = 0.05 or 0.01) was surpassed. In doing so, selections from the data are highlighted as potentially interesting regarding the presence of excess coincident spiking events. We referred to these events as unitary events, marking highly unexpected joint spike constellations.

Figure 8: Facing page. Selectivity and sensitivity as a function of firing rate (A), number of neurons (B), and pattern complexity (C). In the left column, the percentage of false positives (fp), that is, 1 − selectivity, is calculated using independent data sets, without injected coincident events. The percentage of false negatives (fn), that is, 1 − sensitivity, for injected events as a function of the threshold level S_α (abscissa) is shown in the middle column. The overlap regions of maximally 10% false positives and maximally 10% false negatives are indicated in white in the right column. (A) The percentage of false positives and false negatives of pair coincidences within spike data of two parallel processes. The rates of both independent processes λ were identical and increased from 10 s⁻¹ to 100 s⁻¹ in steps of 10 s⁻¹ (ordinate). In the middle and right panels, coincident events of complexity 2 were injected (1 s⁻¹) into the spike data. For each rate, the experiment was repeated 100 times, T = 10⁵, h = 1 ms. The density plots (gray-level coding as indicated) represent the percentage of experiments that crossed the threshold level S_α (fp, left column) or remained below it (fn, middle column), respectively. (B) The percentage of fp and fn for coincidences of complexity 2 for varying numbers of processes N (ordinate), increasing from 2 to 12 (firing rate held constant at 20 s⁻¹, coincidences injected at rate 1 s⁻¹ into 2 of the N processes). Display and other parameters as in A. (C) The percentage of fp and fn for coincidences of varying complexity (ordinate) in a fixed number of processes (N = 6). Display and other parameters as in A. For orientation, the threshold level for α = 0.05 is indicated by a dashed line in all plots.
Their neuronal composition, as well as the moments at which they occur, may provide information about the underlying dynamics of assembly activation. It is worthwhile to point out that our method does not allow us to distinguish on an individual spike basis which one is an excess event and which is not. Hence, all instances of a significant coincidence pattern are marked. Nevertheless, unitary events may well occur inhomogeneously distributed over the time interval studied, revealing a potentially interesting time structure in relation to the experiment that is not present in the original stationary firing rates. We have formulated the null-hypothesis in terms of statistical independence. In cases where a specific time structure within a single spike train is of interest (e.g., Legendy & Salcman, 1985; Dayhoff & Gerstein, 1983), independence is often formulated as the assumption that the neuronal spike train is a realization of a Poisson process. In cases where parallel processes are tested for spatial and/or temporal patterns (as in our case), often independent Poisson processes are assumed—that is, both independence within the spike trains and independence between them (Palm et al., 1988; Abeles & Gerstein, 1988; Aertsen et al., 1989; Prut et al., 1998). Physiological data, however, often violate the Poisson assumption, and it
Unitary Events: I. Detection and Significance
is not yet clear how to correct for that in a general (i.e., model-free) manner. One option is to make different assumptions about the nature of the underlying point process, for example, to assume that it is a renewal process (Cox & Isham, 1980) or, more specifically, a γ-process (e.g., Pauluis & Baker, 2000; Baker & Lemon, 2000). We have tested the influence of violations of the Poisson assumption on the occurrence of false positives using γ-processes (see appendix B). Results from our parametric study, where the structure of the point processes was varied from bursty to regular firing, indicate that the unitary event analysis method is quite robust against such violations. Another option, which we are currently exploring, is a bootstrap-type method, shuffling the spike trains across trials to generate surrogate data from which one can estimate the expected numbers of the various constellations quasi-empirically (Pipa, Singer, & Grün, 2001). By this procedure, the temporal structure of the individual spike trains is taken into account, and an explicit hypothesis about the generation processes need not be made. Another aspect of extending the formulation of the null-hypothesis is to take into account correlations among subsets of neurons. In this context, a promising new approach proposed recently is to extend the null-hypothesis of independence to incorporate interactions among subsets of the neurons contributing spikes to a given coincidence pattern (Martignon, von Hasseln, Grün, Aertsen, & Palm, 1995; Martignon et al., 1997, 2000). Constellations of complexity higher than 2 in independent multiple parallel processes are relatively rare. However, if they occur, they are very likely to be detected as false positives in data sets of finite length (see section 3.4; Roy et al., 2000).
In order to account for that, it is advisable to apply an additional test at a meta-level, for example, by requiring a minimal absolute number of occurrences of the high-complexity event or by applying an additional statistical test (Prut et al., 1998). Also here, bootstrap techniques may be invoked (e.g., Nadasdy, Hirase, Czurko, Csicsvari, & Buzsaki, 1999) to provide additional means to differentiate false positives from true positives in such regimes of relatively rare occurrences. Another source for false positives is the violation of the assumption of stationarity. The firing rates, measured by averaging over time (and trials) and serving as the basis to test the null-assumption, may not reflect the instantaneous behavior. Particularly in regions with a higher-than-average rate, unitary events may be detected by our method for incorrect reasons (for related problems with cross-correlation measures, see, e.g., Brody, 1999a, 1999b). Unfortunately, however, a strict requirement of stationarity may sometimes disqualify a large portion of the experimental data, especially from awake, behaving animals. Therefore, a more promising approach is to adopt techniques that enable us to make reliable estimates of the instantaneous firing rates (Nawrot, Aertsen, & Rotter, 1999; Pauluis & Baker, 2000). Giving up the concept that neuronal spiking is driven by a (potentially time-dependent) intensity function, the significance test can also be based
on counting statistics, thereby removing the problem of rate estimation from experimental data (Gütig et al., in press). The method of unitary event analysis bears clear relations to the dynamic analysis of cross-correlation, as exemplified, for example, in the JPSTH (Aertsen et al., 1989). It presents an extension, in that it enables us to analyze more than two neurons at a time, and a restriction, by focusing on coincident events only. In principle, the method could also accommodate any specific arrangement of coincidence delays unequal to zero. However, the combinatorial problems associated with exploring all possible such arrangements are beyond our present capabilities. The synfire model (Abeles, 1991) has prompted scientists to search specifically for spatiotemporal firing patterns in multiple single-neuron spike trains (Abeles & Gerstein, 1988; Abeles, Bergman, et al., 1993; Villa & Abeles, 1990; Prut et al., 1998; Date, Bienenstock, & Geman, 1998). However, in contrast to the method of Prut et al. (1998), unitary event analysis focuses on spatial patterns. Binning provides a general and straightforward mechanism to control the amount of temporal jitter allowed in the definition of a coincidence. It is applicable to N parallel processes. Unfortunately, for coincidences with large temporal jitter, sensitivity is reduced due to the fission of coincidences at the binning grid (see section 3.2; Grün et al., 1999). For N = 2, methods have been developed to detect near-coincidences without the need of binning (Grün et al., 1999; Pauluis & Baker, 2000). However, no method currently exists for N > 2. We are exploring the detection of near-coincidences in large numbers of parallel processes without discretization of time (Grün & Diesmann, 2000). One important issue remains to be solved before we can apply this framework to physiological data and study neuronal assembly dynamics in relation to stimuli and behavioral events.
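The fission of near-coincidences at the binning grid, mentioned above, can be demonstrated in a toy sketch (the helper names are invented for this illustration): two spikes only 0.2 ms apart may nevertheless fall into different 1 ms bins.

```python
def bin_spikes(times_ms, binsize_ms):
    """Discretize spike times onto a bin grid (clipping to one spike per bin)."""
    return {int(t // binsize_ms) for t in times_ms}

def coincidences(bins_a, bins_b):
    return len(bins_a & bins_b)

# two spikes only 0.2 ms apart straddle a 1 ms bin border ...
a, b = [0.9], [1.1]
print(coincidences(bin_spikes(a, 1.0), bin_spikes(b, 1.0)))  # 0: fission
# ... whereas a 2 ms grid counts them as coincident
print(coincidences(bin_spikes(a, 2.0), bin_spikes(b, 2.0)))  # 1
```

This is why a larger allowed jitter cannot simply be realized by coarser binning without cost: whatever the bin size, near-coincidences straddling a bin border are lost.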
Until now, we have considered only the case of neurons firing at a stationary rate and with stationary coincident activity among them. Physiological data, however, are usually not stationary. Firing rates vary considerably as a function of time, particularly when the animal is presented with adequate stimuli or is engaged in a behavioral task. A second type of nonstationarity is that coincident firing itself may be nonstationary, for example, by being time-locked to a stimulus or behavioral event even if the rates of the neurons are constant (Vaadia et al., 1995). Since our analysis so far derives its measures globally from the entire observation interval, the time-locked occurrence of coincidences might be overlooked. In the companion article in this issue, we address both types of nonstationarities and extend our theoretical framework accordingly.

Appendix A: Notation
T — temporal duration of observation interval, [T] = unit of time
h — time resolution of data, [h] = unit of time
T — temporal duration of observation interval in units of h, [T] = 1
M — number of trials
v_i — (0,1)-sequence of neuron i
N — number of simultaneously observed neurons
v(t) — coincidence pattern at time step t, N-dimensional vector
v_k — coincidence pattern k, N-dimensional vector
m — number of possible patterns
ξ(v_k) — complexity of v_k
n — general coincidence count
n_k^emp — empirical coincidence count of v_k
n_k^pred — expected coincidence count of v_k
P — general probability in expressions like P(k ≥ 1)
p_i — occupation probability for neuron i
P_k — probability of coincidence pattern v_k
ψ — distribution of coincidence counts
Ψ — joint-p-value
S — joint-surprise
α — significance level
λ — background firing rate, [λ] = 1/unit of time
λ_c — coincidence rate, [λ_c] = 1/unit of time
b — bin size in units of h, [b] = 1
T_b — number of time steps after binning, [T_b] = 1
p_b — occupation probability after binning
s — temporal jitter of injected coincidences in units of h, [s] = 1
MD_k — mutual dependence of v_k; see appendix C
MD(t) — time-resolved mutual dependence
Appendix B: Violation of the Assumption of Poissonian Spike Trains
In order to test how sensitive the unitary event analysis method is to a violation of the assumption of Poisson spike trains, we conducted the following experiment: Independent parallel spike trains (N = 2) were modeled as γ-processes and analyzed for the occurrence of significant coincident events—false positives (similar to section 4). A γ-process allows us to vary the spike train structure from "burstiness" to regular spiking by variation of a single parameter only: the "shape" parameter γ. γ-processes belong to the class of renewal processes and can be simulated by successively drawing interspike intervals from the interval distribution

f(t) = λ · e^{−λt} · (λt)^{γ−1} / Γ(γ).   (B.1)
For γ = 1 the spike train is Poissonian (coefficient of variation (CV) = 1). If γ is chosen < 1, the resulting spike train exhibits clusters or bursts of spikes, leading to a high variability of the interspike intervals (CV > 1). By contrast, if γ is chosen > 1, the spike train is more regular; the higher γ is, the smaller is the variability of the interspike intervals (CV < 1).
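A γ-process of this kind can be simulated with the standard library by drawing successive interspike intervals. Under the parameterization of equation B.1 the mean interval is γ/λ, and the theoretical CV of the intervals is 1/√γ. This is a minimal sketch with invented function names, not the simulation code used for Figure 9.

```python
import math
import random
import statistics

def gamma_spike_train(rate_lambda, shape_gamma, duration, seed=0):
    """Renewal gamma-process: successive interspike intervals drawn from the
    distribution of equation B.1. Python's gammavariate(alpha, beta) has
    density x**(alpha-1) * exp(-x/beta) / (beta**alpha * Gamma(alpha)),
    which matches B.1 for alpha = gamma and beta = 1/lambda."""
    rng = random.Random(seed)
    t, spikes = 0.0, []
    while True:
        t += rng.gammavariate(shape_gamma, 1.0 / rate_lambda)
        if t > duration:
            return spikes
        spikes.append(t)

def isi_cv(spikes):
    """Coefficient of variation of the interspike intervals."""
    isis = [b - a for a, b in zip(spikes, spikes[1:])]
    return statistics.stdev(isis) / statistics.mean(isis)

# gamma = 1: Poissonian (CV ~ 1); gamma < 1: bursty (CV > 1);
# gamma > 1: regular (CV < 1). Theoretical CV is 1/sqrt(gamma).
for g in (0.5, 1.0, 10.0):
    cv = isi_cv(gamma_spike_train(rate_lambda=50.0, shape_gamma=g, duration=100.0))
    print(g, round(cv, 2), round(1.0 / math.sqrt(g), 2))
```

Note that under this parameterization the realized firing rate is λ/γ rather than λ, consistent with the caption's remark that the resulting rate may differ from the given rate parameter.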
[Figure 9 appears here: (A) gray-coded matrix of false-positive percentages as a function of shape factor γ and rate λ; (B) percentage of false positives averaged over rate levels (top) and over shape parameters (bottom), for significance levels α = 0.05 and α = 0.01.]
Figure 9: False positives in non-Poissonian spike trains. Two parallel spike trains of duration T = 100 s and time resolution h = 1 ms were simulated as independent γ-processes with rate parameter λ and shape factor γ. λ was varied from 10 to 100 s⁻¹ in steps of 10 s⁻¹, and the shape factor γ was varied between 0.1 and 50: in steps of 0.1 between 0.1 and 1, in steps of 1 up to 10, and in steps of 5 above. For each parameter constellation, the simulation was repeated 1000 times; the percentage of cases showing significant outcomes at given significance levels is derived as false positives (fp). (A) The matrix illustrates the percentage of false positives in gray code as a function of shape factor (horizontal) and given rate parameter (vertical) for a significance level α = 0.01. Note that the resulting rate may differ from the given rate parameter, since for shape factors γ < 1, a relatively large number of spikes occur with interspike intervals ≤ h, which are clipped to one spike per time resolution bin in the simulation process. The larger the given rate and the smaller γ, the larger the reduction in rate (at λ = 100 s⁻¹ and γ = 0.1, about 50%). (B) Percentage of false positives as a function of γ averaged over all rate levels (top) and as a function of rate parameter λ averaged over all shape factors (bottom), displayed for various significance levels α = 0.01, 0.02, ..., 0.05.
For γ → ∞, the process approaches a clock process, with a fixed value for the interspike interval. Thus, by varying the shape factor from 0.1 to 50, we covered a wide range of variability of experimentally observed spike trains (e.g., Softky & Koch, 1993; Baker & Lemon, 2000; Nawrot, Riehle, Aertsen, & Rotter, 2000). For the significance test, the same procedure was used as introduced for the Poissonian spike trains: a Poisson distribution with its mean set to the expected number of coincidences (see equations 2.3 and 2.9). Two parameters were systematically varied in the simulations: the rate parameter λ of the processes and the shape factor γ. For each parameter constellation, the simulation of duration T = 100 s was repeated 1000 times, and the percentage of cases showing significant outcomes was derived. Figure 9A illustrates
the percentage of false positives in gray code as a function of shape factor (horizontal) and rate parameter (vertical) for a significance level α = 0.01. Observe that the percentage of false positives varies between 0% and 2%—in a range around the expected value of 1% given the applied significance level. The matrix does not appear to be clearly structured but shows a weak tendency toward higher percentages of false positives (2%) with increasing rate and shape factor. The projections (and averages) of the results onto the shape axis (see Figure 9B, top) and onto the rate axis (see Figure 9B, bottom) show that the number of false positives does not vary with increasing rate, but increases slightly with increasing shape factor; this increase is somewhat stronger for less strict significance levels (α = 0.02 ... 0.05). In summary, we conclude that the unitary event analysis method behaves quite robustly, with respect to the significance level α, against a violation of the Poisson assumption, realized here as γ-processes.

Appendix C: Mutual Dependence
A different approach to detect dependencies in parallel spike data is to use a measure related to the general framework of information theory: mutual dependence, derived from the mutual information and redundancy (for details, see Grün, 1996). For each particular activity constellation v_k, mutual dependence MD_k is defined in terms of the joint-probability P_k^pred expected under the null-hypothesis (see equation 2.3) and its empirical counterpart,

P_k^emp = n_k^emp / T,   (C.1)

by

MD_k = ln ( P_k^emp / P_k^pred ).   (C.2)

Thereby, we obtain:

if P_k^emp < P_k^pred, then MD_k < 0: "negative" dependence;
if P_k^emp = P_k^pred, then MD_k = 0: independence;   (C.3)
if P_k^emp > P_k^pred, then MD_k > 0: "positive" dependence.
Hence, any deviation of MD_k from 0 will indicate deviations from the null-hypothesis of independence for the corresponding activity constellation v_k. Note that mutual dependence is a time-averaged measure over the entire duration of the spike trains under observation. We can make this into
Figure 10: Time-resolved mutual dependence. The dot display (top) and the time-resolved mutual dependence (bottom) are shown for a simulated data set into which coincident spikes were injected (same data as in Figure 3B). In contrast to Figure 3B, spike data of the trials are concatenated and, individually for each neuron, represented as a single continuous series of dots.
a "local" time measure, however, by replacing each individual vector v_k at the points in time where it occurs, t_j^k, j ∈ {1, ..., n_k^emp}, by the associated mutual dependence value MD_k. This leads to a time-varying function, the time-resolved mutual dependence MD(t):

MD(t) = Σ_{k=1}^{m} Σ_{j=1}^{n_k^emp} MD_k · δ(t − t_j^k).   (C.4)
The delta function in equation C.4 selects the MD value corresponding to the coincidence pattern occurring at the time of interest. The resulting time series describes the individual contributions to the MD in the course of time. The result is a discontinuous, "peaky" function of time, as shown in Figure 10 for a simulated data set with injected coincident events. The amplitudes, ranging from negative to positive values, typically vary strongly from one time step to the next. Positive amplitudes tend to have higher values than negative ones and pop out from small fluctuations around 0. Positive peaks "point" to spike constellations for which the probability of occurrence is higher than expected, assuming independent processes. Negative peaks "point" to spike constellations with probability of occurrence lower than expected. Thus, the time-resolved mutual dependence may be interpreted as a "dynamic pointer" to instances of joint spike constellations representing conspicuous deviations from independence. A comparison of the performance of MD to the joint-surprise with regard to false positives and false negatives (cf. section 4) revealed that the
Figure 11: (A) Selectivity and sensitivity of the mutual dependence for pair coincidences injected into two parallel processes as a function of firing rates. (B) Results for the joint-surprise for comparison. The percentage of false positives (fp: left column), false negatives (fn: middle column), and the resulting overlap of maximum 10% fp and maximum 10% fn (right column). The thresholds on the MD were varied from −2 to 2 in equidistant steps. The dashed line indicates the MD value corresponding to S_α (α = 0.05) for background rates λ = 10 s⁻¹. Further details as in Figure 8.
MD measure, in contrast to the joint-surprise, strongly depends on the firing rates of the neurons involved (see Figure 11). This dependence expresses itself in the curved shape of the sensitivity-selectivity overlap region (the white area in Figure 11A, right). As a result, different significance thresholds would be required for different firing rates. Thus, to obtain an adequate performance of the MD, the threshold needs to be adjusted, point by point, to the associated rates. Evidently, this is impractical, considering that the observed processes typically have different individual firing rates. In addition, rate dependence poses problems when comparing data from different experiments. For the above reasons, we do not pursue this measure further here; more details and results of application to neuronal data can be found in Grün (1996).
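Equations C.1 through C.3 amount to a log-ratio of empirical to predicted pattern probabilities, which can be sketched numerically in a few lines (hypothetical function names and toy numbers; not the authors' code):

```python
import math

def mutual_dependence(n_emp, p_pred, n_bins):
    """Mutual dependence of one coincidence pattern (equations C.1 and C.2):
    MD = ln(P_emp / P_pred), with P_emp = n_emp / T."""
    p_emp = n_emp / n_bins
    return math.log(p_emp / p_pred)

# T = 100000 time steps; predicted pattern probability 4e-4, i.e., 40
# expected occurrences. The sign of MD follows equation C.3:
T, p_pred = 100_000, 4e-4
print(mutual_dependence(80, p_pred, T))  # > 0: "positive" dependence
print(mutual_dependence(40, p_pred, T))  # = 0: independence
print(mutual_dependence(20, p_pred, T))  # < 0: "negative" dependence
```

The rate dependence criticized above is visible here as well: MD measures only the ratio of probabilities, not how surprising a given excess is at the prevailing rates, so the same MD value can correspond to very different significance levels.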
Acknowledgments
We thank Moshe Abeles, George Gerstein, and Günther Palm for many stimulating discussions and help in the initial phase of the project. We also thank Robert Gütig, Stefan Rotter, and Wolf Singer for their constructive comments on an earlier version of the manuscript for this article. This work was partly supported by the DFG, BMBF, HFSP, GIF, and Minerva.

References

Abeles, M. (1982a). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1982b). Quantification, smoothing, and confidence limits for single-units' histograms. J. Neurosci. Meth., 5, 317–325.
Abeles, M. (1982c). Role of cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Abeles, M. (1983). The quantification and graphic display of correlations among three spike trains. IEEE Trans. Biomed. Eng., 30, 236–239.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70(4), 1629–1638.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol., 60(3), 909–924.
Abeles, M., Vaadia, E., Prut, Y., Haalman, I., & Slovin, H. (1993). Dynamics of neuronal interactions in the frontal cortex of behaving monkeys. Conc. Neurosci., 4(2), 131–158.
Aertsen, A., & Arndt, M. (1993). Response synchronization in the visual cortex. Curr. Op. Neurobiol., 3, 586–594.
Aertsen, A., Gerstein, G., Habib, M., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Aertsen, A., Vaadia, E., Abeles, M., Ahissar, E., Bergman, H., Karmon, B., Lavner, Y., Margalit, E., Nelken, I., & Rotter, S. (1991). Neural interactions in the frontal cortex of a behaving monkey: Signs of dependence on stimulus context and behavioral state.
J. Hirnf., 32(6), 735–743.
Ahissar, M., Ahissar, E., Bergman, H., & Vaadia, E. (1992). Encoding of sound-source location and movement: Activity of single neurons and interactions between adjacent neurons in the monkey auditory cortex. J. Neurophysiol., 67, 203–215.
Baker, S., & Lemon, R. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance level. J. Neurophysiol., 84, 1770–1780.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.
Barlow, H. (1992). Single cells versus neuronal assemblies. In A. Aertsen & V. Braitenberg (Eds.), Information processing in the cortex (pp. 169–173). Berlin: Springer-Verlag.
Brody, C. D. (1999a). Correlations without synchrony. Neural Comp., 11, 1537–1551.
Brody, C. D. (1999b). Disambiguating different covariation types. Neural Comp., 11, 1527–1535.
Cox, D. R., & Isham, V. (1980). Point processes. London: Chapman and Hall.
Date, A., Bienenstock, E., & Geman, S. (1998). On the temporal resolution of neural activity (Tech. Rep.). Providence, RI: Division of Applied Mathematics, Brown University.
Dayhoff, J. E., & Gerstein, G. L. (1983). Favored patterns in spike trains. II. Application. J. Neurophysiol., 49(6), 1349–1363.
DeCharms, R., & Merzenich, M. (1996). Primary cortical representation of sounds by the coordination of action-potential timing. Nature, 381, 610–613.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Conditions for stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitböck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Eggermont, J. J. (1992). Neural interaction in cat primary auditory cortex II. Effects of sound stimulation. J. Neurophysiol., 71, 246–270.
Engel, A. K., König, P., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. TINS, 15(6), 218–226.
Feller, W. (1968). An introduction to probability theory and its applications (Vol. 1, 3rd ed.). New York: Wiley.
Fetz, E. E. (1997). Temporal coding in neural populations. Science, 278, 1901–1902.
Georgopoulos, A. P., Taira, M., & Lukashin, A. (1993). Cognitive neurophysiology of the motor cortex. Science, 260, 47–52.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Trans. Biomed. Eng., 36, 4–14.
Grammont, F., & Riehle, A. (1999). Precise spike synchronization in monkey motor cortex involved in preparation for movement. Exp. Brain Res., 128, 118–122.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Nat. Acad. Sci. USA, 86, 1698–1702.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, significance, and interpretation. Thun: Verlag Harri Deutsch.
Grün, S., & Aertsen, A. (1998). Unitary events in non-stationary multiple-neuron activity. Society for Neuroscience Abstracts, 24, 1670.
Grün, S., Aertsen, A., Abeles, M., & Gerstein, G. (1993). Unitary events in multineuron activity: Exploring the language of cortical cell assemblies. In N. Elsner & M. Heisenberg (Eds.), Gene—Brain—Behaviour. Göttingen: Thieme Verlag.
Grün, S., Aertsen, A., Abeles, M., Gerstein, G., & Palm, G. (1994). Behavior-related neuron group activity in the cortex. In Proc. 17th Ann. Meeting European Neurosci. Assoc. Oxford: Oxford University Press.
Grün, S., & Diesmann, M. (2000). Evaluation of higher-order coincidences in multiple parallel processes. Society for Neuroscience Abstracts, 26, 2201.
Grün, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events without discretization of time. J. Neurosci. Meth., 94, 67–79.
Gütig, R., Aertsen, A., & Rotter, S. (in press). Statistical significance of coincident spikes: Count-based versus rate-based statistics. Neural Comp.
Hatsopoulos, N. G., Ojakangas, C. L., Paninski, L., & Donoghue, J. P. (1998). Information about movement direction obtained from synchronous activity of motor cortical areas. Proc. Nat. Acad. Sci. USA, 95, 15706–15711.
Hays, W. L. (1994). Statistics (5th ed.). Orlando, FL: Harcourt Brace.
Hebb, D. O. (1949). Organization of behavior: A neurophysiological theory. New York: Wiley.
Hubel, D. H., & Wiesel, T. N. (1977). Ferrier lecture: Functional architecture of macaque monkey visual cortex. Proc. R. Soc. Lond. B., 198(1130), 1–59.
Johannesma, P., Aertsen, A., van den Boogaard, H., Eggermont, J., & Epping, W. (1986). From synchrony to harmony: Ideas on the function of neural assemblies and on the interpretation of neural synchrony. In G. Palm & A. Aertsen (Eds.), Brain theory (pp. 25–47). Berlin: Springer-Verlag.
Laubach, M., Wessberg, J., & Nicolelis, M. (2000). Cortical ensemble activity increasingly predicts behaviour outcomes during learning of a motor task. Nature, 405, 567–571.
Legendy, C. R. (1975). Three principles of brain function and structure. Intern. J. Neurosci., 6, 237–254.
Legendy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of spontaneously active striate cortex neurons. J. Neurophysiol., 53(4), 926–939.
Mainen, Z. F., & Sejnowski, T. J. (1995).
Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Martignon, L., Deco, G., Laskey, K., Diamond, M., Freiwald, W., & Vaadia, E. (2000). Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural Comp., 12, 2621–2653.
Martignon, L., Laskey, K., Deco, G., & Vaadia, E. (1997). Learning exact patterns of quasi-synchronization among spiking neurons from data on multi-unit recordings. In M. Jordan & M. Mozer (Eds.), Advances in neural information processing systems, 9 (pp. 145–151). Cambridge, MA: MIT Press.
Martignon, L., von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biol. Cybern., 73, 69–81.
Murthy, V. N., & Fetz, E. E. (1992). Coherent 25- to 35-Hz oscillations in the sensorimotor cortex of awake behaving monkeys. Proc. Nat. Acad. Sci. USA, 89, 5670–5674.
Nadasdy, Z., Hirase, H., Czurko, A., Csicsvari, J., & Buzsaki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. J. Neurosci., 19(21), 9497–9507.
Nawrot, M., Aertsen, A., & Rotter, S. (1999). Single-trial estimation of neuronal firing rates. J. Neurosci. Meth., 94, 81–91.
Nawrot, M., Riehle, A., Aertsen, A., & Rotter, S. (2000). Spike count variability in motor cortical neurons. Eur. J. Neurosci., 12 (suppl. 11), 506.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341(6237), 52–54.
Nicolelis, M. A. L., Baccala, L. A., Lin, R. C. S., & Chapin, J. K. (1995). Sensorimotor encoding by synchronous neural assembly activity at multiple levels in the somatosensory system. Science, 268, 1353–1358.
Palm, G. (1981). Evidence, information and surprise. Biol. Cybern., 42, 57–68.
Palm, G. (1990). Cell assemblies as a guideline for brain research. Conc. Neurosci., 1, 133–148.
Palm, G., Aertsen, A., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biol. Cybern., 59, 1–11.
Pauluis, Q. (1999). Temporal coding in the superior colliculus. Unpublished doctoral dissertation, Université catholique de Louvain, Brussels, Belgium.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability, with application to unitary joint-event analysis. Neural Comp., 12(3), 647–669.
Perkel, D. H., & Bullock, T. H. (1968). Neural coding. Neurosci. Res. Prog. Bull., 6.
Pipa, G., Singer, W., & Grün, S. (2001). Non-parametric significance estimation for unitary events. In N. Elsner & G. M. Kreutzberg (Eds.), Proc. of the German Neuroscience Society. Stuttgart: Thieme Verlag.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Hamutal, S., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79(6), 2857–2874.
Riehle, A., Grammont, F., Diesmann, M., & Grün, S. (2000).
Dynamical changes and temporal precision of synchronized spiking activity in motor cortex during movement preparation. J. Physiol. (Paris), 94(5–6), 569–582.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Roelfsema, P., Engel, A., König, P., & Singer, W. (1996). The role of neuronal synchronization in response selection: A biologically plausible theory of structured representations in the visual cortex. J. Cogn. Neurosci., 8(6), 603–625.
Roy, A., Steinmetz, P., & Niebur, E. (2000). Rate limitations of unitary event analysis. Neural Comp., 12, 2063–2082.
Sakurai, Y. (1996). Population coding by cell assemblies—what it really is in the brain. Neuroscience Research, 26, 1–16.
Sanes, J. N., & Donoghue, J. P. (1993). Oscillations in local field potentials of the primate motor cortex during voluntary movement. Proc. Nat. Acad. Sci. USA, 90, 4470–4474.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Singer, W. (1999). Neural synchrony: A versatile code for the definition of relations. Neuron, 24, 49–65.
Singer, W., Engel, A. K., Kreiter, A. K., Munk, M. H. J., Neuenschwander, S., & Roelfsema, P. R. (1997). Neuronal assemblies: Necessity, signature and detectability. Trends in Cognitive Sciences, 1(7), 252–261.
Singer, W., & Gray, C. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. of Neurosci., 18, 555–586.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Steinmetz, P., Roy, A., Fitzgerald, P., Hsiao, S., Johnson, K., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404(6774), 187–190.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373(6514), 515–518.
Villa, A., & Abeles, M. (1990). Evidence for spatiotemporal firing patterns within the auditory thalamus of the cat. Brain Res., 509(2), 325–327.
von der Malsburg, C. (1981). The correlation theory of brain function (Int. Rep. No. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.

Received July 20, 2000; accepted April 24, 2001.
LETTER
Communicated by George Gerstein
Unitary Events in Multiple Single-Neuron Spiking Activity: II. Nonstationary Data

Sonja Grün
[email protected] Department of Neurophysiology, Max-Planck Institute for Brain Research, D-60528 Frankfurt/Main, Germany

Markus Diesmann
[email protected] Department of Nonlinear Dynamics, Max-Planck-Institut für Strömungsforschung, D-37073 Göttingen, Germany

Ad Aertsen
[email protected] Department of Neurobiology and Biophysics, Institute of Biology III, Albert-Ludwigs-University, D-79104 Freiburg, Germany

In order to detect members of a functional group (cell assembly) in simultaneously recorded neuronal spiking activity, we adopted the widely used operational definition that membership in a common assembly is expressed in near-simultaneous spike activity. Unitary event analysis, a statistical method to detect the significant occurrence of coincident spiking activity in stationary data, was recently developed (see the companion article in this issue). The technique for the detection of unitary events is based on the assumption that the underlying processes are stationary in time. This requirement, however, is usually not fulfilled in neuronal data. Here we describe a method that properly normalizes for changes of rate: unitary events by moving window analysis (UEMWA). Analysis for unitary events is performed separately in overlapping time segments by sliding a window of constant width along the data. In each window, stationarity is assumed. Performance and sensitivity are demonstrated by use of simulated spike trains of independently firing neurons, into which coincident events are inserted. If cortical neurons organize dynamically into functional groups, the occurrence of near-simultaneous spike activity should be time varying and related to behavior and stimuli. UEMWA also accounts for these potentially interesting nonstationarities and allows locating them in time. The potential of the new method is illustrated by results from multiple single-unit recordings from frontal and motor cortical areas in awake, behaving monkey.

Neural Computation 14, 81–119 (2001)
° c 2001 Massachusetts Institute of Technology
1 Introduction
In the companion article in this issue, unitary event analysis was introduced to detect a certain type of statistical dependency in the spiking activities of simultaneously recorded neurons: near-coincident spike constellations that occur more often than expected on the basis of independent firing rates. In the literature, such events are discussed as signatures of coherent cell assemblies, considered to be the building blocks of cortical processing (see the companion article for references). Unitary event analysis as described in the companion article was based on the assumption that the underlying processes are stationary. Typically, however, experimental data show modulations in firing rates. In fact, often stimuli are manipulated to enhance these "responses." In this article, we describe an extension of the stationary method, unitary events by moving window analysis (UEMWA), specifically designed to enable an application to nonstationary data. After introducing and describing the method in detail (section 2), we illustrate its performance and discuss its sensitivity using simulated spike sequences under different scenarios of nonstationarities in firing rate and nonstationarities in coincidence rate, on the same and on different timescales (section 3). Two experimental data sets from frontal and motor cortical recordings are used to illustrate the occurrence of unitary events in neuronal data and their relation to behavioral context (section 4). In section 5, we concentrate on practical aspects of the application of our new method. An assessment of the problems of false positives and false negatives is followed by guidelines for proper choice of analysis parameters.

2 Detecting Unitary Events by Moving Window Analysis
The task is to develop a method that allows the detection of unitary events in nonstationary spike data. The basic idea is to segment the data into sections over which stationarity can be assumed and to analyze the data in these sections separately by the method developed in the companion article for the stationary situation. Here, we construct a procedure in which a time window of width Tw is slid along the data, and unitary event analysis is performed separately at each window position, defining slightly different rate environments (an alternative approach is described in appendix D and discussed in section 5). The moving window has to be narrow enough such that the firing rates can be assumed to be stationary and at the same time long enough to obtain sufficient statistics. It turns out that a third criterion, the time-dependent rate of the spike coincidences to be detected, also influences the optimal choice of the analysis window size. In this section, we describe the procedure sketched above in detail and work out its statistical interpretation, using the results of the companion article.

Peristimulus time histograms (PSTHs) of spike data from motor and frontal cortical areas show that firing rates are usually sufficiently stationary (cf. section 4, Figures 6 and 7) over time windows on the order of 50 to 100 ms. However, considering a rate level of 1 to 50 s⁻¹, we expect only 0.1 to 5.0 spikes in a single window. Clearly, these numbers are too low for a statistical comparison of the numbers of expected and observed coincidences. An important assumption of the PSTH is that firing rates are stationary across trials, and, thus, the average over trials allows computing a reliable estimate of the firing rate at any point in time. Using this assumption, the set of trials performed for a particular experimental condition can be combined to overcome the problem of the low numbers of counts stated above. Figure 1 illustrates how the data of all available trials are used to construct a new process. For a time window centered at ti, the data from the M trials are concatenated to form a new set of parallel spike trains of length M · Tw. Let vj(t) be the parallel (0,1)-process (see the companion article) describing the neuronal spike data of trial j. In this article, all variables representing time are in units of the temporal resolution h of vj. In these units, the window width is an odd integer,

    Tw = 2n + 1,    n ∈ {0, 1, 2, ...}.    (2.1)

The new parallel process v on the new time axis t′ is given by

    v(t′) = vj(t)    (2.2)

with

    t′ = (j − 1) · Tw + t − (ti − (Tw − 1)/2),    (2.3)

where

    t ∈ {ti − (Tw − 1)/2, ..., ti + (Tw − 1)/2},    (2.4)

    j ∈ {1, ..., M}.    (2.5)
The stationary unitary event analysis is then performed on these new parallel spike trains. It can be summarized as follows. The average firing probabilities of the processes within the time window determine the expected number of coincidences; the empirical number of coincidences results from a counting process within the window over all trials. The significance of excess (or lacking) coincidences is evaluated by comparing the expected and empirical numbers using the joint-surprise measure, a logarithmic transform of the joint-p-value. The latter expresses the probability of observing the empirical number of coincidences by chance, under the null hypothesis of independent Poisson processes. The full data set is analyzed by successively moving the time window from one position ti to the next (usually in steps of one bin) and repeating the procedure sketched above.
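The comparison of expected and empirical counts can be sketched as follows. We use the upper tail of the Poisson distribution for the joint-p-value and the transform S = log10((1 − p)/p), Palm's surprise measure; the precise definitions are those of the companion article, and the function names here are ours:

```python
import math

def joint_p_value(n_emp, n_pred):
    """P(N >= n_emp) for Poisson N with mean n_pred: the probability of
    observing at least the empirical count under independence."""
    cdf, term = 0.0, math.exp(-n_pred)
    for k in range(n_emp):          # accumulate P(N <= n_emp - 1)
        cdf += term
        term *= n_pred / (k + 1)
    return max(1.0 - cdf, 0.0)

def joint_surprise(n_emp, n_pred):
    """Logarithmic transform of the joint-p-value; positive for excess
    coincidences, negative for lacking coincidences."""
    p = joint_p_value(n_emp, n_pred)
    p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard the logarithm
    return math.log10((1.0 - p) / p)
```

For a significance level α = 0.01, the corresponding thresholds on S are ±log10(0.99/0.01) ≈ ±2.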
Figure 1: Sketch of the moving window analysis. (A) Parallel spike trains of five neurons (spikes marked as dots) for several repetitions 1, 2, ..., M (trials) of the same experiment. A window of width Tw centered at a given point in time ti defines the segment of the data (shaded in gray) which enters the analysis at ti. (B) From the data in each such time segment, a new time axis t′ is constructed by concatenating the windows from all trials. Unitary event analysis is then performed on this new process. The full data set is analyzed by successively moving the window to the next point in time and repeating the above procedure (indicated by dashing). Typically, the window is shifted in steps of the time resolution of the spike data.
3 Dependence of Significance on Spike Rates
In this section we describe the performance of the UEMWA method under different conditions, including various time courses of the firing rates and the coincidence rate. Special emphasis will be put on the width of the analysis time window, which sets detectability limits. After introducing a common theoretical framework, we discuss stationary and nonstationary rates. We first formulate a thought experiment in which we analytically describe the variables of the processes as expectation values. This will serve as a description of the average realization. Theoretical results will be illustrated by actual realizations in the form of simulated data.

3.1 Description of the Performance Test. For convenience we will restrict our considerations to two processes, both having the same rates. The results, however, are easily extended to N > 2 processes and to differing rates. Dependencies between the two processes and their consequences on our measures are studied by injecting coincident Poisson events at a given rate level λc(t) into two independent Poisson processes with (background) rates λ(t) (for a detailed discussion, see Grün, Diesmann, Grammont, Riehle, & Aertsen, 1999). Temporal resolution h is assumed to be 1 ms. Let us now derive expectation values for the empirical number of coincidences nemp in time interval Tw and for the number of coincidences npred we expect to find on the basis of the rates. The probability of finding a coincidence at time t is λc(t)h + (λ(t)h)². Therefore, the expectation value for the empirical coincidence count in Tw is

    nemp(t) = M · Σ_τ [λc(τ)h + (λ(τ)h)²],  τ = t − (Tw − 1)/2, ..., t + (Tw − 1)/2.    (3.1)
Knowing only the rate of the processes (λc(t) + λ(t)), the expected number of coincidences in a window Tw centered at t is

    n*pred(t) = M · Σ_τ [(λc(τ) + λ(τ))h]²,  τ as in equation 3.1.    (3.2)

However, given experimental data, the firing rates are not known and have to be estimated from the data. Assuming that the rates are stationary over the duration of a time window Tw, the rate can be estimated by the ratio of the spike count and the number of time steps. (The effect of rate estimation on stationary unitary event analysis is discussed in the companion article.) Having available only an estimate of the average firing rate, equation 3.2 reads

    npred(t) = M Tw · [(1/Tw) Σ_τ (λc(τ) + λ(τ))h]²,  τ as in equation 3.1.    (3.3)
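Equations 3.1 to 3.3 are straightforward to evaluate numerically. The sketch below does so for rate profiles given as arrays with one entry per bin of width h (all names are ours):

```python
import numpy as np

def expectation_values(lam, lam_c, t, T_w, M, h=0.001):
    """Evaluate eqs. 3.1-3.3 for an analysis window of odd width T_w
    (in bins) centered at bin t, over M trials.

    lam, lam_c : background and injected coincidence rates in s^-1,
                 one entry per bin of width h (h in seconds)
    """
    half = (T_w - 1) // 2
    bg = np.asarray(lam)[t - half:t + half + 1]
    inj = np.asarray(lam_c)[t - half:t + half + 1]
    # eq. 3.1: injected coincidences plus chance coincidences
    n_emp = M * np.sum(inj * h + (bg * h) ** 2)
    # eq. 3.2: prediction if the true time-resolved rate were known
    n_pred_star = M * np.sum(((inj + bg) * h) ** 2)
    # eq. 3.3: prediction from the window-averaged rate estimate
    n_pred = M * T_w * np.mean((inj + bg) * h) ** 2
    return n_emp, n_pred_star, n_pred
```

For stationary rates the two predictions coincide; for a nonconstant rate profile, n*pred exceeds npred (Cauchy-Schwarz), so the window-averaged estimate underestimates the chance coincidences.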
Comparing equation 3.3 with equation 3.2, we have, using the assumption of stationarity, effectively exchanged the sum and the squaring. The difference between npred(t) and n*pred(t) determines the error made when nonstationary processes are analyzed with an averaging window of width Tw. (See appendix B for a parametric study of this deviation in terms of false positives.) However, conceptually, rate estimation and coincidence statistics can be separated and performed using different methods (see section 5). The following measures will be analyzed and illustrated for different time courses of the background rates λ(t) and the injected rate λc(t):

- The number of coincidences expected to occur in the course of time. Both the number of occurrences assuming independence (npred; equation 3.3) and the empirical number of coincidences (nemp; equation 3.1) will be evaluated.
- The joint-surprise S as a function of time, resulting from the comparison of npred and nemp.

The first characterizes the underlying distribution under the null hypothesis of independence, and the second describes the deviation from independence. Data and results will be displayed as sequences of three subfigures (columns in Figures 2 and 4): at the top, empirical (solid line) and expected coincidence rates (dotted line); in the middle, the number of coincidences (empirical [solid] and expected [dotted]); at the bottom, the joint-surprise (gray), including the upper and lower significance levels for α = 0.01 (dashed). All measures are shown as functions of time over the duration of the trial.

3.2 Stationary Background Rates
3.2.1 Stationary Coincidence Rate. The simplest "experimental" situation is given if both background rates and the injected coincidence rate are stationary, that is, independent of time (see Figure 2A). As shown in section C.1, for each combination of λ and λc, there is a minimal analysis window width Ta needed to detect coincidences as significant events. If the window is chosen too narrow, the number of excess coincidences is too small to deviate significantly from the expected number. In the stationary case, detection of excess coincident events can always be ensured by increasing the analysis window, since the larger the analysis window is, the more excess coincidences are detected. The examples in Figure 2A demonstrate the dependence of the significance on the size of the analysis window (columns: Tw < Ta, Tw = Ta, Tw > Ta) for a fixed combination of λ and λc. Observe that in the left column, Tw is clearly too small to detect the injected coincidences, just on the border for detection in the middle column, and clearly large enough in the right column.

Figure 2: Facing page. Stationary background rates: relevance of window size. (A) Stationary coincidence rate. Three examples of sliding window analysis for different window widths (columns from left to right, Tw = 50, 220, 400; M = 100, h = 1 ms). (Top) Coincidence rate (dotted line) and compound rates (solid line). (Middle) Number of coincidences, empirical (solid line) and expected (dotted line). (Bottom) Joint-surprise (gray curve) and upper and lower significance thresholds (dash-dot lines) for significance level α = 0.01. In the left column, Tw is smaller than the minimal window size Ta; the number of detected coincidences is not significant. In the middle column, Tw equals Ta and thus is at the border of significance. Only when Tw > Ta (right column) is the detected number of coincidences significant. All measures are independent of time because of stationarity of the underlying rates. (B) Nonstationary coincidence rate. Three examples of sliding window analysis for different window widths (columns from left to right, Tw = 30, 100, 500). Graphs as in A. In the left column, Tw is narrower than the minimal window size Tmin; the number of detected coincidences is not significant. In the middle column, Tw equals Tc and, hence, S has a triangular shape. Since Tw is larger than Tmin, significance is reached for more than one window position. In the right column, Tw > Tmax; the hot region is not detected as significant.
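The minimal window Ta can be located numerically from the expectation values of section 3.1: increase Tw until the expected empirical count becomes significant against the expected chance count. This is only a rough stand-in for the calibration graphs of appendix C; the rounding of the expected count to an integer, and all names, are ours:

```python
import math

def surprise(n_emp, n_pred):
    """log10((1 - p)/p) with p = P(N >= n_emp) under Poisson(n_pred)."""
    cdf, term = 0.0, math.exp(-n_pred)
    for k in range(n_emp):
        cdf += term
        term *= n_pred / (k + 1)
    p = min(max(1.0 - cdf, 1e-12), 1.0 - 1e-12)
    return math.log10((1.0 - p) / p)

def minimal_window(lam, lam_c, M, h=0.001, alpha=0.01, T_max=5000):
    """Smallest window width (in bins) whose expected empirical count
    is significant, for stationary rates lam, lam_c (in s^-1)."""
    threshold = math.log10((1.0 - alpha) / alpha)
    for T_w in range(1, T_max + 1):
        n_emp = M * T_w * (lam_c * h + (lam * h) ** 2)   # eq. 3.1
        n_pred = M * T_w * ((lam_c + lam) * h) ** 2      # eq. 3.3
        if surprise(round(n_emp), n_pred) > threshold:
            return T_w
    return None
```

As expected, the minimal window shrinks as more trials become available: with λ = 20 s⁻¹ and λc = 2 s⁻¹, `minimal_window(20, 2, M=100)` is roughly an order of magnitude smaller than `minimal_window(20, 2, M=10)`.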
3.2.2 Nonstationary Coincidence Rate. We now discuss the case where coincidences occur clustered in time (i.e., form "hot regions") on top of a constant background rate (see Figure 2B). In that case, the detection of excess coincidences is constrained to a range between a minimal and a maximal analysis window size. Consider a situation where we place the analysis window in the middle of a hot region of width Tc and gradually increase the width of the analysis window Tw (details are in section C.2; cf. Figure 10B). As long as Tw ≤ Tc, we face the situation of stationary injected events, discussed in the preceding paragraph (cf. Figure 10A). Hence, we need a minimum width of the analysis window (Ta) to detect the cluster of injected events. Ta depends only on the combination of λ and λc; its value can be obtained from the calibration graph for the stationary situation (see Figure 10A, bottom). When the analysis window exceeds the hot region (Tw > Tc), the total number of coincidences increases further; now, however, due only to coincidences occurring by chance on the basis of the background activity. Thus, by increasing the analysis window further, excess coincidences are averaged with independent coincidences, and at some window size, Tmax, they will no longer be detected as significant. Tmax defines the maximal window for detecting excess coincidences as significant. If Tc < Ta, the cluster cannot be detected, even with arbitrarily large analysis windows (assuming the number of trials to be fixed). Thus, in contrast to the stationary condition, the existence of a Tmin depends on the width of the hot region Tc. A cluster of excess coincidences is not detectable if its duration Tc remains below the critical time span Ta. If the cluster is detectable (Tc ≥ Ta, Tmin = Ta), it may still go undetected if the analysis window is too small, Tw < Tmin, or too wide, Tw > Tmax.
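These constraints can be reproduced in a small simulation: a cluster of coincidences injected into two otherwise independent Poisson processes, analyzed with a sliding window. The following is a simplified sketch, not the article's implementation (binned data, exact-bin coincidences, rates estimated by pooling trials within the window; all names are ours):

```python
import math
import numpy as np

def surprise(n_emp, n_pred):
    """log10((1 - p)/p) with p = P(N >= n_emp) under Poisson(n_pred)."""
    cdf, term = 0.0, math.exp(-n_pred)
    for k in range(n_emp):
        cdf += term
        term *= n_pred / (k + 1)
    p = min(max(1.0 - cdf, 1e-12), 1.0 - 1e-12)
    return math.log10((1.0 - p) / p)

def sliding_surprise(x1, x2, T_w):
    """Joint-surprise S(t) for two (M, T) arrays of binned spike data,
    analysis window of odd width T_w slid in steps of one bin."""
    M, T = x1.shape
    half = (T_w - 1) // 2
    S = np.full(T, np.nan)
    for t in range(half, T - half):
        w1 = x1[:, t - half:t + half + 1]
        w2 = x2[:, t - half:t + half + 1]
        n_emp = int(np.sum(w1 * w2))               # coincidence count
        n_pred = M * T_w * w1.mean() * w2.mean()   # expected from rates
        S[t] = surprise(n_emp, n_pred)
    return S

# 100 trials of 1000 bins (h = 1 ms), background 20 s^-1 in both
# processes, plus a 50-bin hot region of injected coincidences.
rng = np.random.default_rng(0)
M, T = 100, 1000
x1 = (rng.random((M, T)) < 0.02).astype(int)
x2 = (rng.random((M, T)) < 0.02).astype(int)
hot = (np.arange(T) >= 475) & (np.arange(T) < 525)
inj = ((rng.random((M, T)) < 0.005) & hot).astype(int)
x1, x2 = np.maximum(x1, inj), np.maximum(x2, inj)
S = sliding_surprise(x1, x2, T_w=51)
```

With these parameters, S crosses the α = 0.01 threshold (≈2) around the hot region and stays near baseline elsewhere; choosing Tw much too small or much too wide would eventually destroy the significance, as described above.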
The range of appropriate window sizes can be obtained from calibration graphs as in Figure 10B in appendix C. Examples for analysis windows that are too narrow, appropriate, and too wide are presented in Figure 2B. Let us now discuss the case where the analysis window is shifted gradually into a hot region. Once the window has overlap with the hot region, the injected coincidences contribute to the coincidence count. With increasing overlap, the number of contributed coincidences grows linearly, and the joint-surprise increases accordingly (see Figure 2B). A plateau is reached when the analysis window is completely inside the hot region (Tw < Tc) or when the hot region is completely covered by the analysis window (Tw > Tc). Further shifting eventually leads to a decrease in overlap and to a time course of the joint-surprise S that is symmetrical around the center of the hot region. The trapezoidal shape of S degenerates to a triangle in the special case Tw = Tc. The argument just given shows that S can pass the significance threshold (i.e., the plateau surpasses threshold) only if the analysis window Tw has the appropriate size Tmin ≤ Tw ≤ Tmax. The duration of the plateau is given by Tp = |Tc − Tw|. Table 1 summarizes the relationships between the observable variables Tw, Tp and the variables generating the trapezoid Tw, Tc.

Table 1: Relationships of the Plateau Duration Tp, Width of the Analysis Window Tw, and Extent of the Hot Region Tc.

               Tw < Tc         Tw = Tc     Tw > Tc
    Tp = 0     —               Tc = Tw     —
    Tp < Tw    Tc = Tw + Tp    —           Tc = Tw − Tp
    Tp ≥ Tw    Tc = Tw + Tp    —           —

Notes: Tw is compared to Tp (rows) and to Tc (columns). Table entries marked by a dash represent nonexisting (Tp, Tw, Tc) combinations. The trapezoidal shape of the joint-surprise S reaches significance if Tmin exists (Tc ≥ Ta) and Tmin ≤ Tw ≤ Tmax.

Using these relationships, one can determine the extent of the hot region Tc by measuring the size of the plateau and systematically varying the analysis window (see section C.2 for a detailed derivation). Figure 3 summarizes the possible interactions of the width of an excess interval Tc and the width of the analysis window Tw. If the significance threshold is reached at all, typically more than one window is significant. Only in the special case of Tw = Tc = Ta is the maximum of S exactly at threshold level, and for a single window only. For a given combination Tc, Tw, we can compute the minimal overlap of the two windows needed to detect the injected coincidences as significant by the use of Tmax, and can construct the extent of the region Ts in which injected coincidences are marked as significant (see section C.2); that is, the time span between the two intersections of S with the significance threshold. According to the unitary event analysis, all coincidences in a significant window are marked as "special." Thus, only in the case of minimal Ts are exactly the coincidences in the region Tc marked as special. The smaller Tw, the better the extent of the marked coincidences approximates Tc.

3.3 Nonstationary Background Rates
3.3.1 Stationary Coincidence Rates. We consider two different cases of nonstationary background rates: stepwise and gradual increase of λ(t), both in combination with a stationary rate of injected coincidences λc (see Figures 4A and 4B). A stepwise increase in rate does not lead to a discontinuous change in the coincidence counts and the significance measure, due to the smoothing effect of the moving window (see Figure 4A). On a larger timescale, npred and nemp increase parabolically because of their quadratic dependence on the background rate (see Figure 4A, middle graph). The difference between npred and nemp is constant throughout the trial. However, due to the absolute increase in background rate, the significance decreases (see the companion article for an extended discussion of this issue). A linear increase of background rate (see Figure 4B) basically gives the same result. In both examples, the injected coincidences (λc = 2 s⁻¹) do not reach significance (α = 0.01) at a background rate of λ = 50 s⁻¹. Above this rate, injected coincidences can be detected only with a larger analysis window Tw.

Figure 3: Detectability of short epochs of excess coincidences. The figure illustrates the influence of the choice of Tw on the detectability of a hot region Tc. λ and λc are constant, and Tc ≥ Ta (detectability in principle) is assumed. Each individual graph shows the time course of λc (solid) in its top part (width of analysis window Tw indicated by the dashed line) and the time course of the significance measure (solid) relative to the significance threshold (dotted line) in the bottom part. The ordinates of the significance measure are individually scaled for better visibility. The first column covers the constellations where Tw < Tc, the second Tw = Tc, and the third Tw > Tc. The rows are organized by the size of Tw relative to the interval [Tmin, Tmax] for which detection is possible. The first row depicts cases where injected coincidences are not detected as significant because Tw is outside the interval [Tmin, Tmax]. The second row shows cases where injected coincidences are just at the border of detectability, because Tw equals either Tmin or Tmax. For the case of Tw = Tc and Tw equal to one of the detection boundaries, Tmin and Tmax are identical. The third row illustrates cases where Tw is inside the interval [Tmin, Tmax]. Here, coincidences are detected as significant for a range of window positions.

3.3.2 Nonstationary Coincidence Rates. Finally, we investigate the general case where the neuronal processes have time-dependent rates and the excess coincident activity occurs in a short interval, triggered by some external or internal event. When the neuronal processes are observed over repeated trials, the coincident activity appears to some degree locked to certain points
in time. For the purpose of this article, the situation described above is our model for the composition of neuronal spike trains. Firing rates and coincidence rates vary independently and consistently over the time course of a trial. Regions of increased coincidence rate may be accompanied by elevated firing rates, by suppressed activity, or by no noticeable change in firing rate. Such data are optimally suited for UEMWA. Reproducibility over trials allows for reliable estimates of firing rates and coincidence rate in relatively narrow time windows. Consider a data set where several hot regions appear during the trial, while the firing rate is increasing with time. As in the preceding section, two types of increase (stepwise in Figure 4C and a constant slope in Figure 4D) are compared. The width of the analysis window is chosen as Tw = Tc. We again analyze the situation using the theory for the expectation values worked out in section 3.1. Figures 4C and 4D show the results of this analysis. Overall, npred and nemp are increasing over time; in addition, nemp exhibits strong peaks in the hot regions. S reflects the transients in the hot regions while staying at naught in between. This clearly demonstrates the rate-normalizing property of the joint-surprise measure. The triangular shape of the peaks is explained by the condition Tw = Tc (cf. Figure 2B, center column). Since λc is the same for each hot region, the peaks in nemp relative to baseline are of equal height (see equation 3.1). With a constant number of excess coincidences and increasing background level, the significance decreases (cf. Figure 4 in the companion article and Figure 4A here). Thus, in our example, the height of the peaks in S (see Figures 4C and 4D) decreases over time. The last hot region is just on the border of detectability. Apart from differences in the fluctuations of S, caused by the discretization inherent in the joint-surprise measure (cf.
Figure 10), the two types of rate variations are practically indistinguishable at the level of S. In the case Tw ≠ Tc (not shown here), the time course of S would exhibit plateau-like shapes around the hot regions. For the case of linearly increasing background rates, the plateaus would be oblique instead of flat. The slope of the plateau would then be comparable to the situation of stationary coincidence rates (cf. Figure 4B). Before we apply UEMWA to experimental data, we will leave the theory for the expectation values and illustrate the procedure using simulated point processes with a realistic number of repetitions. Figure 5A shows simulations of two parallel processes in repeated trials. Both spike trains are simulated as independent Poisson processes (see the companion article for details). The first one has a nonstationarity in firing rate: at a certain point in time, the firing rate rises stepwise. The second process is stationary. Clusters of coincident events were injected around two points in time. In Figure 5A, all coincidences found (irrespective of their significance, termed "raw") are marked by squares. The two clusters can clearly be seen. However, as expected, coincidences also occur outside the hot regions, more of them in the regime where the firing rate of one of the neurons is elevated. In Figure 5B, only those coincidences are marked that occur in windows
where S exceeds the significance threshold. The time course of the joint-surprise is shown in Figure 5C. As expected from our considerations above, the joint-surprise remains at baseline outside the hot regions. Thus, only coincidences in the hot regions and, because of the limited temporal resolution of UEMWA, in a small region around them are marked. The triangular shape of the peaks in S indicates that Tw was close to Tc. Only the top of the triangle is above threshold, meaning that Tw is not much larger than Tmin. Consequently, the region of marked coincidences in Figure 5B gives a good estimate of the width of the cluster Tc. The fluctuations of S in repetitions of the same experiment are illustrated in Figure 5D. The time course of the boundaries enclosing at least 70% of the realizations (dark gray curves) shows that the coincidences in the first hot region are detected with a probability exceeding 85% (already the lower boundary is above the significance threshold). Sensitivity is lower for the second hot region because of a higher background rate. The behavior of the lower boundaries in the regime of low firing rates (the analysis window being outside the hot region) exemplifies the problem of detecting a lack in the number of coincidences. The size of the analysis window and the given number of trials do not allow for the detection of lacking coincidences at a reasonable significance level, because the probability of finding no coincidences is already ≈14% (compare Aertsen & Gerstein, 1985).

4 Unitary Events in Cortical Activity
In the following, we present results from the analysis of simultaneously recorded multiple single-neuron spike trains from frontal and motor cortex in awake, behaving monkeys.

Figure 4: Facing page. Nonstationary background rates (graphs as in Figure 2; in all graphs the analysis window is Tw = 50, M = 100, h = 1 ms). Stationary coincidence rates combined with (A) stepwise increasing background rates (λ = 20, 30, 40, 50, 60 s⁻¹) and (B) continuously increasing background rates (λ = 20 to 60 s⁻¹). Coincidences are injected at rate λc = 2 s⁻¹ (dotted line). Compound rates are shown as solid curves. In A, coincidence counts (middle panel) reflect rate changes, with some smoothing due to the moving window (dotted: expected; solid: empirical). Above a certain background rate, injected coincidences are masked by coincidences expected from independent rates and are no longer significant (bottom panel, gray curve). For continuously increasing background rates (B, top panel, solid line), significance is lost for firing rates that are too high, as in A (bottom panel, gray curve). Nonstationary coincidence rates combined with (C) stepwise increasing background rates (as in A) and (D) continuously increasing background rates (as in B). Coincidences are injected in three "hot regions" (width 50 ms, λc = 2 s⁻¹). Coincidence counts (middle panel) reflect increasing rates and injected coincidences in the hot regions (solid: empirical; dotted: expected). The triangular shape of the coincidence counts results from smoothing by the moving analysis window, its width being equal to the widths of the hot regions (cf. Figure 3). In the bottom panels, the joint-surprise remains at zero in regions where no coincidences are injected; significant excursions occur in the hot regions. Typically, several consecutive windows detect coincidences as significant. In the last hot region, only a single window is significant, because the size of the hot region matches the size of the analysis window, which is the minimal window for detectability at this combination of rates. Results in C and D are comparable. Dash-dotted lines indicate the significance threshold (α = 0.01) for excess (upper) and lacking (lower) coincidences.
4.1 Motor Cortical Activity. In order to investigate the possible relation between the dynamics of neuronal interactions in the motor cortex and the behavioral reaction time (RT), a task was designed in which RT can be experimentally manipulated (Riehle, Seal, Requin, Grün, & Aertsen, 1995; Riehle, Grün, Diesmann, & Aertsen, 1997). Briefly, monkeys were trained to touch a target on a video display after a preparatory period (PP) of variable duration. To start a trial, the animal had to push down a lever. The preparatory signal (PS) was given by an open circle on the video display. After a delay of variable duration, during which the animal had to continue to press the lever, the response signal (RS) was indicated by a filling circle. Four durations of the PP, lasting 600, 900, 1200, and 1500 ms, occurred with equal probability and in random order. RT is defined as the period between the occurrence of the RS and the release of the lever, whereas movement time (MT) is defined as the period between releasing the lever and touching the screen. After training, the monkeys were prepared for multiple single-unit recording. A multielectrode microdrive (Reitböck, 1983; Mountcastle, Reitböck, Poggio, & Steinmetz, 1991) was used to insert transdurally seven independently driven microelectrodes, spaced 330 µm apart, into the primary motor cortex (MI) (for details, see Riehle et al., 1995; Riehle, Grün, Aertsen, & Requin, 1996; Riehle et al., 1997).

Figure 6 presents an example of modulation of coincident spiking activity during the preparation for movement. The first observation is that the number of coincidences marked as significant (see Figure 6C) is considerably reduced compared to the raw coincidences (see Figure 6B). Second, unitary events show a distinct timing structure, with two phases of synchronized activity: about 100 ms after PS (lasting for about 200 ms) and after ES1 (also lasting about 200 ms). The composition of unitary events within these phases is the same: for a first short period, neurons 2 and 3 are synchronized; then neuron 3 switches its partner and is successively synchronized with neuron 1. Taking into account the condition under which
Figure 5: Facing page. Simulated nonstationarities. (A) Spike times (dots) of two parallel processes, simulated for 1000 ms (h = 1 ms) over 100 trials (upper panel: process 1; lower panel: process 2; trials displayed in consecutive rows). Process 1 has a nonstationarity in firing rate at 300 ms from trial start: the firing rate increases stepwise from 20 s⁻¹ to 60 s⁻¹. Process 2 is stationary at 20 s⁻¹. Centered at 175 ms and 775 ms from trial onset, two hot regions (Tc = 50) are generated by injecting additional coincidences at rate 2 s⁻¹. All coincidences occurring in the simulation are marked by squares ("raw coincidences"). (B) Same data as in A. Here, only those coincidences are marked by squares that occur in analysis windows passing the significance threshold: unitary events (analysis parameters: Tw = 50, α = 0.01). (C) Joint-surprise corresponding to the data shown in A and B as a function of time (thick curve), representing at each instant in time the significance resulting from the analysis window centered around this point in time. Thin lines: α = 0.01 for excess (upper) and lacking (lower) coincidences. At around 175 ms and 775 ms, the joint-surprise function passes the significance level for excessive coincidences. (D) Variance of the joint-surprise function estimated from 1000 repetitions of the simulation experiment shown in A–C. Significance level indicated as in C for orientation. Gray curves represent the width of the distribution as a function of time (dark gray: minimum 70%; light gray: 95% area). In the regime where both background rates are 20 Hz, the probability of finding no coincidences is ≈14%. Therefore, no lower boundary for the minimum 95% area region can be drawn: no coincidence count exists such that the cumulative probability of lower counts is less than 2.5%. For the 70% area region, the lower boundary is at coincidence count 1. In some time steps, the probability of obtaining no coincidences exceeded 15% because of the finite number of repetitions.
S. Grün, M. Diesmann, and A. Aertsen
Figure 6: Time structure of coincident spiking activity. The three panels of dot displays show the same spike data from three simultaneously recorded neurons in the primary motor cortex of a monkey involved in a delayed-response task. (A) Spiking activity of three neurons (1, 2, 3) organized in separate displays showing 96 trials. Data are pooled from three types of trials (PP 900, 1200, and 1500 ms) and aligned on the preparatory signal (PS, vertical line at time 0). Only the first 800 ms after PS are shown; this includes the first potential end of the waiting period, ES1 (“expected signal”; vertical line at time 600 ms). In the data analyzed here, no movement instruction occurred at ES1. (B) Same spike data as in A. All “raw” coincidences are marked by squares (bin width 5 ms). (C) Same spike data as in A and B. Unitary events are marked by squares (UEMWA, window width 100 ms, α = 0.05). (Modified from Riehle et al., 1997.)
these unitary events occur, one may speculate that the occurrence of unitary events can be interpreted as activation of a cell assembly that is involved with the initiation (or reinitiation) of a waiting period.

4.2 Frontal Cortical Activity. In the second experimental study we discuss, rhesus monkeys were trained to perform a “delayed localization task” with two basic paradigms (localizing and nonlocalizing; an example of the latter is shown in Figure 7). In both paradigms, the monkey receives a sequence of two stimuli (visual and auditory) out of five possible locations. After a waiting period, a “GO” signal instructs the monkey to move its arm in the direction of the stimulus relevant in the current trial. In the localizing paradigm, the relevant spatial cue was selected by the color of the GO signal. In the nonlocalizing paradigm, an indicator light between blocks of trials informed the monkey about the reinforced direction for arm movement. Thus, in the latter case, the animal had to ignore the spatial cues given before the GO signal. In the behavioral paradigm analyzed here (nonlocalizing), neither the spatial cues before the GO signal nor the GO signal itself could be used to determine the correct behavioral response (see Vaadia, Bergman, & Abeles, 1989; Vaadia, Ahissar, Bergman, & Lavner, 1991; Aertsen et al., 1991, for further details). The activity of several (up to 16) neurons from the frontal cortex was recorded simultaneously by six microelectrodes during performance of the task. In each recording session, the microelectrodes were inserted into the cortex with interelectrode distances of 300 to 600 μm. Isolation of single units was aided by six spike sorters that could isolate activity of two or three single units, based on their spike shape (Abeles & Goldstein, 1977). The spike-sorting procedure introduced a dead time of 600 μs for the spike detection.
Using data from this study, we found that coincident activity in the frontal cortex can be specific to movement direction. We parsed the data of five neurons according to the movement direction and analyzed each of these subsets separately. Figure 7 shows the analysis results for two movement directions (A: to the left; B: to the front); for the three other movement directions, there was no significant activity. For each of the two movement directions, there is mainly one cluster of unitary events (besides some sparsely spread individual ones), occurring at the onset of the movement. The clusters of unitary events differ, however, in both their neuronal composition and their timing. During movement to the left, significant coincidences occur between neurons 6 and 9; for movement to the front, they occur between neurons 6 and 10. The timing of the unitary events differs when measured in absolute time after the GO signal (to the left: 355 ms; to the front: 400 ms); however, both occur shortly after LEAVE. Thus, unitary events appear to be locked better to the behavioral event (LEAVE) than to the external event (GO). The analysis of the same five neurons during the localizing task, where the color of the GO signal contained the information about the reinforced type of stimulus (data not shown), did not reveal any indications for unitary
Figure 7: Task dependence of coincident spiking activity. The dot displays show the spiking activity of five simultaneously recorded neurons (labeled 6 to 10) from the frontal cortex of a monkey involved in a delayed localization task (28 trials). The two columns represent two different behavioral conditions: (A) movement to the left, (B) movement to the front. Organization of the columns (A, B) is the same as in Figure 5, with a bin width of 3 ms for coincidence detection, an analysis window width of 60 ms, and a 0.01 significance level for UEMWA. Data are taken from segments starting 500 ms before and ending 700 ms after the GO signal (vertical line at time 0 ms). The top row dot displays include two behavioral events, LEAVE (monkey leaves central key) and HIT (monkey hits target), marked by diamonds and triangles, respectively. Average times of behavioral events are indicated by vertical lines labeled LEAVE (A: 329 ms; B: 364 ms) and HIT (A: 557 ms; B: 562 ms).
events related to movement direction. Note that neuron 6 participates in significant coincident activity in both movement directions, however with a different coincidence partner in each. This is indicative of a common membership of neuron 6 in two different cell assemblies, one of which is activated depending on the movement direction.

5 Discussion
We developed the unitary events analysis method to detect excess coincidences in multiple single-neuron spike trains. In the companion article, we evaluated the method for the case of stationary rates and calibrated it for physiologically relevant parameters: firing rates, coincidence rates, and number of neurons analyzed in parallel. The method was shown to be very sensitive to excess coincidences; their significance can be evaluated using the joint-surprise measure. In this article, we extended the method to incorporate nonstationary firing rates by introducing the UEMWA. This method performs the analysis for unitary events within analysis windows of fixed length, which are slid in small steps along the data, in order to follow the dynamic changes of firing rates. Within each window, we assume stationarity and apply unitary event analysis as in the stationary case. The resulting time-dependent joint-surprise function provides a convenient measure for the probability that the number of coincident spiking events in a certain observation interval represents a chance event. By imposing a threshold level on the joint-surprise function, certain time segments of the data are highlighted as potentially interesting regarding the presence of excess coincident spiking events, referred to as unitary events. Their neuronal composition, as well as the moments at which they occur, may give us information about the underlying dynamics of assembly activation.

5.1 Appropriate Size of Analysis Window. The width of the moving window is clearly an important parameter and may be adjusted according to the data. In the calibration study described in this article (see section 3), we analyzed a model in which excess coincidences were injected into independent background activity. In a first step (see section 3.2.1), we studied the sensitivity under stationary conditions and in dependence of background and coincidence rate levels, using analytical descriptions for coincidence counts.
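As an illustration, the core of the sliding-window procedure can be sketched in a few lines of Python. The joint-surprise form used below (the log10 odds of the Poisson tail probability) follows our reading of the definition in the companion article; the data layout (trials as lists of 0/1 bins) and all function names are our own assumptions, not the authors' implementation.

```python
import math

def poisson_sf(n, mu):
    """P(N >= n) for a Poisson count with mean mu."""
    if n <= 0:
        return 1.0
    term = math.exp(-mu)
    cdf = term
    for k in range(1, n):
        term *= mu / k
        cdf += term
    return min(max(1.0 - cdf, 1e-300), 1.0)

def joint_surprise(n_emp, n_pred):
    """S = log10((1 - P) / P), with P = P(N >= n_emp) under Poisson(n_pred)."""
    p = min(max(poisson_sf(n_emp, n_pred), 1e-300), 1.0 - 1e-12)
    return math.log10((1.0 - p) / p)

def uemwa(spikes1, spikes2, t_win, alpha=0.01):
    """Slide a t_win-bin analysis window along two binned spike trains.

    spikes1, spikes2: lists of trials; each trial is a list of 0/1 bins.
    Returns (window_center, S, significant) for every window position;
    rates are estimated per window, i.e. assumed stationary inside it.
    """
    m = len(spikes1)
    n_bins = len(spikes1[0])
    s_crit = math.log10((1.0 - alpha) / alpha)
    out = []
    for start in range(n_bins - t_win + 1):
        c1 = sum(sum(tr[start:start + t_win]) for tr in spikes1)
        c2 = sum(sum(tr[start:start + t_win]) for tr in spikes2)
        p1, p2 = c1 / (m * t_win), c2 / (m * t_win)
        n_pred = p1 * p2 * t_win * m          # expected chance coincidences
        n_emp = sum(sum(a & b for a, b in zip(t1[start:start + t_win],
                                              t2[start:start + t_win]))
                    for t1, t2 in zip(spikes1, spikes2))
        s = joint_surprise(n_emp, n_pred)
        out.append((start + t_win // 2, s, s > s_crit))
    return out
```

Coincidences here are exact per-bin matches between two neurons only; the treatment of wider coincidence widths and more than two neurons is omitted from this sketch.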
For a given rate constellation, an increase of the analysis window leads to a linear increase of the coincidence count. In order to reach significance, a certain level of excess coincidences needs to be present to stand out from the chance coincidences due to background activity. This requires a minimum size of the analysis window, specific for the given background and coincidence rates. The larger the coincidence rate, the smaller the minimal window size; by contrast, the larger the background rate, the larger the minimal window size.
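This trade-off can be made concrete with a small numeric sketch (our own illustration, using expected-count expressions of the form (λh)²·Tw·M as in appendix B; all parameter values are arbitrary):

```python
import math

def poisson_sf(n, mu):
    """P(N >= n) for a Poisson count with mean mu."""
    if n <= 0:
        return 1.0
    term = math.exp(-mu)
    cdf = term
    for k in range(1, n):
        term *= mu / k
        cdf += term
    return max(1.0 - cdf, 0.0)

def n_alpha(mu, alpha):
    """Smallest count n with P(N >= n | mu) < alpha."""
    n = 0
    while poisson_sf(n, mu) >= alpha:
        n += 1
    return n

def minimal_window(rate_bg, rate_c, h, m_trials, alpha=0.01, max_tw=5000):
    """Smallest window (in bins) whose expected total coincidence count
    reaches the count needed for significance at level alpha."""
    for tw in range(1, max_tw + 1):
        n_chance = (rate_bg * h) ** 2 * tw * m_trials    # chance coincidences
        n_total = n_chance + rate_c * h * tw * m_trials  # plus injected ones
        if n_total >= n_alpha(n_chance, alpha):
            return tw
    return None

# rates in 1/s, bin h = 1 ms, 100 trials:
print(minimal_window(20.0, 2.0, 0.001, 100))   # baseline
print(minimal_window(20.0, 5.0, 0.001, 100))   # higher lc -> smaller window
print(minimal_window(60.0, 2.0, 0.001, 100))   # higher background -> larger window
```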
In a second step (in section 3.2.2), we explored the detectability of nonstationary coincidence rates. We studied the case that excess coincidences occurred only within a restricted time interval (“hot regions”), motivated by experimental observations (Riehle et al., 1997). Such hot regions may be the result of loose time-locking of synchronous spiking to an (external) trigger event. By studying the detectability in the symmetrical case (the analysis window is centered in the hot region), we found that there is not only a minimal window size, as discussed for the stationary condition, but also a maximal window size. By increasing the analysis window (starting from a window size smaller than the width of the hot region), more and more excess coincidences are “seen,” which, upon reaching the minimal window size, are detected as unitary events. If the analysis window covers exactly the hot region, all injected coincidences are detected, and maximal detectability is reached; that is, the joint surprise reaches its maximum. When further increasing the window, the number of excess coincidences no longer grows; however, the contribution of chance coincidences will increase, leading to a decrease of detectability until the joint surprise finally drops below significance. Thus, if the analysis window is too large or too small, the hot region is not detected, although it would be detectable with the appropriate choice of analysis time window. Using the results from the symmetrical case, we analyzed the situation when the analysis window is gradually shifted into the hot region. Once the two start to overlap, the injected coincidences contribute to the coincidence count. With increasing overlap, the joint surprise increases accordingly, until it reaches its maximum at maximal overlap. When the analysis window leaves the hot region, the overlap decreases again, and so does the joint surprise.
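The rise-and-fall behavior just described can be reproduced with a toy calculation (our own sketch; the expected-count expressions follow appendix B, the joint-surprise form is the Poisson-tail odds as we read it from the companion article, and all parameter values are invented):

```python
import math

def poisson_sf(n, mu):
    """P(N >= n) for a Poisson count with mean mu."""
    if n <= 0:
        return 1.0
    term = math.exp(-mu)
    cdf = term
    for k in range(1, n):
        term *= mu / k
        cdf += term
    return min(max(1.0 - cdf, 1e-300), 1.0)

def expected_surprise(t_win, t_hot, rate_bg, rate_c, h, m_trials):
    """Joint surprise expected for a window centered on a hot region of
    t_hot bins, as a function of the analysis window size t_win (bins)."""
    overlap = min(t_win, t_hot)                        # hot-region bins covered
    n_chance = (rate_bg * h) ** 2 * t_win * m_trials   # grows with t_win
    n_excess = rate_c * h * overlap * m_trials         # saturates at t_win = t_hot
    p = min(max(poisson_sf(round(n_chance + n_excess), n_chance), 1e-300),
            1.0 - 1e-12)
    return math.log10((1.0 - p) / p)

# 50-bin hot region, background 20/s, coincidence rate 2/s, h = 1 ms, 100 trials:
# the surprise rises with t_win up to the hot-region width and decays beyond it.
for t_win in (10, 25, 50, 200, 1000):
    print(t_win, round(expected_surprise(t_win, 50, 20.0, 2.0, 0.001, 100), 2))
```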
For a symmetrical shape of the hot region (including spike densities and coincidence densities, as is assumed here), the joint surprise is symmetrical around the center of the hot region. Injected coincidences are detected as unitary events if the size of the analysis window is between Tmin and Tmax. Only in the special case that Tw is equal to or narrower than the width of the hot region and the joint surprise just reaches threshold at its peak are unitary events restricted to the extent of the hot region. Generally, however, the epoch over which unitary events are detected does not coincide with the extent of the hot region: it may be narrower but also wider, depending on the size of the various windows. Figure 3 summarizes the possible combinations. The time course of the joint-surprise function indicates how the width of the analysis window can be optimized. From the extent of the plateau, we can derive the width of the hot region. Table 1 summarizes the necessary relationships. By shifting the time window Tw in smaller time steps than the width of the analysis window (usually we shift by one bin), we introduce dependencies between the time windows, since the analysis is applied to partly overlapping time segments. When shifting the window in single-bin steps, each single joint-event will be considered in Tw analyses, although in slightly
differing contexts. As a result, a single event may be evaluated with differing significance values in the various analyses. Assuming continuity of the processes, the significance does not change drastically from one window to the next. One possibility for dealing with these dependencies is to give “bonus points” each time a joint-event surpasses a certain significance level. Thus, an individual spike constellation would collect bonus points as the window is shifted along the data, indicating that its accumulated significance “counts.” This procedure would lead to a gradual evaluation of the “unitarity” of those events. The decision for the final selection of unitary events could be based on a threshold on bonus points. For reasons of simplicity, however, we chose a simpler version: we define an event as unitary once it fulfills a certain significance criterion (usually α = 0.05 or 0.01) in at least one window. In terms of bonus points, this implies a selection on the basis of a threshold set to 1. A practical evaluation of the performance of a more elaborate bonus-point rule is currently under study. Nonstationary coincidence rates (e.g., a hot region) may be the result of loose locking of assembly activation to an (external) trigger event. In this situation, the optimal window for UEMWA is determined by the degree of temporal locking. In physiological data, however, several internal triggers that we do not know of may lead to hot regions of different temporal widths, which cannot optimally be captured by one sliding-window size. Thus, an interesting perspective would be to develop an algorithm that dynamically adapts the width of the analysis window to the varying width of the hot regions.

5.2 An Alternative Method of Unitary Event Detection: Cluster Analysis. UEMWA deals with nonstationary firing rates by sliding an analysis
window that is narrow enough to obtain firing rates that are approximately stationary over the extent of the window for all positions along the trial. There is a second approach to obtain segments of data with joint-stationary rates, based on cluster analysis. In the following, we discuss this option briefly, because some cortical data indeed exhibit joint rate states and the approach has interesting relationships to other methods of analyzing multiple single-neuron spiking activity (e.g., hidden Markov models (HMM); Abeles et al., 1995; Seidemann, Meilijson, Abeles, Bergman, & Vaadia, 1996; Gat, Tishby, & Abeles, 1997). The idea is to segment the data into (exclusive) joint-stationary subintervals, using a standard cluster algorithm (e.g., Hartigan, 1975). Subsequently, the data are analyzed for unitary events in the time segments defined by the joint stationary rate states. We call this method the unitary event by cluster analysis method (UECA) (a detailed description is given in appendix D). In this method, the width of the analysis window is defined by the covariations of the firing rates of single neurons. However, from our theoretical results about the detectability of hot regions, we know that the optimal width of the analysis window is given by the width of the hot region or, in more general terms, the time course of the coincidence rate. Thus,
a clustering approach is useful if the occurrence of excess coincidences is connected to the rate state. This, however, is not in agreement with experimental data (Riehle et al., 1997). Moreover, in UECA, the data are analyzed in exclusive regions. This implies transitions in significance from one time step to the next. Furthermore, if a segment is detected by UECA as significant, that entire segment is marked as containing unitary events, even if a hot region covers only part of the segment. In the case of UEMWA, the analysis window positions are independent of rate transitions. They cover the entire data set step by step, resulting in a measure that is at the same time more localized (only data from a connected time interval enter the analysis) and smoother (a single event is weighed in several consecutive windows) than UECA. Taken together, for the experimental data we have analyzed so far, UEMWA provides a more differentiated picture of the presence of unitary events. We cannot exclude, however, that experimental settings may arise in which UECA, with its variable window size and the property that the counting statistics are not limited by the local window size, might be a promising alternative.

5.3 False Positives. Various sources of false positives can be distinguished. Here, we discuss only those that arise in direct connection to the nonstationarity extension presented in this article. For a more general discussion of the topic of false positives in unitary event analysis, we refer to the companion article. First, there are sources of false positives specific to the moving window analysis. The significance level α, which we demand for events to be qualified as unitary, implies by definition a certain number of false positives. For example, if α is set to 0.01, we expect in 1% of the experiments a detection of significant events by chance. In the case of UEMWA, we in fact undertake many such experiments by analyzing step by step successive parts of a single data set.
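The implied multiple-comparison bookkeeping is simple enough to state as a one-liner (a rough sketch under the simplifying assumption that non-overlapping windows act as independent tests):

```python
def expected_false_positive_windows(total_bins, t_win, alpha):
    """Rough estimate of chance threshold crossings: the number of
    non-overlapping windows fitting into the data, times the
    significance level alpha."""
    return (total_bins // t_win) * alpha

# 1000 bins of data, 50-bin window, alpha = 0.01:
# 20 non-overlapping windows, so about 0.2 windows are expected
# to cross the threshold by chance over the whole data set.
```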
However, these experiments are not independent, due to the overlap of the time windows. For a rough estimate of the number of windows expected to give rise to false positives, one has to calculate the number of nonoverlapping windows that would fit within the total length of the data segment and take a fraction α of it. In our calibration experiments on simulated data, accidental crossings of the significance threshold were extremely rare: typically one window per data set at most, and mostly none. As a rule of thumb, we have developed the criterion that only those cases in which the number of windows passing the significance threshold is clearly larger than this lower bound are considered potentially interesting. A more systematic treatment of this issue is under development. Important sources of false positives are nonstationarities of different flavors that give rise to significant results, although the processes observed actually do not violate the null hypothesis of independence. The obvious source is a remaining nonstationarity of rate in the analysis window. In appendix B, we show that unitary event analysis is robust against moderate
violations of the assumption of rate stationarity and quantify the effect on the number of false positives. Specific types of nonstationarity across trials discussed in the literature are variations of “excitability” and of “latency variability” (e.g., Brody, 1999a, 1999b). Knowledge (or “educated guesses”) of the type of nonstationarity in the data is important, because it may allow for a compensation of its effects in the analysis. Variation of excitability describes a nonstationarity in which the time course of the firing rate is identical in each trial, although the amplitude is modulated (e.g., Arieli, Sterkin, Grinvald, & Aertsen, 1996). Latency variability describes a nonstationarity in which shape and amplitude of the firing rate are identical in each trial, but the position in time shifts from trial to trial. In both cases, the firing-rate estimation obtained by averaging across trials, as performed in the PSTH and the unitary event analysis, leads to a value that is not representative for a single trial. In the case of excitability variations, the rate is underestimated for some trials and overestimated for others. In the case of latency variability, the rate estimate will generally present a blurred picture of the rate dynamics in an individual trial, due to the convolution with the latency distribution. Thus, “misalignment” of trials may lead to falsely detected unitary events (see the example in Figure 8A, bottom panel). A solution to this problem is to realign the data to an external or behavioral event to which the single-trial rate functions of the observed neurons have a more proper locking. In case no such events are available, one can try to find a consistent realignment directly based on the single-trial rate functions themselves (as shown in Nawrot, Rotter, Riehle, & Aertsen, 1999; Nawrot, Aertsen, & Rotter, 1999; see also Baker & Gerstein, 2000).
Figure 8 illustrates how proper realignment of trials is able to discard false positives (unitary events after RS in A). Note that in order to maintain a common reference time frame, the same realignment should be performed for all neurons under consideration. This implies, however, that when the latency variabilities of (some of) the observed neurons do not co-vary, such joint realignment is not possible (Nawrot, Rotter, et al., 1999). In such a case, one needs an alternative method to estimate the firing probabilities and the associated coincidence expectancy on the basis of single-trial data. Several such methods have been proposed recently, including convolution-based methods (Nawrot, Rotter, et al., 1999; Nawrot, Aertsen, & Rotter, 1999) and inverse-interval-based methods (Nawrot, Rotter, & Aertsen, 1997; Pauluis & Baker, 2000). The incorporation of these methods into the unitary event analysis is in progress. To ultimately demonstrate the consistent occurrence of unitary events, one may apply an additional test on a meta-level. This can be done by relating the unitary events to behavioral or external events or by an additional statistical test (Prut et al., 1998). An example where unitary events occurred in relation to behaviorally relevant events is shown in Figure 6. On the basis of a meta-analysis over many data sets, Riehle et al. (1997) demonstrated that unitary events occur in relation to stimulus and expected events.
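Realignment itself amounts to re-referencing each trial's spike times to its own behavioral event. A minimal sketch (our own toy example; times in ms, data invented):

```python
def realign(trials, event_per_trial):
    """Shift each trial's spike times so that its behavioral event
    (e.g. movement onset) sits at time 0; the same shift must be applied
    to all simultaneously recorded neurons of that trial."""
    return [[t - ev for t in trial] for trial, ev in zip(trials, event_per_trial)]

# Two trials originally referenced to RS (at 0), with movement onset
# (MVT) at 310 ms and 420 ms; the movement-locked spikes line up at
# -10 and +10 ms only after realignment to MVT.
trials = [[100, 300, 320], [150, 410, 430]]
mvt = [310, 420]
print(realign(trials, mvt))   # [[-210, -10, 10], [-270, -10, 10]]
```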
Figure 8: False positives due to nonstationarity across trials. (A, B) Identical spiking activities of two neurons (4, 7) recorded from primary motor cortex while the monkey performed a delayed reaching task (1 direction out of 6; Grammont & Riehle, 1999). Columnar organization of the dot displays is the same as in Figure 6. In A, trials are aligned to the reaction signal (RS). In B, trials are aligned to the onset of movement (MVT). Trials are sorted according to increasing reaction time (time difference between RS and MVT) in both cases. All external and behavioral events are marked by filled circles. The events at the beginning of the data segments are the GO signals; the events late in the trials are the ends of movements. The unitary events in A about 250 ms before RS correspond to the ones in B at about 500 ms before MVT. Note, however, that the unitary events in A after the RS have vanished in B. These unitary events were due to nonstationarity across trials: an abrupt decrease in firing shortly before and locked to MVT was dispersed to a different position in each trial by the incorrect alignment to RS. This misalignment of rate functions is removed in B, thereby discarding the falsely detected unitary events.
5.4 Guidelines for Application to Experimental Data. Based on the foregoing, we suggest the following procedure for unitary events analysis. First, inspect the data critically for across-trial nonstationarities (e.g., excitability and/or latency variability) by the use of the raster displays. For checking excitability variations, one may use the number of spikes per trial as a rough estimate. If necessary, eliminate the outlier trials. Latency variability has to be eliminated by realignment of the trials, either by aligning the data to another behavioral event or by estimating the instantaneous rates per single trial and realigning the data by matching the shape of the rate functions (Nawrot, Aertsen, & Rotter, 1999). Since generally we do not have prior knowledge about the time structure of the coincident events, we may get some qualitative information about their composition, their distribution in time, and whether there are obvious deviations from the firing rates by inspecting the raw coincident events. The appropriate size of the analysis window cannot be known in advance. It has to be adjusted according to two aspects: rate changes and coincidence rate changes. In order to capture rate changes such that stationarity of rate within a window can be assumed, the window has to be adjusted to the timescale of the rate dynamics. Since we have no prior knowledge about the timescales of the coincidence rate dynamics, we have to scan the data with different window sizes. If we have an indication for a specific time structure from the raw data, we may start with a size in that range. In general, however, we found it best to start with a narrow analysis window and gradually increase its size. If the underlying time structure of coincident firing is clustered, the unitary events will appear at a certain minimum size of the analysis window and will be detected over a certain range of window widths beyond it. If the detected time structure is stable but only broadens due to the increased size of the analysis window, a hot region is detected. For an indication of the best analysis window size, the shape of the joint-surprise function may be used. Plateaus indicate a window size that is either too small or too large. A peaky shape indicates that the optimal window size for a hot region has been found. False positives may be identified by evaluating whether the structure is stable for different alignments to external events.
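The scanning strategy can be organized as a small driver routine (a sketch; `analysis` stands for whatever sliding-window significance computation is in use and is a hypothetical hook, not the authors' code):

```python
def scan_window_sizes(analysis, widths, s_crit=2.0):
    """Run the sliding-window analysis for a range of window widths
    (guideline: start narrow, then widen) and collect, per width, the
    window centers whose joint surprise passes the threshold s_crit.

    analysis(width) -> iterable of (center, joint_surprise) pairs.
    """
    detected = {}
    for w in sorted(widths):
        centers = [c for c, s in analysis(w) if s > s_crit]
        if centers:
            detected[w] = centers
    return detected

# A clustered time structure shows up from some minimum width onward and
# stays put; centers that merely broaden with the width suggest a hot region.
```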
We are aware that this is time-consuming. In further work, we intend to develop additional statistical tests on a meta-level, to filter out false positives automatically. However, to improve the analysis method further, we need to gain experience from applications of the unitary events analysis method to physiological data. In addition, we have found it useful to compare the outcome of our analyses to the results of other techniques (e.g., cross-correlation or the joint peristimulus time histogram, JPSTH). An additional parameter is the coincidence width. In order to determine the coincidence width of the experimental data, additional manipulations may be performed, for example, changing the bin width before analyzing the data for unitary events (see the companion article) or applying the multiple shift method (Grün et al., 1999; Riehle, Grammont, Diesmann, & Grün, 2000).

5.5 Unitary Events in Cortical Activity. We have analyzed simultaneously recorded multiple single-neuron spike trains from frontal and motor cortices in awake, behaving monkeys for the occurrence of significant coincident events. Our findings indicate that highly precise (1–5 ms) unitary events occur in these data. Their joint surprise may be well above 2; that is, they are statistically highly significant. The composition and frequency of such patterns appear to be related to behavioral parameters. These results, together with results from other multi-neuron studies, are interpreted as expressions of cell assembly activity (for reviews, see Gerstein, Bedenbaugh, & Aertsen, 1989; Singer et al., 1997; Singer, 1999). The composition of unitary events is interpreted to reflect common membership in a cell assembly. If the composition of the unitary events changes depending on the stimulus or the behavioral conditions, a different group and, hence, a different cell assembly may be activated in relation to the external event. In the example in Figure 7, we showed that the occurrence of unitary events was locked to the onset of the movement, but their composition was different for the different movement directions. Similar findings were made in visual and frontal areas in cross-correlation studies, when neurons were found to be correlated for one stimulus or behavioral condition but not for another (e.g., Vaadia et al., 1991; Aertsen et al., 1991; Vaadia et al., 1995; Freiwald, Kreiter, & Singer, 1995; Kreiter & Singer, 1996; Fries, Roelfsema, Engel, König, & Singer, 1997; Castelo-Branco, Goebel, Neuenschwander, & Singer, 2000). Moreover, JPSTH results from frontal cortex show that the correlation between two neurons may dynamically change depending on the behavioral context, suggesting that the neurons rapidly change their associations into different functional groups (Aertsen et al., 1991; Vaadia et al., 1995).
We demonstrated that unitary events show a marked increase in temporal structure as compared to the spiking events of the participating neurons, including cases where the single neurons did not show any discernible response, as judged from the absence of systematic modulations of their firing rates. This may indicate that neuronal computation uses different kinds of timescales, usually referred to as rate coding and temporal coding (see also Abeles, 1982b; Neven & Aertsen, 1992; Koenig, Engel, & Singer, 1996; Riehle et al., 1997; Shadlen & Newsome, 1998). We have begun to investigate whether and how these concepts are implemented in the cortical network. It would seem possible that both coding mechanisms, rate coding and precise time coding, are used in the brain and that, depending on the cortical area, either one might dominate (see Vaadia & Aertsen, 1992, for a detailed discussion of this issue). We hope that the unitary event method presented here may help to decipher the mechanisms of neuronal information processing in the brain.

Appendix A: Notation
h       time resolution of the data, [h] = unit of time
Tw      size of the moving window, integer valued, odd
M       number of trials
λ       background firing rate, [λ] = 1/unit of time
λc      coincidence rate, [λc] = 1/unit of time
npred   expected coincidence count
nemp    empirical coincidence count
S       joint surprise
α       significance level
nα      number of coincidences needed for significance
Tc      duration of the “hot region”
Tα      minimum analysis window width required by nα
Tmin    minimum analysis window width to reach significance
Tmax    maximum analysis window width to still reach significance
Tp      duration of the plateau of the joint-surprise function
Ts      duration of the interval showing unitary events
fmin    minimal overlap of analysis window Tw and hot region for detection of unitary events in the sliding situation
All capital “T” variables specify time intervals in units of h, [T] = 1.

Appendix B: False Positives Induced by Nonstationarity
In order to estimate the error made by assuming stationarity of rates within the analysis window, we consider the worst-case scenario (with respect to the mean count). It can be shown that the maximal error results if the rate changes in stepwise fashion (in comparison to, say, a linear change of rate), if both neurons change their rates in parallel, and if the analysis window is centered at the time of the rate change. Thus, in the following we consider the situation where two neurons change their rate at the same time from the same rate level λ1 to λ2. The analysis window is centered at the rate change; that is, the window Tw is divided into two regions of duration Tw/2, where the rates are stationary at level λ1 and λ2, respectively. The equations used in the following correspond to equations 3.3 and 3.2, however here with λc = 0 s⁻¹, and adjusted to the special case sketched here. The mean rate within Tw is

    \bar{\lambda} = \frac{\lambda_1 \cdot \frac{T_w}{2} + \lambda_2 \cdot \frac{T_w}{2}}{T_w}.    (B.1)
Using the averaged rate, the number of expected coincidences is (cf. equation 3.3)

    \tilde{n} = (\bar{\lambda} \cdot h)^2 \cdot T_w \cdot M.    (B.2)
The exact number of expected coincidences is given by calculating the number of coincidences in each time segment separately and taking the sum of the two:

    n^{*} = \left( (\lambda_1 \cdot h)^2 + (\lambda_2 \cdot h)^2 \right) \cdot \frac{T_w}{2} \cdot M,    (B.3)
expressing a special case of equation 3.2. The error in the number of coincidences is

ñ − n* = −(1/4) · Tw · M · h² · (λ1 − λ2)².  (B.4)
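The derivation above can be checked numerically. In this sketch, the rate levels, bin size, window width, and trial number are illustrative assumptions; the trial factor M enters through equations B.2 and B.3.

```python
# Numerical check of equations B.1-B.4 (a sketch; all parameter values
# below are illustrative assumptions, not taken from the article).

def coincidence_error(l1, l2, h, Tw, M):
    """Return (n_approx, n_exact, error) for a stepwise rate change."""
    l_bar = (l1 + l2) / 2.0                                    # equation B.1
    n_approx = (l_bar * h) ** 2 * Tw * M                       # equation B.2
    n_exact = ((l1 * h) ** 2 + (l2 * h) ** 2) * Tw / 2.0 * M   # equation B.3
    return n_approx, n_exact, n_approx - n_exact

# Example: rates 10 s^-1 and 30 s^-1, 1 ms bins, Tw = 500 bins, M = 100 trials
n_approx, n_exact, err = coincidence_error(10.0, 30.0, 0.001, 500, 100)
predicted = -0.25 * 500 * 100 * 0.001 ** 2 * (10.0 - 30.0) ** 2  # equation B.4
```

The exact count always exceeds the rate-averaged count, and the difference agrees with equation B.4.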
The larger the difference between the rate levels, the larger the error. The exact number of expected coincidences is larger than the approximated number ñ. Thus, if we estimate the number of coincidences by averaging the rates within the analysis window, we tend to overestimate the significance. The relevant parameter for the significance test, the number of expected coincidences, defines the mean of the assumed Poisson distribution. This number, however, is determined by the rates, the size of the analysis window, and the number of trials. For simplicity, we assume a constant Tw. Different rate combinations of λ1 and λ2 can lead to the same number of expected coincidences. Thus, in order to study false positives, we derive the dependence of the exact number of coincidences n* as a function of the approximated number of coincidences ñ by eliminating the rates. We define

r = λ2/λ1  (B.5)
as the ratio of the rates. n* = n*,1 + n*,2 is the total number of expected coincidences, with

n*,i = (1/2) · Tw · M · (λi)² · h²  (B.6)
the expected number of coincidences in each segment. Using equations B.5 and B.6, we obtain

n* = [2(1 + r²)/(1 + r)²] · ñ.  (B.7)
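Equation B.7 and its consequence for the false-positive rate can be sketched as follows. The threshold nα is set from the averaged mean ñ, while counts actually follow the larger exact mean n*. Treating the exact count as Poisson is a simplifying assumption for illustration; the text notes that the true distribution is really a convolution of the two segment distributions. All parameter values are illustrative.

```python
import math

# Sketch of equation B.7 and the resulting inflation of false positives
# (illustrative; see the hedges in the surrounding text).

def exact_over_approx(r):
    """Factor n*/n~ as a function of the rate ratio r (equation B.7)."""
    return 2.0 * (1.0 + r ** 2) / (1.0 + r) ** 2

def poisson_sf(k, mu):
    """P(N >= k) for N ~ Poisson(mu)."""
    p, cdf = math.exp(-mu), 0.0
    for i in range(k):
        cdf += p
        p *= mu / (i + 1)
    return 1.0 - cdf

def effective_alpha(n_tilde, r, alpha=0.01):
    """Effective significance level at nominal level alpha."""
    n_alpha = 0
    while poisson_sf(n_alpha, n_tilde) > alpha:   # threshold set from mean n~
        n_alpha += 1
    return poisson_sf(n_alpha, exact_over_approx(r) * n_tilde)

# For r = 1 (stationary rates) the factor is exactly 1 and the effective
# level stays at or below alpha; for r > 1 false positives are inflated.
```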
The slope of n*(ñ) is always ≥ 1; that is, for all values of r, n*/ñ ≥ 1. In order to calculate the critical rate relation that leads to a significant outcome in expectation, we systematically vary n* by varying r and compute the minimal number of coincidences needed for significance ñα, given a significance level α = 0.01. The intersection of n*(r, ñ) and ñα(r, ñ) gives the critical rate relation r. The larger r is, the steeper the slope of n*(r, ñ) and thus the smaller the minimal ñ for which false positives are obtained. Interestingly, the mapping of (λ1, λ2) to (ñ, r) is invertible. The critical rate relation and the corresponding rates are shown as functions of ñ in Figure 9A. Up to now, we have considered only the case where n* is above ñα, corresponding to a significant outcome in expectation. Now we are interested in the effective significance level or, equivalently, the percentage of false
Unitary Events: II. Nonstationary Data
Figure 9: False positives induced by a stepwise rate change. As the worst-case scenario, two neurons are considered that change their rates in parallel in stepwise fashion from rate level λ1 to λ2, and the analysis window is centered at the time of the rate change. (A) Critical rate relation r = λ2/λ1 (top) that leads to false positives at a given number of coincidences ñ. The corresponding rate levels are shown in the bottom panel. (B) Percentage of false positives. For each parameter constellation (ñ, r), the contour plot shows the percentage of false positives for a significance level α = 0.01: contour lines at 1% (dashed) and 5%, 10%, . . . , 100% (solid). Gray scale indicates the percentage of false positives. In the stationary situation (r = 1), the percentage of false positives equals α.
positives at a given parameter constellation (ñ, r). Therefore, we first calculate for each ñ the minimal number of coincidences ñα at significance level α = 0.01, assuming a Poisson distribution with mean ñ. Then we determine the significance level for ñα, assuming now the exact mean number of coincidences n* and the corresponding distribution. Note that the distribution of coincidence counts is the convolution of the distributions for the two segments. Figure 9B illustrates that the larger ñ is, the lower the rate ratio r that can be tolerated.

Appendix C: Size of the Analysis Window

C.1 Stationary Coincidence Rate. The need for a minimal window size in order to be able to detect excess coincidences will be derived for the case when both the background rates and the injected coincidence rate are stationary. The minimal number of coincidences (nα) that just fulfills the condition S(nα, npred) = Sα depends nonlinearly on the number of occurrences expected at chance level (see Figure 4 in the companion article); for higher npred, disproportionately more coincidences nc are needed to reach significance. Moreover, the minimum number of coincidences to reach threshold nα can take only discrete values. This induces discrete jumps in nα and in the
joint-surprise function (see Figure 10). In the stationary case, equations 3.1 and 3.3 reduce to

nemp = [λc·h + (λ·h)²] · M·Tw
npred = [(λc + λ)·h]² · M·Tw.

nemp and npred are both linearly dependent on Tw. As a result, we can express nemp as a linear function of npred:

nemp(npred) = [(λh)² + λc·h] / (λh + λc·h)² · npred  (C.1)
            ≈ (1 + λc·h/(λh)²) · npred,  (C.2)
where in the latter expression we have neglected second-order rate terms involving λc. This linear function has a single intersection with the significance threshold nα (assuming nα to be a smooth function), yielding the minimal number of coincidences considered to be significant. nα, however, is a function of the width of the analysis window. Hence, for each combination of λ and λc, there is a minimal analysis window width Ta needed to detect coincidences as significant events. In the situation described here, with coincidences injected over an arbitrarily long time interval, the minimal analysis window width can always be realized. The pair (λ, λc) determines the value of Ta (here: Ta ≈ 220 ms), while the amount of data available determines whether an analysis window of this width can be realized. When Ta can indeed be realized, we set Tmin = Ta; in section C.2 we will see that this is not always the case in the nonstationary situation.
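The minimal window width can be located numerically by scanning Tw (in bins of width h) until nemp first reaches the significance threshold nα computed from npred. In this sketch, the Poisson threshold mirrors the significance criterion with α = 0.01, the rates follow the Figure 10A example (λ = 20 s⁻¹, λc = 0.4 s⁻¹, h = 1 ms), and M = 100 trials is an assumed value.

```python
import math

# Sketch of locating the minimal analysis window Ta in the stationary case
# (illustrative parameters; M = 100 is assumed, not taken from Figure 10A).

def poisson_sf(k, mu):
    """P(N >= k) for N ~ Poisson(mu)."""
    p, cdf = math.exp(-mu), 0.0
    for i in range(k):
        cdf += p
        p *= mu / (i + 1)
    return 1.0 - cdf

def n_alpha(mu, alpha=0.01):
    """Smallest count that is significant under Poisson(mu)."""
    n = 0
    while poisson_sf(n, mu) > alpha:
        n += 1
    return n

def minimal_window(lam, lam_c, h, M, alpha=0.01, tw_max=2000):
    """Smallest Tw (in bins) at which n_emp is significant, or None."""
    for tw in range(1, tw_max + 1):
        n_emp = (lam_c * h + (lam * h) ** 2) * M * tw
        n_pred = ((lam_c + lam) * h) ** 2 * M * tw
        if n_emp >= n_alpha(n_pred, alpha):
            return tw
    return None

# With these example parameters, the scan yields a window on the order of
# 200 ms, in line with the Ta of roughly 220 ms quoted in the text.
```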
As is clear from equation C.2, the function nemp(npred) is steeper if the rate of the injected coincident events is higher, and hence crosses the curve of nα at a lower value of nemp and, consequently, of Ta. This behavior is summarized in the bottom graph of Figure 10A. Thus, in the stationary case, detection of excess coincident events can be ensured by increasing the analysis window. In addition, enlarging the window decreases possible effects of the stepwise behavior of the joint-surprise.

C.2 Nonstationary Coincidence Rate. We now derive the conditions for the appropriate window size in the case when coincidences occur clustered in time on top of a constant background rate. Consider first a situation where

Figure 10: Facing page. Minimal and maximal window size. (A) Minimal window size. Existence and parameter dependence of the minimal analysis window size Ta needed to detect injected coincidences as significant. (Upper graph) Independent of window size, the number of coincidences minimally needed (nα) to reach the significance criterion is a function of the significance level α and the expected number of coincidences npred (gray curve, here for α = 0.01). With increasing npred, nα increasingly deviates from the diagonal. The solid line is the number of coincidences nemp for npred coincidences due to the rate of the independent processes (λ = 20 s⁻¹) and additional injected coincidences (λc = 0.4 s⁻¹). At a certain npred (here nmin ≈ 9), nemp intersects nα from below. The minimal size of the analysis window is defined by the requirement that the expected number of coincidences is at least nmin. (Lower graph) Assuming stationary rates, npred is proportional to the size of the analysis window Tw. Thus, by proper scaling, the abscissa can as well be expressed in Tw. The dashed-dotted line connecting the two graphs indicates the minimal window size Tmin = Ta for the example pair (λ, λc). The contour plot (contour lines shown for S = 1, 2, . . . , 10) shows the dependence of the joint-surprise S on Tw and λc for the fixed example value of λ. For λc = 0.4 s⁻¹, a joint-surprise value of 2, corresponding to α = 0.01, is reached at a window size of about 220 ms. The sensitivity of the method decreases rapidly when windows narrower than 100 ms are used. For windows wider than 220 ms, sensitivity increases only slowly. (B) Maximal window size. Same analysis as in A for a nonstationary coincidence rate. Coincidences are injected at rate λc = 0.9 s⁻¹ in a "hot region" of duration Tc = 100, centered on t = 500; M = 100, h = 1 ms (background rate λ = 20 s⁻¹). Detectability as a function of the width of the analysis window, centered in the hot region. (Upper graph) The dashed line is the diagonal, where as many coincidences occur as expected for independent rates; the gray curve describes the minimal number of coincidences nα needed to fulfill the significance criterion. For analysis windows below Tc, the number of coincidences increases faster than the minimal number and surpasses the minimal number at Tmin. For Tw wider than Tc, the number of coincidences increases with unit slope, slower than nα. There is a second intersection at Tmax, above which the number of coincidences is no longer significant. The lower graph illustrates the dependence of the joint-surprise on Tw.
the analysis window is placed in the middle of a hot region of width Tc, and the width of the analysis window Tw is gradually increased (see Figure 10B). As long as Tw ≤ Tc, we face the situation of stationary injected events, discussed in the preceding section (cf. Figure 10A). Hence, if nemp(Tw) increases faster than nα(Tw), nemp can intersect nα from below, yielding the minimal window size Ta needed to detect the cluster of injected events (compare to section C.1). Ta depends only on the combination of λ and λc; its value can be obtained from the calibration graph for the stationary situation (see Figure 10A, bottom). When the analysis window exceeds the hot region (Tw > Tc), the total number of coincidences increases further. However, the slope is now reduced to (λh)², since the contribution of the injected coincidences remains constant (M·Tc·λc·h). Since the width of the hot region defines the window size from which on nemp(Tw) shows this reduced slope, a cluster can be detected only if Tc ≥ Ta, with Tmin equal to Ta. For Tc < Ta, nemp(Tw) bends before reaching nα; as a result, Tmin does not exist: in contrast to the stationary situation, the existence of Tmin depends on the width of the hot region Tc. When Tc < Ta, the cluster cannot be detected, even with arbitrarily large analysis windows. The reason is that at Tw = Tc, we have nemp(Tw) < nα(Tw), and for Tw > Tc, the slope ṅemp(Tw) < ṅα(Tw). As a result, nemp(Tw) remains below nα(Tw). This argument also implies that when the cluster can be detected (nemp(Tw) ≥ nα(Tw) for Ta ≤ Tw ≤ Tc), a second intersection of nemp(Tw) and nα(Tw) must exist for some Tw > Tc. Hence, in that case, there is a Tmax ≥ Tc at which nemp(Tw) intersects nα(Tw) from above. Two conclusions can be drawn from this. First, a cluster of excess coincidences is not detectable if its duration Tc remains below the critical time span Ta. Second, even if the cluster is detectable (Tc ≥ Ta), it may still go undetected if the analysis window is either too small (Tw < Ta) or too wide (Tw > Tmax). The range of appropriate window sizes can be obtained from calibration graphs as in Figure 10B.

When the analysis window is shifted gradually across the hot region, the time course of the joint-surprise S appears symmetrical around the center of the hot region. S has a trapezoidal shape in case of Tw ≠ Tc that degenerates to a triangle in the special case Tw = Tc. The duration of the plateau as a function of the parameter pair (Tc, Tw) is summarized in Table 1. Note that in two of the three possible outcomes regarding the observables Tp and Tw (Tp = 0 and Tp ≥ Tw), we have a unique expression for Tc (in the remaining case, Tp < Tw, two possibilities exist for Tc). This uniqueness can be exploited to determine the extent of the hot region from the size of the plateau:

Tc = { Tw,       for Tp = 0
       Tw + Tp,  for Tp ≥ Tw.  (C.3)
Because we are free in choosing the size of the analysis window Tw, the requirement of equation C.3 can always be met. However, even for the case where the relationship is not unique (Tp < Tw), a small variation of Tw immediately allows differentiating between the two possible values of Tc:

Tc = { Tw,       for Tp = 0
       Tw + Tp,  for Tp ≥ Tw
       Tw − Tp,  for Tp < Tw and Ṫp(Tw) > 0
       Tw + Tp,  for Tp < Tw and Ṫp(Tw) < 0.  (C.4)
From the shape of the significance curve and the relationships worked out in equations C.3 and C.4, we can estimate Tc. If the significance threshold is reached at all, typically more than one window is significant. Only in the special case of Tw = Tc = Ta is the maximum of S exactly at threshold level, and for a single window only. For a given combination (Tc, Tw), we can compute the minimal overlap f of the two windows needed to detect the injected coincidences as significant. The overlap f describes the part of Tc "seen" by Tw, expressed as a fraction of Tw:

T′c = { f · Tw,  for f · Tw < Tc
        Tc,      otherwise.  (C.5)
We can now use Tmax to compute the minimal f. To this end, we reverse the question that originally led to the definition of Tmax: given a fixed Tw, what is the minimal T′c,min (and hence, fmin) needed to detect the injected coincidences as significant? Formally, this fmin can be computed as follows:

Tmax(T′c,min) = Tw  (C.6)
Tmax(fmin · Tw) = Tw  (C.7)
fmin(Tw) = (1/Tw) · Tmax⁻¹(Tw).  (C.8)
Having found the minimal overlap fmin, we can construct the extent of the region Ts in which injected coincidences are marked as significant (the time span between the two intersections of S with the significance threshold):

Ts = (Tw − fmin · Tw) + Tc + (Tw − fmin · Tw)  (C.9)
   = 2 · Tw + Tc − 2 · fmin · Tw  (C.10)
   = Tc + 2 · (1 − fmin) · Tw.  (C.11)

The minimal Ts is obtained for Tw = Ta: in that case, fmin = 1. Hence, it follows that Ts = Tc.
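The result of equations C.9 to C.11 is a one-line computation; a minimal sketch, with illustrative names:

```python
# Direct transcription of equations C.9-C.11: the extent Ts of the region
# marked significant, for given window width Tw, hot-region width Tc, and
# minimal overlap f_min.

def significant_region(Tw, Tc, f_min):
    """Ts = Tc + 2 * (1 - f_min) * Tw (equation C.11)."""
    return Tc + 2.0 * (1.0 - f_min) * Tw

# For f_min = 1 (the case Tw = Ta), Ts reduces to Tc.
```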
Figure 11: Multimodal rate amplitude histogram of a frontal cortex neuron, observed over 58 trials of 1500 ms duration. The rate was estimated by calculating a PSTH, smoothed with a window of 200 ms.

Appendix D: Detecting Unitary Events by Cluster Analysis
The approach in UECA is based on the assumption that the neurons' firing rates observed in parallel can only be in one of a finite number of joint rate "states" (Abeles et al., 1995; Seidemann et al., 1996). Using a clustering algorithm, segments of joint-stationary rates can be detected. In each of the resulting time segments, the unitary events analysis for the stationary case (see the companion article) is then performed separately. The amplitude distributions of the neurons' firing rates in many cases show indications of multimodality, suggesting that the firing rates can be in any of a finite number of states (e.g., Figure 11). Accordingly, we try to separate the combined rate activity of several simultaneously recorded neurons into joint-stationary regions, defined as the time segments in which all the firing rates of the N observed neurons are stationary in parallel. This is achieved on the basis of estimates of the instantaneous firing rates (e.g., PSTHs) of each neuron. At each instant of discretized time, we derive a joint-rate vector, its components being the rate of neuron i at time t:

λ⃗(t) = (λ1(t), . . . , λi(t), . . . , λN(t))ᵀ,  i = 1, . . . , N.  (D.1)

These vectors are grouped in N-dimensional λ-space into clusters of similar joint-rate vectors by the k-means clustering algorithm (Hartigan, 1975). It clusters according to R centers of gravity m⃗k. Since we usually do not know the number of underlying rate states of our data beforehand, we have
to vary the number of clusters and check the clustering result in each case. The stopping criteria for the "correct" number of clusters are given by the constraints that each potentially underlying joint state should be captured and that states should not be split into artificial ones ("overfitting"). For this purpose, we defined the following pairwise relative cluster distance:

dk,l = |m⃗k − m⃗l| / (|s⃗k| + |s⃗l|),  ∀ k, l ∈ {1, . . . , R}, k ≠ l,  (D.2)

with s⃗k representing the vector of standard deviations in the kth cluster. The stopping criteria are fulfilled if

dk,l ≥ 1,  ∀ k, l ∈ {1, . . . , R}, k ≠ l.  (D.3)
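The clustering step can be sketched as follows: a toy k-means on joint-rate vectors, followed by the pairwise relative cluster distance of equation D.2 as the stopping criterion of equation D.3. The data, fixed iteration count, and deterministic initialization are illustrative assumptions; Hartigan (1975) describes the algorithm used in the article, and a real analysis would use a library routine.

```python
import math

# Toy sketch of the cluster-analysis step (equations D.1-D.3); not the
# article's implementation.

def kmeans(points, R, iters=50, init=None):
    centers = list(init) if init is not None else list(points[:R])
    for _ in range(iters):
        clusters = [[] for _ in range(R)]
        for p in points:
            j = min(range(R),
                    key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            clusters[j].append(p)
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters

def relative_distances(centers, clusters):
    """d_kl = |m_k - m_l| / (|s_k| + |s_l|), equation D.2."""
    def s_norm(cl, m):
        var = [sum((p[i] - m[i]) ** 2 for p in cl) / len(cl) for i in range(len(m))]
        return math.sqrt(sum(var))          # length of the std-deviation vector
    norms = [s_norm(cl, m) for m, cl in zip(centers, clusters)]
    return {(k, l): math.sqrt(sum((a - b) ** 2 for a, b in zip(centers[k], centers[l])))
                    / (norms[k] + norms[l])
            for k in range(len(centers)) for l in range(k + 1, len(centers))}

# Two well-separated synthetic "rate states": R = 2 is accepted because all
# pairwise relative distances exceed 1 (equation D.3).
cloud_a = [(10 + 0.1 * i, 10 - 0.1 * i) for i in range(20)]
cloud_b = [(40 + 0.1 * i, 30 + 0.1 * i) for i in range(20)]
centers, clusters = kmeans(cloud_a + cloud_b, 2,
                           init=[cloud_a[0], cloud_b[-1]])
```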
We call the final set of clusters joint-stationary rate states. Estimates of the firing rates are obtained by identifying the membership of each joint-rate vector λ⃗ with the cluster mean it belongs to. Projecting the cluster mean back to the associated position on the time axis, we can observe the time course of cluster membership (PSTHs in Figure 12). The set of time instances identified as belonging to the same cluster defines the time segmentation of a state. Data from all segments belonging to one state are analyzed as one single stationary data set. Note that the set of time steps belonging to a single cluster is not necessarily compact and may in fact contain a number of separate time intervals, over which the neurons have approximately the same rates. The performance of the clustering algorithm on a set of simulated, parallel Poisson processes is illustrated in Figure 12. Two extreme cases of firing-rate variations are chosen, covering the possible features of stepwise (Figure 12A) and gradual (Figure 12B) changes. To avoid overfitting, that is, fitting of states to variations that are due only to statistical fluctuations, we smoothed the PSTHs by using a moving average (Abeles, 1982a; Kendall, 1976). The choice of the width of the smoothing window has to be a compromise: large enough to reduce the noise level and small enough not to flatten out meaningful changes of the firing rate (see also Nawrot, Aertsen, et al., 1999). Observe that the trajectories of the joint-rate vectors form clouds in rate space, well separated for rapid changes of the firing rates (see Figure 12A). Clusters can clearly be identified. By contrast, and not unexpectedly, gradual changes of firing rates (see Figure 12B) give rise to less well separated clouds or even smooth trajectories.

Acknowledgments
We thank Moshe Abeles, George Gerstein, and Günther Palm for many stimulating discussions and help in the initial phase of the project. We especially thank Moshe Abeles, Hagai Bergmann, Eilon Vaadia, and Alexa
Figure 12: Unitary events by cluster analysis. Dot displays (middle) show the spiking activity of two parallel processes, the rates of which were varied in stepwise (A) or gradual (B) fashion. (A) 100 trials on the basis of rates varying in stepwise fashion: λ1(i) = 30 s⁻¹ for i ∈ [1, 250); λ1(i) = 60 s⁻¹ for i ∈ [250, 500); λ1(i) = 40 s⁻¹ for i ∈ [500, 1000]; λ2(i) = 30 s⁻¹ for i ∈ [1, 250); λ2(i) = 50 s⁻¹ for i ∈ [250, 750); λ2(i) = 10 s⁻¹ for i ∈ [750, 1000]. (B) 100 (out of the 1000) trials in which neuron 2 was simulated with a constant rate of 30 s⁻¹, whereas the rate of neuron 1 changes linearly throughout the trial from λ1 = 20 s⁻¹ to 90 s⁻¹. The PSTHs (bottom panels) represent the time course of the instantaneous firing rates (smoothing window of 20 ms). (Top panels) Trajectories of the joint rates in rate space. Each instance of a joint-rate vector is represented by a small gray star (abscissa: rate of neuron 1; ordinate: rate of neuron 2). Members of different clusters are indicated by different gray levels. Cluster means are indicated by black crosses and standard deviations by the length of the cross-lines. Rates corresponding to the cluster means are superimposed on the PSTHs in the bottom panels. Cluster means approximate the original stationary rate levels; standard deviations (dashed lines) illustrate the variability of the rates within each cluster. Stepwise changes of cluster means indicate the time segmentation used in further analysis. In B, clustering resulted in an average width of a time segment of 75 ms.
Riehle for kindly putting their experimental data (AR: motor cortex; MA, HB, EV: frontal cortex) at our disposal and for the many exciting discussions on our results. We also thank Robert Gütig, Stefan Rotter, and Wolf Singer for their constructive comments on an earlier version of the manuscript for this article. This work was partly supported by the DFG, BMBF, HFSP, GIF, and Minerva.

References

Abeles, M. (1982a). Quantification, smoothing, and confidence limits for single-units' histograms. J. Neurosci. Meth., 5, 317–325.
Abeles, M. (1982b). Role of cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Thishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi stationary states. Proc. Nat. Acad. Sci. USA, 92, 8616–8620.
Abeles, M., & Goldstein, M. H. (1977). Multispike train analysis. Proc. IEEE, 65(5), 762–773.
Aertsen, A., & Gerstein, G. L. (1985). Evaluation of neuronal connectivity: Sensitivity of cross-correlation. Brain Research, 340, 341–354.
Aertsen, A., Vaadia, E., Abeles, M., Ahissar, E., Bergman, H., Karmon, B., Lavner, Y., Margalit, E., Nelken, I., & Rotter, S. (1991). Neural interactions in the frontal cortex of a behaving monkey: Signs of dependence on stimulus context and behavioral state. J. Hirnf., 32(6), 735–743.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273(5283), 1868–1871.
Baker, S. N., & Gerstein, G. L. (2000). Improvements to the sensitivity of gravitational clustering for multiple neuron recordings. Neural Comp., 12, 2597–2620.
Brody, C. D. (1999a). Correlations without synchrony. Neural Comp., 11, 1537–1551.
Brody, C. D. (1999b). Disambiguating different covariation types. Neural Comp., 11, 1527–1535.
Castelo-Branco, M., Goebel, R., Neuenschwander, S., & Singer, W. (2000). Neural synchrony correlates with surface segregation rules. Nature, 8(405), 685–689.
Freiwald, W., Kreiter, A., & Singer, W. (1995). Stimulus dependent intercolumnar synchronization of single unit responses in cat area 17. Neuroreport, 6, 2348–2352.
Fries, P., Roelfsema, P., Engel, A., König, P., & Singer, W. (1997). Synchronization of oscillatory responses in visual cortex correlates with perception in interocular rivalry. Proc. Nat. Acad. Sci. USA, 94, 12699–12704.
Gat, I., Tishby, N., & Abeles, M. (1997). Hidden Markov modelling of simultaneously recorded cells in the associative cortex of behaving monkeys. Network: Comp. Neural Sys., 8, 297–322.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Trans. Biomed. Eng., 36, 4–14.
Grammont, F., & Riehle, A. (1999). Precise spike synchronization in monkey motor cortex involved in preparation for movement. Exp. Brain Res., 128, 118–122.
Grün, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events without discretization of time. J. Neurosci. Meth., 94, 67–79.
Hartigan, J. A. (1975). Clustering algorithms. New York: Wiley.
Kendall, M. (1976). Time-series (2nd ed.). London: Charles Griffin and Company.
Koenig, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. TINS, 19(4), 130–137.
Kreiter, A., & Singer, W. (1996). Stimulus-dependent synchronization of neuronal responses in the visual cortex of awake macaque monkey. J. Neurosci., 16(7), 2381–2396.
Mountcastle, V. B., Reitböck, R. J., Poggio, G. F., & Steinmetz, M. A. (1991). Adaptation of the Reitboeck method of multiple electrode recording to the neocortex of the waking monkey. J. Neurosci. Meth., 36, 77–84.
Nawrot, M., Aertsen, A., & Rotter, S. (1999). Single-trial estimation of neuronal firing rates. J. Neurosci. Meth., 94, 81–91.
Nawrot, M., Rotter, S., & Aertsen, A. (1997). Firing rate estimation from single trial spike trains. In N. Elsner & H. Waessle (Eds.), Göttingen Neurobiology Report 1997 (Vol. 2, p. 623). Stuttgart: Thieme Verlag.
Nawrot, M., Rotter, S., Riehle, A., & Aertsen, A. (1999). Variability of neuronal activity in relation to behaviour. In N. Elsner & U. Eysel (Eds.), Proceedings of the 1st Göttingen Neurobiology Conference of the German Neuroscience Society 1999 (Vol. 1, p. 101). Stuttgart: Thieme Verlag.
Neven, H., & Aertsen, A. (1992). Rate coherence and event coherence in the visual cortex: A neuronal model of object recognition. Biol. Cybern., 67, 309–322.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability, with application to unitary joint-event analysis. Neural Comp., 12(3), 647–669.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Hamutal, S., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79(6), 2857–2874.
Reitböck, H. J. (1983). A multi-electrode matrix for studies of temporal signal correlations within neural assemblies. In E. Basar, H. Flohr, H. Haken, & A. J. Mandell (Eds.), Synergetics of the brain (pp. 174–182). Berlin: Springer-Verlag.
Riehle, A., Grammont, F., Diesmann, M., & Grün, S. (2000). Dynamical changes and temporal precision of synchronized spiking activity in motor cortex during movement preparation. J. Physiol. (Paris), 94(5–6), 569–582.
Riehle, A., Grün, S., Aertsen, A., & Requin, J. (1996). Signatures of dynamic cell assemblies in monkey motor cortex. In C. von der Malsburg, J. Vorbrüggen, & B. Sendhoff (Eds.), Artificial Neural Networks—ICANN '96 (pp. 673–678). Berlin: Springer-Verlag.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Riehle, A., Seal, J., Requin, J., Grün, S., & Aertsen, A. (1995). Multi-electrode recording of neuronal activity in the motor cortex: Evidence for changes in
the functional coupling between neurons. In H. J. Hermann, D. E. Wolf, & E. Pöppel (Eds.), Supercomputing in brain research: From tomography to neural networks (pp. 281–288). Singapore: World Scientific.
Seidemann, E., Meilijson, I., Abeles, M., Bergman, H., & Vaadia, E. (1996). Simultaneously recorded single units in the frontal cortex go through sequences of discrete and stable states in monkeys performing a delayed localization task. J. Neurosci., 16(2), 752–768.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Singer, W. (1999). Neural synchrony: A versatile code for the definition of relations. Neuron, 24, 49–65.
Singer, W., Engel, A. K., Kreiter, A. K., Munk, M. H. J., Neuenschwander, S., & Roelfsema, P. R. (1997). Neuronal assemblies: Necessity, signature and detectability. Trends in Cognitive Sciences, 1(7), 252–261.
Vaadia, E., & Aertsen, A. (1992). Coding and computation in the cortex: Single-neuron activity and cooperative phenomena. In A. Aertsen & V. Braitenberg (Eds.), Information processing in the cortex (pp. 81–121). Berlin: Springer-Verlag.
Vaadia, E., Ahissar, E., Bergman, H., & Lavner, Y. (1991). Correlated activity of neurons: A neural code for higher brain functions? In J. Krüger (Ed.), Neuronal cooperativity (pp. 249–279). Berlin: Springer-Verlag.
Vaadia, E., Bergman, H., & Abeles, M. (1989). Neuronal activities related to higher brain functions—theoretical and experimental implications. IEEE Trans. Biomed. Eng., 36(1), 25–35.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.

Received July 20, 2000; accepted April 24, 2001.
LETTER

Communicated by George Gerstein

Statistical Significance of Coincident Spikes: Count-Based Versus Rate-Based Statistics

Robert Gütig
[email protected]
Ad Aertsen
[email protected]
Stefan Rotter
[email protected]
Neurobiology and Biophysics, Institute of Biology III, Albert-Ludwigs-University, 79104 Freiburg, Germany

Inspired by different conceptualizations of temporal neural coding schemes, there has been recent interest in the search for signs of precisely synchronized neural activity in the cortex. One method developed for this task is unitary-event analysis. This method tests multiple single-neuron recordings for short epochs with significantly more coincident spikes than expected from independent neurons. We reformulated the statistical test underlying this method using a coincidence count distribution based on empirical spike counts rather than on estimated spike probabilities. In the case of two neurons, the requirement of stationary firing rates, originally imposed on both neurons, can be relaxed; only the rate of one neuron needs to be stationary, while the other may follow an arbitrary time course. By analytical calculations of the test power curves of the original and the revised method, we demonstrate that the test power can be increased by a factor of two or more in physiologically realistic regimes. In addition, we analyze the effective significance levels of both methods for neural firing rates ranging between 0.2 Hz and 30 Hz.

1 Introduction
Thanks to advances in neurophysiological recording technology, it is now feasible to experimentally test biological hypotheses about cortical information processing and neuronal cooperativity on the basis of multiple single-neuron recordings (Aertsen, Bonhoeffer, & Krüger, 1987; Nicolelis, 1998). However, due to the stochastic appearance of neural response patterns in the cortex (Palm, Aertsen, & Gerstein, 1988), the neurobiological concepts in question must be translated into precise statistical hypotheses to be verified by specifically designed statistical tests. Inspired by a number of different conceptualizations of temporal neural coding schemes, such as correlational cell assemblies (Aertsen & Gerstein, 1991; Gerstein, Bedenbaugh, & Aertsen, 1989; von der Malsburg, 1981), coherent oscillations (Singer, 1993), and precise firing patterns (Abeles, 1982,

Neural Computation 14, 121–153 (2001)
© 2001 Massachusetts Institute of Technology
R. Gütig, A. Aertsen, and S. Rotter
1991), there has been particular emphasis on the search for synchronized neural activity (Abeles & Gerstein, 1988; Abeles, Bergman, Margalit, & Vaadia, 1993; Kreiter & Singer, 1996; Prut et al., 1998; Singer, 1999). One of the methods developed for this task is unitary-event analysis (Grün, 1996; Grün, Diesmann, Grammont, Riehle, & Aertsen, 1999; Grün, Diesmann, & Aertsen, in press-a; Grün, Diesmann, & Aertsen, in press-b; Riehle, Grün, Diesmann, & Aertsen, 1997). This method searches recordings from multiple single neurons for epochs with distinctly more (near-)coincident spikes than expected from independent neurons obeying Poissonian spike statistics. The core of unitary-event analysis consists of computing the probabilities (joint-p-values) for the occurrence of a given minimum number of coincident spikes in short time segments, under the null hypothesis of independence. Segments with a joint-p-value below a fixed level of significance α are identified as significant epochs where the null hypothesis is rejected. Here we demonstrate that by revising the original testing procedure, more specifically, by implementing a different coincidence count distribution, we can substantially increase the power of the method. Figure 1 compares the behavior of the original and the modified versions of the method on an empirically motivated, simulated data set of two neurons, stretching over epochs of correlated activity. Figure 1D depicts all epochs marked as significant by either of the two methods. The plot shows that the original version (Bin) misses epochs of synchronous neural activity that are detected by the modified version (Hyp). The primary goal of this article is to investigate and compare two central properties—the power and the effective significance level—of the statistical test underlying the original and the revised versions of unitary-event analysis.
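The joint-p-value computation at the core of the method can be sketched as follows. For illustration, the coincidence count distribution is taken as Poisson with the expected number of coincidences as its mean; this is a simplification, and the article's whole point is to contrast the rate-based (binomial) and count-based (hypergeometric) formulations of this distribution.

```python
import math

# Minimal sketch of a joint-p-value: the probability of observing at least
# the empirical number of coincidences under the null hypothesis of
# independence. The Poisson form is an illustrative simplification.

def joint_p_value(n_emp, n_exp):
    """P(N >= n_emp) for N ~ Poisson(n_exp)."""
    p, cdf = math.exp(-n_exp), 0.0
    for i in range(n_emp):
        cdf += p
        p *= n_exp / (i + 1)
    return 1.0 - cdf

# An epoch is marked significant when joint_p_value(...) falls below the
# chosen significance level alpha.
```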
Our investigation focuses on the identification of significant epochs; we will not treat the statistical problems of interdependent testing, arising when
Figure 1: Facing page. Comparison of the original and the modified version of the statistical test underlying unitary-event analysis, applied to simulated data. Matching the data of Riehle, Grün, Diesmann, and Aertsen (1997), we simulated 36 trials of two dependent neurons over 1300 bins (each of 5 ms duration) with the spike event probabilities p1 = 0.15 and p2 = 0.05, corresponding to mean firing rates of 30 Hz and 10 Hz. (A, B) Raster plots of neurons 1 and 2. (C) The spike event and coincidence counts for 65 nonoverlapping analysis windows of 100 ms duration (n = 36 · 100 ms / 5 ms = 720). (D) The upper part shows the correlation of spike counts r of the two neurons that was controlled through the stochastic model underlying the simulation of the data (see section 2 for details). The lower part depicts the epochs that were marked as significant by the original test (upper row) and the modified test (lower row) at a significance level of α = 0.01 (dotted line in E). (E) The corresponding joint-p-values for both versions of the test, calculated in each of the analysis windows.
[Figure 1]
analysis windows overlap. Preliminary results have been presented in abstract form (Gütig, Rotter, Grün, & Aertsen, 2000).

2 Stochastic Model
First, we sketch the mathematical framework underlying our statistical assessment of the significance of coincident spiking of two neurons. Based on this framework, in section 3 we summarize the statistical test underlying the original unitary-event analysis (Grün, 1996) and describe our revised version of the analysis method. One key element of our approach is to parameterize the stochastic dependence of the two neurons in terms of their spike correlation. This will be used in section 4 to calculate and compare the power of both statistical tests. The extension of this new approach to the case of three or more neurons is conceptually straightforward but numerically demanding. For clarity, most mathematical details are deferred to appendix A.

We assume the analysis to extend over a time window composed of n pairs of corresponding time bins (one for each neuron) of width Δt. Each bin can take either the value 1, denoting the observation of at least one spike in that time bin (we refer to this as a spike event), or the value 0, denoting the absence of spikes. Our treatment will refer only to joint observations of all n bins in the analysis window and disregard any information about the fine structure of the data in individual trials. Hence, an application of our results to pooled data from a multiple-trial design needs to assume stationarity across trials. Two main assumptions will be made in the following:

A1: The spike probabilities and the binwise spike correlations of the two neurons are the same for all n pairs of bins of the analysis window ("stationarity").

A2: All bins, except for those in a pair, are stochastically independent ("serial independence").

Note that assumption A1 is conceptually not essential to our framework, which is suited to treat a neural system with nonstationary spike probabilities. Doing so, however, would give rise to probability distributions with many parameters (cf. appendix C), greatly complicating the calculations. By contrast, assumption A2 clearly imposes severe restrictions on the generality of the approach.
Although it was not always explicitly stated, this assumption also underlies previous treatments of the topic (Grün, 1996; Roy, Steinmetz, & Niebur, 2000). The important task of overcoming the difficulties introduced by serially correlated spike trains is the subject of ongoing research. As shown in appendix A, assumptions A1 and A2 allow us to completely characterize the probability space describing the realization of two neural spike trains by specifying the individual spike probabilities of the two neurons, p1 and p2, and their spike correlation r. We will abbreviate this parameter triplet throughout by ξ := (p1, p2, r). The correlation r parameterizes the stochastic dependence of the two neurons, with the case of independently spiking neurons corresponding to r = 0. Note that for binary variables as used here, the notion of stochastic (in)dependence and the
Table 1: 2 × 2 Table of Counts for Analysis Windows Comprising n Bins.

                            Spike from       No Spike from
                            Neuron 1         Neuron 1
  Spike from Neuron 2       k                c2 − k               c2
  No Spike from Neuron 2    c1 − k           n − c1 − c2 + k      n − c2
                            c1               n − c1               n

Note: The counts c1 and c2 denote the number of spike events from neurons 1 and 2, respectively, and k denotes the number of coincident spike events, in one particular observation.
notion of correlation coincide. Thus, in our treatment of the test power in section 4, we will use nonvanishing values of r to quantify violations of the null hypothesis of independent firing. We will indicate the case of stochastic independence by writing ξ0 instead of ξ. The central statistics of the following treatment will be the individual spike counts, C1 and C2, and the coincidence count K, all derived from an observation of the two neurons in the analysis window. Since, according to our definition of spike events, each bin can hold at most one count, the total number of spike events observed from each neuron in a given analysis window is restricted to the integers between 0 and n, and the number of coincident spike events K cannot exceed either one of the corresponding spike counts C1 and C2. As shown in Table 1, the statistics C1, C2, and K give rise to a 2 × 2 table of counts for each realization of the analysis window. We note in passing that because of assumption A1, the correlation r between spike events in coincident bins (see equation A.3) equals the correlation of the spike event counts of the two neurons, independently of the number of bins n. Both tests for independence discussed in the following use the coincidence count K as their test statistic; they define statistical significance based on the number of coincident spikes observed within the analysis window. Thus, the probability distribution of this random variable, P_ξ(K = k), directly enters the computation of the joint-p-values in each of the tests and therefore affects their statistical properties. The two methods, however, differ in the amount of information they draw from the data. To make this distinction clear, we discuss here both probability distributions for the case of independent neurons, that is, P_{ξ0}(K = k) where r = 0. The general forms of these distributions for arbitrary r ∈ [−1, 1] and their derivations are given in appendix B.
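The joint statistics of one bin pair under assumptions A1 and A2 can be sketched numerically. The following Python fragment is not from the article; it assumes the standard construction of a pair of correlated Bernoulli variables, in which the covariance r·σ1·σ2 fixes the probability of a coincident spike event, and builds the four joint probabilities of a bin pair from the triplet (p1, p2, r):

```python
import math

def bin_pair_probs(p1, p2, r):
    """Joint probabilities of one bin pair under the model of section 2.

    Assumes the usual bivariate-Bernoulli construction: the covariance of
    the two binary bin variables is r * sigma1 * sigma2, which fixes the
    probability p11 of a coincident spike event.
    """
    s1 = math.sqrt(p1 * (1.0 - p1))   # standard deviation of bin 1
    s2 = math.sqrt(p2 * (1.0 - p2))   # standard deviation of bin 2
    p11 = p1 * p2 + r * s1 * s2       # both neurons emit a spike event
    p10 = p1 - p11                    # only neuron 1
    p01 = p2 - p11                    # only neuron 2
    p00 = 1.0 - p11 - p10 - p01       # neither
    return p00, p01, p10, p11

# Parameters matching the simulated data of Figure 1:
p00, p01, p10, p11 = bin_pair_probs(0.15, 0.05, 0.1)

# The construction recovers the requested binwise spike correlation:
cov = p11 - 0.15 * 0.05
corr = cov / (math.sqrt(0.15 * 0.85) * math.sqrt(0.05 * 0.95))
```

For r = 0 this reduces to p11 = p1·p2, the independence case used in equations 2.1 and 2.2. Note that not every triplet is admissible: all four probabilities must remain in [0, 1].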
We emphasize that while the distributions for r = 0 suffice for the definition of the statistical tests in the next section, the calculation of test power in section 4 will rely on the general expressions from appendix B.

We first consider the situation where only the (constant) spike probabilities p1 and p2 of both neurons are known. In particular, no further knowledge of the spike counts c1 and c2 is assumed. Then, based on assumptions A1 and A2, the probability distribution of K for independent neurons is given by the binomial distribution,
    P_{\xi_0}(K = k) = \binom{n}{k} (p_1 p_2)^k (1 - p_1 p_2)^{n-k}.    (2.1)
Note that this distribution of coincidence counts was used in the original definition of unitary-event analysis (Grün, 1996), as well as in several recent contributions about the method (Grün et al., 1999; Roy et al., 2000). In the second case, which is conceptually different, we consider the coincidence count distribution making explicit use of the knowledge of the individual spike counts c1 and c2. In this case, we formulate the probability distribution conditional on the specific realization of the spike counts, that is, P_ξ(K = k | c1, c2). Unlike the spike probabilities, these counts are readily accessible in any empirical spike train. It can be shown (cf. appendix B) that the conditional distribution of K for independent neurons is described by the hypergeometric distribution (cf. Palm et al., 1988):
    P_{\xi_0}(K = k \mid C_1 = c_1, C_2 = c_2) = \frac{\binom{c_1}{k} \binom{n - c_1}{c_2 - k}}{\binom{n}{c_2}}.    (2.2)

Note that this distribution does not refer to the spike probabilities p1 and p2 anymore. Moreover, it can easily be verified that it is symmetric with respect to c1 and c2. The general form of the conditional distribution for r ∈ [−1, 1] is much more complicated, though. Its explicit form is derived in appendix B.

3 Count-Based Versus Rate-Based Statistics
Typically, the test of a statistical hypothesis on the grounds of empirical data is based on a specially designed random variable, the so-called test statistic. In order to control the probability that the test fails, certain aspects of the distribution of the test statistic under the null hypothesis must be known. The probability that the null hypothesis is rejected even if it is correct (type I error, or α-error) is usually fixed at some small value (e.g., 5%). Essentially, this is achieved by adjusting the critical region of the test through the choice of an appropriate threshold of the test statistic (Mood, Graybill, & Boes, 1974). Similarly, the case where the null hypothesis is not rejected even if it is false is referred to as a type II error, or β-error. Given
a parametric model of deviations from the null hypothesis, which specifies the probability distribution of the test statistic under a given deviation, it is possible to calculate the β-error probability of the test. Based on this probability, one can directly assess the power of the test: the probability of correctly detecting a given violation of the null hypothesis. Within the framework of the stochastic model defined in the previous section, including the assumptions A1 and A2, the null hypothesis underlying the detection of significant epochs comprises the following additional assumption for each analysis window:

H0: The activity of the two neurons is stochastically independent, r = 0.

The original version of unitary-event analysis is based on the binomial coincidence count distribution (see equation 2.1). In this approach, empirical estimates for the parameters p1 and p2 on the basis of the spike counts c1 and c2,

    \hat{p}_i = \frac{c_i}{n} \quad (i = 1, 2),    (3.1)

are used to calculate the probability that k̃ or more coincident events are observed under the given conditions. This probability, here denoted by J^{bin}_{ξ0}(k̃, c1, c2), is referred to as the joint-p-value (Grün, 1996):

    J^{bin}_{\xi_0}(\tilde{k}, c_1, c_2) := \sum_{k = \tilde{k}}^{n} \binom{n}{k} \left( \frac{c_1 c_2}{n^2} \right)^{k} \left( 1 - \frac{c_1 c_2}{n^2} \right)^{n-k}.    (3.2)
We emphasize that this procedure uses the observed spike counts solely for the estimation of the spike probabilities. One effect of this is that the binomial distribution gives nonvanishing probabilities for the impossible outcome k > min(c1, c2). In addition, the binomial coincidence count distribution does not take into account the stochastic nature of the rate estimation procedure itself. To make better use of the information contained in the spike counts, we propose to compute the joint-p-values from the conditional probabilities P_{ξ0}(k | c1, c2) instead. Given assumption H0, together with the empirically accessible spike counts c1 and c2, these probabilities are determined by the hypergeometric distribution (see equation 2.2; Palm et al., 1988; Aertsen, Gerstein, Habib, & Palm, 1989; Lehmann, 1997). Thus, by using the empirical spike counts to specify the conditional distribution of coincidence counts (see equation 2.2), we can completely eliminate the rate estimation (see equation 3.1) from the testing procedure. Accordingly, the joint-p-value
J^{hyp}_{ξ0}(k̃, c1, c2) of an epoch with k̃ coincident spikes, based on the hypergeometric distribution, is given by

    J^{hyp}_{\xi_0}(\tilde{k}, c_1, c_2) := \sum_{k = \tilde{k}}^{\min(c_1, c_2)} \frac{\binom{c_1}{k} \binom{n - c_1}{c_2 - k}}{\binom{n}{c_2}}.    (3.3)
To illustrate the difference between the two approaches, Figure 2A shows that for large enough values of the coincidence count k, the joint-p-values according to the binomial distribution exceed the corresponding probabilities based on the hypergeometric distribution. Thus, although the two distributions have equal mean, the binomial distribution overestimates the probability for the occurrence of large coincidence counts for certain combinations of individual spike counts. The effect is that for the parameter values in Figure 2A, a statistical test using the hypergeometric distribution would classify an observation of k̃ ≥ 12 as significant (α = 0.05; dotted line), whereas the corresponding test based on the binomial distribution would need at least k̃ = 13 coincident spikes. Figure 2B shows the corresponding differences in these critical coincidence counts for all count combinations c1, c2 ∈ {0, …, 100}. Light areas depict count combinations where the critical coincidence count for the binomial distribution exceeds the critical count of the modified test by one; for dark areas, the difference is zero. This example suggests that the use of the conditional coincidence count distribution may effectively lead to an increase in sensitivity of the test in certain parameter regimes. In fact, it is a result from mathematical statistics that the randomized version of the proposed test, called Fisher's exact test in the nonrandomized form presented here, is uniformly most powerful unbiased (Lehmann, 1997). A brief discussion of the randomized test is given in appendix E. Figure 2A also illustrates that in general, the joint-p-values of the critical coincidence counts do not equal the nominal significance threshold α. Moreover, for a given α-level, both tests will in general not operate at the same effective significance level.
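The joint-p-values of equations 3.2 and 3.3 and the resulting critical coincidence counts can be reproduced directly. The following sketch is not part of the article; it uses only the Python standard library and evaluates both tail probabilities for the parameter values of Figure 2A:

```python
from math import comb

def joint_p_bin(k_tilde, c1, c2, n):
    """Joint-p-value of eq. 3.2: binomial tail with rate estimates c1/n, c2/n."""
    p = (c1 * c2) / (n * n)
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(k_tilde, n + 1))

def joint_p_hyp(k_tilde, c1, c2, n):
    """Joint-p-value of eq. 3.3: hypergeometric tail given the counts c1, c2."""
    return sum(comb(c1, k) * comb(n - c1, c2 - k)
               for k in range(k_tilde, min(c1, c2) + 1)) / comb(n, c2)

def k_crit(joint_p, c1, c2, n, alpha):
    """Smallest k whose joint-p-value falls below alpha (critical count)."""
    for k in range(min(c1, c2) + 2):
        if joint_p(k, c1, c2, n) <= alpha:
            return k
    return None

# Parameter values of Figure 2A: n = 720, c1 = 100, c2 = 51, alpha = 0.05.
kc_bin = k_crit(joint_p_bin, 100, 51, 720, 0.05)   # 13
kc_hyp = k_crit(joint_p_hyp, 100, 51, 720, 0.05)   # 12
```

Consistent with Figure 2A, the hypergeometric test already classifies k̃ ≥ 12 as significant, while the binomial test requires k̃ ≥ 13.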
Figure 2: Facing page. Comparison of binomial and hypergeometric distributions. (A) Joint-p-value: cumulative binomial and hypergeometric probability distributions for observing k or more coincidence counts for n = 720, c1 = 100, and c2 = 51. (B) Difference between the critical coincidence counts kcrit of both distributions for n = 720 and α = 0.05. Light areas depict count combinations where the critical coincidence count of the binomial distribution exceeds the critical coincidence count of the hypergeometric distribution by one. For count combinations in the dark areas, the difference is zero.
[Figure 2]
Besides the issue of sensitivity of the two tests, which will be addressed in the following section, we note one further important implication of using the hypergeometric coincidence count distribution. As shown in appendix C, the modified version of the test does not require that both neurons have constant firing rates. Rather, it suffices if the spike probability of only one of the two neurons (either one) remains constant throughout the analysis window. Since the original method had to assume stationarity of both neural firing rates, this relaxation of the stationarity requirement A1 implies an important extension of the class of empirical data that qualify for the statistical analysis described here. Note, however, that this relaxation concerns only the inferential statistical testing of H0 as implemented by our revised method. The following investigation of the statistical properties of the method when applied to neurons that violate the hypothesis of independence still relies on stationarity in both neurons.

4 Test Power
To obtain a thorough understanding of the properties of the statistical significance tests defined above, we calculated the power function (Mood et al., 1974) of the tests for both coincidence count distributions. Specifically, this will allow us to quantify the advantage of using the hypergeometric count distribution concerning its test performance. For a given set of alternative hypotheses, the power of the statistical tests investigated here is the probability of obtaining a coincidence count that yields a significant finding when tested against the null hypothesis. Loosely speaking, the power of the test measures the probability that the test will detect a given violation of the null hypothesis. According to the stochastic model defined in section 2, we use r to parameterize the set of alternative hypotheses. We emphasize that the power curves we will compute only characterize the sensitivity of the methods regarding violations of the null hypothesis of independent firing (H0). Our treatment does not include violations of the other assumptions A1 and A2. We also note that the power curves will be calculated with respect to identical nominal significance levels α. Hence, the resulting values describe the more common nonrandomized application of the methods, as discussed in recent contributions (Grün, 1996; Grün et al., 1999; Pauluis & Baker, 2000; Roy et al., 2000). The effective significance levels of the tests, however, generally differ, and, hence, differences in test power will depend on differences in effective significance levels. We introduce the auxiliary functions k^{bin}_{crit} and k^{hyp}_{crit}:

    k^{bin}_{crit}(c_1, c_2) := \min \{ k \in \mathbb{N} : J^{bin}_{\xi_0}(k, c_1, c_2) \le \alpha \}    (4.1)

    k^{hyp}_{crit}(c_1, c_2) := \min \{ k \in \mathbb{N} : J^{hyp}_{\xi_0}(k, c_1, c_2) \le \alpha \}.    (4.2)
For any given combination of spike event counts c1 and c2, these k-values give the minimum number of coincident spikes that leads to a rejection of the null hypothesis at the significance level α, when tested against the null hypothesis underlying the corresponding coincidence count distribution. The probability of rejecting the null hypothesis for given counts c1 and c2 and ξ is then determined by

    P_{\xi}(K \ge k_{crit}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2),    (4.3)

which can be straightforwardly computed from equation B.7. It is important to note that this rejection probability is a function of the individual spike counts c1 and c2. This means that it is not an intrinsic property of the test itself, but rather a property of the test in combination with a specific empirical observation. Thus, to calculate the power of a test, we need to calculate the expectation value of the rejection probability with respect to the joint count distribution (see equation B.5), which is also specified through the parameter ξ of an alternative hypothesis. Thus, for a given ξ, the power p_ξ of the test is given by the expectation value of the rejection probability for each of the two underlying coincidence count distributions:

    p^{bin}_{\xi} = E_{\xi} \left[ P_{\xi}(K \ge k^{bin}_{crit}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2) \right]
                  = \sum_{c_1, c_2 = 0}^{n} P_{\xi}(C_1 = c_1, C_2 = c_2) \, P_{\xi}(K \ge k^{bin}_{crit}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2)    (4.4)

    p^{hyp}_{\xi} = E_{\xi} \left[ P_{\xi}(K \ge k^{hyp}_{crit}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2) \right]
                  = \sum_{c_1, c_2 = 0}^{n} P_{\xi}(C_1 = c_1, C_2 = c_2) \, P_{\xi}(K \ge k^{hyp}_{crit}(c_1, c_2) \mid C_1 = c_1, C_2 = c_2).    (4.5)
We refer to appendix D for an account of the numerical evaluation of these probabilities.

4.1 Difference in Test Power. In this section, we investigate the differences in power of the rate-based versus the count-based tests for significance of coincident spike activity. Specifically, we will discuss how this difference depends on the significance level α and the number of bins in the analysis window n, and how this difference is affected by the spike probabilities p1 and p2. Apart from comparing the performance of the two methods, our analysis is also valuable for applications to empirical data. For a given set of parameters, knowledge of the power function allows us to assess whether a planned analysis is feasible or, vice versa, it can be used to guide the design of experiments. As mentioned before, we will use the spike correlation r to parameterize deviations from the null hypothesis of independent firing. The value of r will be varied over a physiologically realistic regime, as reviewed by Abeles (1991; see also Aertsen & Gerstein, 1985).
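Without the closed-form conditional distributions of appendix B, the power values defined in equations 4.4 and 4.5 can at least be approximated by Monte Carlo simulation of the stochastic model. The sketch below is not from the article; it assumes the standard bivariate-Bernoulli construction for a correlated bin pair and estimates both powers in the regime studied next (n = 720, p1 = 0.15, p2 = 0.05, r = 0.1, α = 0.05):

```python
import math
import random
from math import comb

def joint_p_bin(k_tilde, c1, c2, n):
    # Binomial joint-p-value of eq. 3.2.
    p = (c1 * c2) / (n * n)
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(k_tilde, n + 1))

def joint_p_hyp(k_tilde, c1, c2, n):
    # Hypergeometric joint-p-value of eq. 3.3.
    return sum(comb(c1, k) * comb(n - c1, c2 - k)
               for k in range(k_tilde, min(c1, c2) + 1)) / comb(n, c2)

def estimate_power(p1, p2, r, n, alpha, windows, rng):
    # Probability of a coincident spike event in one bin pair (assumed
    # bivariate-Bernoulli construction; covariance = r * sigma1 * sigma2).
    p11 = p1 * p2 + r * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    weights = [1 - p1 - p2 + p11, p2 - p11, p1 - p11, p11]  # 00, 01, 10, 11
    rej_bin = rej_hyp = 0
    for _ in range(windows):
        bins = rng.choices((0, 1, 2, 3), weights=weights, k=n)
        c1 = sum(b in (2, 3) for b in bins)   # spike events of neuron 1
        c2 = sum(b in (1, 3) for b in bins)   # spike events of neuron 2
        k = sum(b == 3 for b in bins)         # coincident spike events
        rej_bin += joint_p_bin(k, c1, c2, n) <= alpha
        rej_hyp += joint_p_hyp(k, c1, c2, n) <= alpha
    return rej_bin / windows, rej_hyp / windows

rng = random.Random(0)
power_bin, power_hyp = estimate_power(0.15, 0.05, 0.1, 720, 0.05, 300, rng)
```

Each simulated window yields counts (c1, c2, k); the estimated power is simply the fraction of windows in which the respective test rejects H0. The Monte Carlo estimate carries a sampling error of order 1/√windows.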
We start the comparison by calculating the power curves of both analysis methods in a parameter regime that matches the empirical findings published by Riehle et al. (1997). Accordingly, we let the analysis window cover n = 720 bins of 5 ms each (originally collected from 36 trials) and set the spike probabilities of the two neurons to p1 = 0.15 and p2 = 0.05, corresponding to mean firing rates of 30 Hz and 10 Hz, respectively. We computed the power curves for the significance levels α = 0.01, α = 0.05 (this value was used by Riehle et al., 1997), and α = 0.1. Figure 3 shows that the tests based on the hypergeometric coincidence count distribution clearly outperform the original tests in terms of their power. Considering a spike correlation of r = 0.1, we see from Figure 3B that for α = 0.01, the chance of rejecting the (false) null hypothesis of independent neurons is increased by over 0.1 compared to the original method. This corresponds to a relative increase in power of about 50% (see Figure 3). From Figures 3A and 3B we also see that the difference between the tests increases when they are chosen to operate at more conservative significance levels (i.e., lower α). The inset of Figure 3A shows that for large analysis windows (here n = 720), the power decreases smoothly as α is decreased. It is interesting to note that the relative difference in test power (see Figure 3C) monotonically increases (up to 150%) as r approaches zero. Thus, unlike the difference in power, which follows a bell-shaped curve (see Figure 3B), the relative difference in power becomes maximal for r = 0, that is, for independent neurons. This means that, as already suggested by Figure 2, the original method yields fewer significant findings in a low-correlation regime. In other words, for r → 0, it effectively operates at a lower probability of yielding a false-positive finding than the count-based test.
This is important, since it implies that the neglect of count information leads to a conservative behavior of the original test as applied in the past. However, as we will see, this is not the case for small values of n ≈ 20 (i.e., for narrow analysis windows and single-trial applications) in connection with specific values of α. While both tests, as expected, gain power with increasing n (see Figure 4A), there is no qualitative change in the difference in power for larger analysis windows (n ∈ {100, …, 700}, data not shown). The count-based test outperforms the rate-based version, with differences in power qualitatively corresponding to the curves shown in Figure 3B. However, because of the overall increase in power, the peaks of the difference-in-power curves move toward smaller correlations as n grows. Therefore, the revised test is the method of choice when searching for weak correlations in larger analysis windows (n ≥ 100). Another potentially interesting regime is given by small analysis windows (n ≈ 20), as would be needed for time-resolved single-trial analysis. Both tests suffer a considerable loss of power with decreasing n (see Figure 4A). Moreover, for small n, both methods become dominated by
[Figure 3]
Figure 3: Dependence of both tests on nominal significance level. (A) Test power curves of the two methods for n = 720, p1 = 0.15, p2 = 0.05, and α-levels of 0.01, 0.05, and 0.1. The inset shows the test power for α ∈ {0.01, 0.011, …, 0.1} at r ≈ 0.096. (B) Differences in test power: the power values of the modified test minus the power values of the original version. (C) Difference in test power relative to the power of the original version of the test: the difference in power divided by the power of the original version. The analytical results shown in this figure were checked by computer simulations based on the Ran2 and the MT19937 random number generators (cf. Galassi et al., 1998).
[Figure 4]
Figure 4: Dependence of both tests on analysis window size. (A, B) Test power curves for α = 0.05, p = p1 = p2 = 0.1, r ≈ 0.19, and n ∈ {20, 21, …, 700}. (C) Dependence of both tests on nominal significance level for n = 20, p = p1 = p2 = 0.05, r ≈ 0.26, and α ∈ {0.01, 0.011, …, 0.1}. Arrows mark values of α where p^{bin}_ξ > p^{hyp}_ξ.
discreteness effects of the underlying spike statistics (see Figure 4B). Thus, the difference between the two methods becomes extremely sensitive to the nominal level of significance α, in a discontinuous way. For a small number of specific combinations of n and α (e.g., n = 20 and α = 0.049), the power of the rate-based test can even be greater than the power of the count-based test (see Figure 4C). Thus, it would not, as was the case for large n, miss significant epochs, but instead would indicate neural synchrony in cases that are not judged significant once knowledge of the individual spike counts is incorporated into the testing procedure. Due to this behavior for small n, the original method should not be used in these parameter regimes. Turning to the dependence of both tests on varying spike probabilities p1 and p2, we first let p := p1 = p2 and computed the power curves for p ranging from 0.01 to 0.15 (again, n = 720 and α = 0.05). The results are shown in Figures 5A and 5B: as the power of both methods increases with growing spike probability p, the difference in their power also becomes larger. Thus, the advantage of the count-based method increases with higher neural firing rates. Also note that the power of both methods for low spike probabilities becomes small: the chance to detect a spike correlation of 0.1 is only around 30%. Finally, Figure 5C shows the difference in power for asymmetric spike probabilities p1 ≠ p2 with constant product p1p2 = 0.0075. Observe that the difference between the two tests grows as the asymmetry in the spike probabilities increases. This result indicates that the common assumption of equal spike probabilities underlying other investigations of the method (Grün et al., 1999; Roy et al., 2000) may mask nontrivial properties of the tests, which are enhanced in regimes with different spike probabilities.

5 Effective Significance Level
Besides its power, the significance level of a statistical method is another important characteristic of its performance in practical applications. In this section, we analyze and compare the significance levels of the two tests. Generally, the significance level α of a statistical test denotes the probability that it leads to a rejection of its null hypothesis even if it were correct (i.e., the probability of making a type I error). In principle, this probability is fixed before computing the test statistics. However, tests based on discrete random variables can effectively operate only at levels that correspond to p-values of actual realizations of the test statistic (cf. Figure 2; Mood et al., 1974). It is therefore necessary to differentiate between the nominal significance level (i.e., the value of α denoting the significance threshold) and the effective significance level (i.e., the largest possible p-value that still falls below the α threshold) (see also Roy et al., 2000). In principle, "randomized tests" provide the means to adjust the effective significance level of a test to its nominal level α by introducing an auxiliary random variable, not related to the data under testing. However, due to conceptual objections
[Figure 5]
Figure 5: Dependence of both tests on spike probabilities (n = 720, α = 0.05). (A) Test power curves of the two methods for "symmetrical" spike probabilities p = p1 = p2 ∈ {0.01, 0.02, …, 0.15}. (B) Differences in test power corresponding to A. (C) Differences in test power for asymmetrical spike probabilities (p1, p2) ∈ {(0.1, 0.075), (0.15, 0.05), (0.2, 0.0375), (0.25, 0.03), (0.3, 0.025)}.
against this procedure, randomized tests are not commonly used in applied statistics (Mood et al., 1974), in spite of their attractive theoretical properties (Lehmann, 1997). A brief discussion of the randomization of the test investigated in this study is given in appendix E. Given the critical coincidence counts k^{bin}_{crit}(c1, c2) and k^{hyp}_{crit}(c1, c2) introduced in the previous section (see equations 4.1 and 4.2), the effective significance level of the tests investigated here is determined by the probability of obtaining a coincidence count k ≥ k_{crit}(c1, c2), evaluated with respect to the coincidence count distribution of the corresponding test, that is, by J^{bin}_{ξ0}(k^{bin}_{crit}(c1, c2), c1, c2) and J^{hyp}_{ξ0}(k^{hyp}_{crit}(c1, c2), c1, c2), respectively (cf. equations 3.2 and 3.3). However, it is important to note that these effective significance levels can be calculated only if specific values of c1 and c2 are given. Only then is the coincidence count distribution belonging to the corresponding test sufficiently specified. Due to this dependence on the specific realization of the counts C1 and C2, we will refer to these effective significance levels as count-dependent effective significance levels. Figure 6 shows the count-dependent effective significance levels for tests with a nominal significance level of α = 0.05, for n = 20 and n = 720. The figure clearly demonstrates that the count-dependent effective significance levels of both tests fluctuate considerably between different realizations of the spike counts. In addition, it is interesting to note the effect of the neglect of count information by the original version of the test from the comparison of Figures 6A and 6C (see the figure caption for details). We stress that the calculation of the count-dependent effective significance level does not take into account that the spike counts themselves are random variables.
Thus, it is without doubt the appropriate measure to assess the effective significance level for the inferential statistical test of stretches of data with specific spike counts. However, one should be reluctant to interpret this probability as the overall type I error probability of the method. This conceptual difference between the count-dependent effective significance level and the unconditional effective significance level (i.e., independent of the specific realization of the spike counts) gains crucial importance when interpreting the findings reported by Roy et al. (2000), although these authors do not seem to make this distinction consistently. Instead of characterizing the method by its count-dependent effective significance level, we introduce the expected effective significance level α_{ξ0}. The latter is defined as the expectation value of the count-dependent effective significance level with respect to the joint count distribution of independent neurons characterized by ξ0. The interpretation of this measure as the expectation value of the count-dependent effective significance level rests on the additional assumption that the parameters ξ0 = (p1, p2, r = 0) of the joint count distribution are known. This assumption is not part of the null hypotheses of the tests investigated here. Thus, α_{ξ0} has to be interpreted as the expectation value of the count-dependent effective significance level
138
R. Gütig, A. Aertsen, and S. Rotter
[Figure 6 about here. Panels A and B: binomial coincidence count distribution; panels C and D: hypergeometric coincidence count distribution.]
Figure 6: Count-dependent effective significance levels for α = 0.05. (A, C) Small analysis window, n = 20. The black regions correspond to count combinations where the count-dependent effective significance level is zero; this occurs when either one of the counts is zero or when even the maximal possible number of coincidences (k = 20 for the binomial distribution and k = min(c1, c2) for the hypergeometric distribution) does not yield a significant p-value and, hence, the probability of rejecting the null hypothesis equals zero. (B, D) Large analysis window, n = 720. (A, B) Binomial coincidence count distribution. (C, D) Hypergeometric coincidence count distribution.
of the test when applied to neurons with spike probabilities p1 and p2. By contrast, the landscapes of count-dependent effective significance levels in Figure 6 do not depend on ξ0, that is, on the values of p1 and p2. The parameter ξ0 enters the computation of the expectation value α_{ξ0} only through the joint count distribution underlying its definition. Averaging over the joint count distribution for independent neurons (cf. appendix B), the expected effective significance level α_{ξ0} is obtained in straightforward fashion:

Statistical Significance of Coincident Spikes    139

    \alpha^{bin}_{\xi_0}
      = E_{\xi_0}\left[ J^{bin}_{\xi_0}\left( k^{bin}_{crit}(c_1, c_2), c_1, c_2 \right) \right]
      = \sum_{c_1, c_2 = 0}^{n} P_{\xi_0}(c_1, c_2)\, J^{bin}_{\xi_0}\left( k^{bin}_{crit}(c_1, c_2), c_1, c_2 \right),    (5.1)

    \alpha^{hyp}_{\xi_0}
      = E_{\xi_0}\left[ J^{hyp}_{\xi_0}\left( k^{hyp}_{crit}(c_1, c_2), c_1, c_2 \right) \right]
      = \sum_{c_1, c_2 = 0}^{n} P_{\xi_0}(c_1, c_2)\, J^{hyp}_{\xi_0}\left( k^{hyp}_{crit}(c_1, c_2), c_1, c_2 \right).    (5.2)
A discussion of a numerical evaluation of these expectation values is given in appendix D.

Since the test based on the hypergeometric coincidence count distribution makes no reference to the spike probabilities p1 and p2, it is independent of the "actual" spike probabilities of the investigated neurons. Thus, its expected count-dependent effective significance level α^{hyp}_{ξ0} reflects the probability that the test will reject its correct null hypothesis when applied to data from two independent neurons with spike probabilities ξ0. Note that this is not the case for tests based on the binomial coincidence count distribution, where the testing procedure includes the estimation of the spike event probabilities p_i via p̂_i = c_i / n (see equation 3.1). In general, the estimated probabilities p̂_i will deviate from the actual parameters for most spike count combinations. Moreover, as pointed out in section 3, the binomial coincidence count distribution neglects the information contained in the actual spike counts. As a result, the count-dependent effective significance level based on these estimators will generally not correctly describe the probability of obtaining a coincident count k ≥ k^{bin}_{crit}(c1, c2). Therefore, the number α^{bin}_{ξ0} is of purely theoretical interest and does not bear any operational relevance. Thus, to calculate the probability that a test based on the binomial coincidence count distribution will reject the null hypothesis when applied to independent neurons with spike probabilities p1 and p2, we need to compute the count-dependent probability to obtain a coincidence count k ≥ k^{bin}_{crit}(c1, c2) with respect to the hypergeometric distribution. Note, however, that k^{bin}_{crit}(c1, c2) itself is calculated with respect to the binomial coincidence count distribution. Thus, again forming the expectation value with respect to the joint count distribution, we obtain

    \epsilon^{bin}_{\xi_0}
      = E_{\xi_0}\left[ J^{hyp}_{\xi_0}\left( k^{bin}_{crit}(c_1, c_2), c_1, c_2 \right) \right]
      = \sum_{c_1, c_2 = 0}^{n} P_{\xi_0}(c_1, c_2)\, J^{hyp}_{\xi_0}\left( k^{bin}_{crit}(c_1, c_2), c_1, c_2 \right),    (5.3)

which we will refer to as the α-error probability. It is clear from the above that for the corresponding α-error probability ε^{hyp}_{ξ0} for the test based on the hypergeometric count distribution, we have ε^{hyp}_{ξ0} = α^{hyp}_{ξ0}.
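The expectation value in equation 5.2 can be evaluated directly for small n. The sketch below is illustrative (function names and parameter values are ours); it averages the count-dependent effective significance level of the hypergeometric test over the joint count distribution of two independent neurons:

```python
from math import comb

def binom_pmf(c, n, p):
    return comb(n, c) * p**c * (1 - p)**(n - c)

def hyp_tail(k0, n, c1, c2):
    # P(K >= k0 | C1 = c1, C2 = c2) under the hypergeometric null distribution
    return sum(comb(c1, k) * comb(n - c1, c2 - k)
               for k in range(k0, min(c1, c2) + 1)) / comb(n, c2)

def k_crit_hyp(n, c1, c2, alpha):
    # smallest k whose upper-tail probability does not exceed alpha
    for k in range(min(c1, c2) + 2):
        if hyp_tail(k, n, c1, c2) <= alpha:
            return k

def expected_level(n, p1, p2, alpha):
    # average the count-dependent effective significance level over the
    # joint count distribution of two independent neurons
    total = 0.0
    for c1 in range(n + 1):
        for c2 in range(n + 1):
            w = binom_pmf(c1, n, p1) * binom_pmf(c2, n, p2)
            total += w * hyp_tail(k_crit_hyp(n, c1, c2, alpha), n, c1, c2)
    return total

level = expected_level(20, 0.1, 0.1, 0.05)
print(level)
```

Since every count-dependent level is at most α by construction, the expectation cannot exceed the nominal level; for small n it typically lies well below it, reflecting the conservativeness induced by the discreteness of the test statistic.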
5.1 Dependence of the Effective Significance Level on Spike Probabilities. In this section we investigate the dependence of the α-error probabilities on the spike probabilities p1 and p2, for both types of the test. Figure 7 shows the expected effective significance levels α_{ξ0} and the α-error probability ε^{bin}_{ξ0} for p = p1 = p2 ranging from 0.001 to 0.15. The surface plots in Figures 7C and 7D display the dependence of these curves on the number of bins n. Overall, Figure 7 shows that the curves for the α-error probability of the modified method α^{hyp}_{ξ0} lie above the curves for the α-error probability of the original method ε^{bin}_{ξ0}. Thus, especially for higher firing rates, where the difference between the curves amounts to a considerable fraction of the nominal significance level (α = 0.05), the α-error probability of the modified method lies closer to the nominal significance level.

In Figures 7A and 7B, we also display the count-dependent effective significance levels that would arise from the binomial coincidence count distribution with coincidence probability p² (indicated by diamonds), as considered by Roy et al. (2000). Note that in order to obtain a continuous sample of these count-dependent effective significance levels, we did not restrict the values of p² to possible realizations of c1 c2 / n² but instead treated p as a continuous variable. In contrast to the view expressed by Roy et al. (2000), we emphasize that according to our formalism, the interpretation of the ensuing sawtooth function (cf. Figure 7A) has to take into account that its independent variable p does not correspond to a neural spike probability, as it does in the calculation of α_{ξ0} and ε^{bin}_{ξ0}. Instead, this variable technically corresponds to an estimator of the spike probability, which would be based on realizations of the random variables C1 and C2 in applications of the method to experimental data.

In this context, it is important to recall that the count-dependent effective significance levels are different for different spike count combinations (cf. Figure 6). Thus, because of the stochastic nature of the spike counts, the tests will in general operate on different count-dependent effective significance levels for different realizations of the analysis window. Since this effect is due to the stochastic nature of the spike counts, it equally holds if the probabilities p1 and p2 underlying the different realizations of the spike counts would remain perfectly constant. Therefore, contrary to the suggestion by Roy et al. (2000), the variation of count-dependent effective significance levels between different realizations of the analysis window is not an issue of neural firing rates. In fact, the count-dependent effective significance levels do not depend on the spike probabilities p1 and p2. Therefore, a change of neural firing rates would in no way influence the count-dependent effective significance levels, that is, the shape of the sawtooth function in Figures 7A and 7B. Rather, it would lead to a change in the joint count distribution and, hence, affect the expected significance levels α_{ξ0} and the α-error probability ε^{bin}_{ξ0}. As shown in Figures 7A and 7B, computing these expectation values transforms the discrete sawtooth structure (diamonds) into smooth functions of p (curves). Thus, even for p as low as 0.02 (corresponding to a firing rate of 4 Hz for 5 ms bins) and n as in Figure 7, small changes in firing rate do not lead to large changes in the α-error probabilities of the two methods (as was claimed by Roy et al., 2000).

[Figure 7 about here. Panels: (A) effective significance level for n = 720; (B) n = 20, 100, 500, and 1000; (C, D) surface plots for the binomial and hypergeometric variants.]

Figure 7: Dependence of expected effective significance level and α-error probability on analysis window size, for (A) n = 720 and (B) n = 20, 100, 500, and 1000, respectively. Diamonds depict corresponding count-dependent effective significance levels (cf. Roy et al., 2000; see text for details). (C, D) The α-error probabilities for the binomial and hypergeometric coincidence count distribution for n ∈ {20, 40, ..., 1000}. The oscillatory behavior of the curves for n ≥ 100 at low values of p reflects the periodic structure of the count-dependent effective significance level landscape (cf. Figure 6), which for small p dominates the expectation values because of the increasing localization of the joint count distributions. The analytical results shown in this figure were checked by computer simulations, based on the Ran2 random number generator (cf. Galassi et al., 1998).

6 Discussion
We presented a modification of the statistical test underlying unitary-event analysis for the detection of neural synchrony. By incorporating the empirical spike counts into the calculation of the probability distribution of the test statistic (i.e., the coincidence count), we were able to remove the firing-rate estimation from the testing procedure. As a result, the distribution of the test statistic becomes independent of the a priori firing rates. The application of the modified method therefore avoids the problems of firing-rate estimation associated with statistical fluctuations in the spike counts.

To quantify the increase in sensitivity of the new method, we calculated and compared the test power of both tests with respect to violations of the null hypothesis of independent firing for various regimes of physiological parameters (firing rates; cf. Figure 5), degree of spike correlation (cf. Figures 3 and 5), analysis parameters (size of analysis window; cf. Figure 4), and significance level (cf. Figure 3). The spike probabilities p1 and p2 were chosen such that the corresponding neural firing rates (for bin size Δt = 5 ms) lay between 2 Hz and 30 Hz. These results are of dual importance. First, they directly specify the probabilities of detecting given deviations from independent firing with the two methods. These probabilities are not only important quantities to characterize the performance of the statistical test; they are also critical in the context of experimental design: they allow one to choose appropriate values for analysis parameters, such as the size of the time window, necessary to verify a theoretically predicted dependence between two neurons. Second, the power curves allow us to quantify the effect of the suggested modification of the testing procedure on the performance of the test in comparison to the original version.
This is important for reevaluating the results obtained with the original method and for pointing out parameter regimes where use of the modified version of the test is especially crucial. Overall, we found that for applications of the test to analysis windows comprising larger values of n (n ≥ 100), the proposed modification leads to an increase in test power of up to 0.12 (a 50% relative increase). This increase becomes especially pronounced for conservative nominal significance levels α, asymmetric firing regimes of the two neurons, high firing rates, and
moderate degrees of spike correlation. In general, for the firing rates analyzed, the peak values of the difference in test power were reached for values of the correlation between 0.05 and 0.15. This range corresponds to the empirical values of the asynchronous gain (ASG) found in the cortex (see reviews by Aertsen & Gerstein, 1985, and Abeles, 1991). For short time windows (n ≈ 20), we found that the test power of both methods falls below 0.2 (cf. Figure 4) for two neurons that operate below 30 Hz (Δt = 5 ms) with r ≈ 0.25, corresponding to the maximal ASG for cortical neurons reported by Abeles (1991). Thus, when applying the test to single-trial data with short analysis windows comprising only a low number of bins, one has to face substantial reductions in test power. For these applications of the test, the increase in test power due to the modification of the method is vital. In addition, it is important to realize that for low values of n and for low firing rates, the discrete nature of the underlying test statistic dominates the properties of the test. Their dependence on n and α is complex in this regime, so special care should be taken with respect to the experimental design.

In addition, we have calculated the expected effective significance levels of the two tests and their α-error probabilities when applied to two neurons operating at equal rates, ranging from 0.2 Hz to 30 Hz (Δt = 5 ms). As discussed in detail in section 5, it is important to differentiate between the count-dependent effective significance level, which does not depend on the neuronal spike probabilities, and the expected effective significance level, which depends on the spike probabilities through the joint count distribution. Contrasting the view expressed by Roy et al. (2000), who based their analysis on the count-dependent effective significance levels, our calculations show that the α-error probabilities of both methods vary only slowly as a function of firing rate above 4 Hz.
Thus, in this regime, the probability of falsely indicating neural synchrony is only moderately sensitive to the firing-rate levels of the investigated neurons. While this result describes the behavior of the method when applied to neurons operating at certain firing-rate levels, that is, independent of any specific empirical realization of spike counts (c1, c2), the count-dependent effective significance level (i.e., the effective significance level of the test when applied to a stretch of data with a specific combination of spike counts) does show considerable fluctuations depending on the joint counts in the analysis window (cf. Figure 6). This indeed implies fluctuations of the count-dependent effective significance level with respect to different realizations of the analysis window. We emphasize that these fluctuations are not the result of changes in neural firing rates, but the consequence of stochastic fluctuations in the counts themselves.

For firing rates below 4 Hz, the joint count distribution tends to concentrate on a small number of count combinations. This causes the changes of the α-error probabilities ε^{bin}_{ξ0} and ε^{hyp}_{ξ0} of the two versions of the test with changes in neural firing rates to become more pronounced. For neurons operating at rate levels below 1 Hz, the joint count distribution becomes so narrow that the expectation value of the count-dependent effective significance level essentially behaves like the count-dependent effective significance level itself. Thus, the probability of producing a false positive when applying the method at firing rates below 1 Hz steeply decreases as the firing rate approaches lower values. As a consequence, in this firing-rate regime, the expected effective significance level of the test becomes sensitive to the firing rates of the investigated neurons. Since, by construction of the test, the effective significance level cannot surpass the nominal significance level α, the problem is not that the test could produce more false positives than expected. Rather, the decrease of the effective significance level for low firing rates implies that the test will effectively operate at a very conservative significance level. However, as can be seen from the power curves (cf. Figure 5), this decrease in effective significance level is accompanied by a decrease in test power. Thus, for very low firing rates, both versions of the test will fail to detect deviations from the null hypothesis of independent firing.

Finally, we could show that the use of the hypergeometric coincidence count distribution allows us to relax the stationarity requirement on the neuronal firing rates. In contrast to the original method, which had to assume the firing rates of both neurons to remain stationary over all bins of the analysis window, the modified test requires only one of the two neurons to have a stationary firing rate, while the other can follow an arbitrary time course. This generalization implies an important increase in the applicability of the method to empirical data.
We are currently investigating whether this "one-sided" stationarity criterion can be further relaxed, possibly by imposing joint (but weaker) requirements on both neuronal rate profiles. The robustness of the modified method with respect to violation of the "one-sided" stationarity assumption will also be the subject of further research. A conceptually different approach to treating nonstationary neural data with count-based statistics could be based on the use of estimators for the instantaneous firing rate. Following recent work by Pauluis and Baker (2000), who implement a rate-based version of unitary-event analysis (Grün, 1996) in connection with an instantaneous rate estimation procedure, it might be interesting to use instantaneous rate estimation (Nawrot, Aertsen, & Rotter, 1999) together with the conditional coincidence count distribution used here for variable spike event probabilities. This approach seems capable of combining the advantages of count-based statistics with the improved applicability of instantaneous rate estimators to nonstationary data sets.

In conclusion, in view of the increase in test power, the increased interpretability of the significance measure, and the relaxation of the stationarity requirement, we clearly recommend implementation of the count-based rather than the rate-based version of this analysis method when testing the statistical significance of coincident spikes.
Appendix A: Stochastic Model
According to the definition of a spike event given in section 2, we define the probability space (Ω_i, P_{ξ_i}) for each pair of bins within the analysis window (indexed by i ∈ {1, ..., n}) on the basis of the sample space

    \Omega_i := \{ \omega_i = (\omega_{1,i}, \omega_{2,i}) : \omega_{1,i}, \omega_{2,i} \in \{0, 1\} \},    (A.1)

and a probability P_{ξ_i} for each of the four possible outcomes (0,0), (0,1), (1,0), (1,1). A parameterization of these probabilities in terms of the individual spike event probabilities p_{1,i} and p_{2,i},

    P_{\xi_i}(\omega_{1,i} = 1) = p_{1,i} \quad \text{and} \quad P_{\xi_i}(\omega_{2,i} = 1) = p_{2,i} \qquad (p_{1,i}, p_{2,i} \in (0, 1)),    (A.2)

and the spike correlation r_i between the two neurons,

    \mathrm{Corr}(\omega_{1,i}, \omega_{2,i}) = r_i \qquad (r_i \in (-1, 1)),    (A.3)

leads to the definition

    P_{\xi_i}(\omega_i = (1, 1)) := p_{1,i}\, p_{2,i} + r_i R_i    (A.4)
    P_{\xi_i}(\omega_i = (1, 0)) := p_{1,i} (1 - p_{2,i}) - r_i R_i    (A.5)
    P_{\xi_i}(\omega_i = (0, 1)) := (1 - p_{1,i})\, p_{2,i} - r_i R_i    (A.6)
    P_{\xi_i}(\omega_i = (0, 0)) := (1 - p_{1,i})(1 - p_{2,i}) + r_i R_i,    (A.7)

with R_i = \sqrt{p_{1,i}(1 - p_{1,i})\, p_{2,i}(1 - p_{2,i})} and ξ_i := (p_{1,i}, p_{2,i}, r_i). Based on equations A.4 through A.7, the conditional probabilities to observe a spike event from neuron 2, given the behavior of neuron 1, are

    \vartheta_i := P_{\xi_i}(\omega_{2,i} = 1 \mid \omega_{1,i} = 1) = p_{2,i} + \frac{r_i R_i}{p_{1,i}}    (A.8)

    \eta_i := P_{\xi_i}(\omega_{2,i} = 1 \mid \omega_{1,i} = 0) = p_{2,i} - \frac{r_i R_i}{1 - p_{1,i}}.    (A.9)
These conditional probabilities will be used in appendix B, yielding compact expressions for the probability distribution functions.

Starting from assumptions A1 and A2 as stated in section 2, we can define the probability space describing the entire analysis window by forming the product space (Ω, P_ξ) with

    \Omega := \Omega_1 \times \cdots \times \Omega_n, \qquad P_\xi := \prod_{i=1}^{n} P_{\xi_i},    (A.10)

and samples ω = (ω_1, ..., ω_n). Based on this product space, we define the discrete random variables

    C_m(\omega) := \sum_{i=1}^{n} \omega_{m,i}, \qquad m = 1, 2,    (A.11)

denoting the total number of spike events from neuron m in the analysis window. Similarly,

    K(\omega) := \sum_{i=1}^{n} \omega_{1,i} \cdot \omega_{2,i}    (A.12)

denotes the number of coincident spike events from the two neurons.

Appendix B: Probability Distributions
For the derivations in this section, we will assume stationarity of all parameter triplets ξ_i in the analysis window, as formulated in assumption A1. We let ξ = ξ_i for all i = 1, 2, ..., n. Given the probabilities of the four possible spike event constellations of any two coinciding bins (see equations A.4–A.7) and using the conditional spike probabilities ϑ and η (see equations A.8 and A.9), application of the multinomial distribution (cf. Feller, 1968) yields the probability of finding k coincident events and c1 and c2 spike events from the two neurons, respectively, that is, of making an observation ω with K(ω) = k, C1(ω) = c1, and C2(ω) = c2:

    P_\xi(K = k, C_1 = c_1, C_2 = c_2)
      = \frac{n!}{k!\,(c_1 - k)!\,(c_2 - k)!\,(n - c_1 - c_2 + k)!}
        \left[ p_1 \vartheta \right]^{k} \left[ p_1 (1 - \vartheta) \right]^{c_1 - k}
        \left[ (1 - p_1)\, \eta \right]^{c_2 - k} \left[ (1 - p_1)(1 - \eta) \right]^{n - c_1 - c_2 + k}.    (B.1)

Using

    P_\xi(C_1 = c_1) = \binom{n}{c_1} p_1^{c_1} (1 - p_1)^{n - c_1},    (B.2)

it follows that

    P_\xi(K = k, C_2 = c_2 \mid C_1 = c_1)
      = \frac{P_\xi(K = k, C_1 = c_1, C_2 = c_2)}{P_\xi(C_1 = c_1)}
      = \binom{c_1}{k} \binom{n - c_1}{c_2 - k} \vartheta^{k} (1 - \vartheta)^{c_1 - k} \eta^{c_2 - k} (1 - \eta)^{n - c_1 - c_2 + k}.    (B.3)
Since the sample space Ω can be decomposed into disjoint subsets containing all elementary events with a specific coincidence count, we can write the joint count distribution as

    P_\xi(C_1 = c_1, C_2 = c_2)
      = P_\xi(C_1 = c_1)\, P_\xi(C_2 = c_2 \mid C_1 = c_1)
      = P_\xi(C_1 = c_1) \sum_{k=0}^{\min(c_1, c_2)} P_\xi(K = k, C_2 = c_2 \mid C_1 = c_1),    (B.4)

which by insertion of equation B.3 turns into

    P_\xi(C_1 = c_1, C_2 = c_2)
      = \binom{n}{c_1} p_1^{c_1} (1 - p_1)^{n - c_1}
        \sum_{k=0}^{\min(c_1, c_2)} \binom{c_1}{k} \binom{n - c_1}{c_2 - k}
        \vartheta^{k} (1 - \vartheta)^{c_1 - k} \eta^{c_2 - k} (1 - \eta)^{n - c_1 - c_2 + k}.    (B.5)
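As a numerical sanity check of the joint count distribution in equation B.5, the sketch below (our own, with ad hoc parameter values) verifies that the distribution sums to unity over all count pairs and that for r = 0 it factorizes into the product of the two binomial count distributions:

```python
from math import comb, sqrt

def joint_count_pmf(c1, c2, n, p1, theta, eta):
    # equation B.5
    pref = comb(n, c1) * p1**c1 * (1 - p1)**(n - c1)
    s = sum(comb(c1, k) * comb(n - c1, c2 - k)
            * theta**k * (1 - theta)**(c1 - k)
            * eta**(c2 - k) * (1 - eta)**(n - c1 - c2 + k)
            for k in range(min(c1, c2) + 1))
    return pref * s

n, p1, p2, r = 10, 0.2, 0.3, 0.25
R = sqrt(p1 * (1 - p1) * p2 * (1 - p2))
theta = p2 + r * R / p1        # equation A.8
eta = p2 - r * R / (1 - p1)    # equation A.9

# normalization over all count pairs
total = sum(joint_count_pmf(c1, c2, n, p1, theta, eta)
            for c1 in range(n + 1) for c2 in range(n + 1))

# for r = 0 (theta = eta = p2) the counts are independent binomials
ind = joint_count_pmf(3, 4, n, p1, p2, p2)
sep = (comb(n, 3) * p1**3 * (1 - p1)**7) * (comb(n, 4) * p2**4 * (1 - p2)**6)
print(total, ind, sep)
```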
Going back to equation B.1, it is now straightforward to derive the conditional coincidence count distribution. By inserting equations B.1 and B.5 into

    P_\xi(K = k \mid C_1 = c_1, C_2 = c_2) = \frac{P_\xi(K = k, C_1 = c_1, C_2 = c_2)}{P_\xi(C_1 = c_1, C_2 = c_2)},    (B.6)

we find

    P_\xi(K = k \mid C_1 = c_1, C_2 = c_2)
      = \frac{\binom{c_1}{k} \binom{n - c_1}{c_2 - k} \vartheta^{k} (1 - \vartheta)^{c_1 - k} \eta^{c_2 - k} (1 - \eta)^{n - c_1 - c_2 + k}}
             {\sum_{k'=0}^{\min(c_1, c_2)} \binom{c_1}{k'} \binom{n - c_1}{c_2 - k'} \vartheta^{k'} (1 - \vartheta)^{c_1 - k'} \eta^{c_2 - k'} (1 - \eta)^{n - c_1 - c_2 + k'}}.    (B.7)

For independent neurons (r = 0), the conditional probabilities ϑ and η become equal to the spike probability of neuron 2 (ϑ = η = p2). Therefore, it is a matter of straightforward substitution to derive the conditional coincidence count distribution P_{ξ0}(K = k | C1 = c1, C2 = c2) and the joint count distribution P_{ξ0}(C1 = c1, C2 = c2) for independent neurons from the general expressions in equations B.7 and B.5, respectively. Finally, the general form of the unconditional coincidence count distribution (cf. section 2) is found by replacing p2 of equation 2.1 with ϑ, so that the term p1 ϑ corresponds to the general form of the probability to observe a pair of coincident spikes within one bin (cf. equation A.4).

Appendix C: Nonstationary Rates
Relaxing the stationarity assumption A1, we return to the general formulation of our stochastic model as developed in section 2 and appendix A. Thus, we replace the parameter triplet ξ with a vector of triplets ξ⃗. Its n components ξ_i = (p_{1,i}, p_{2,i}, r_i) describe the probability space (Ω_i, P_{ξ_i}) of each individual pair of corresponding bins in the analysis window. Assuming stochastic independence of the two neurons (i.e., H0), we can rewrite equation B.6 as

    P_{\vec{\xi}_0}(K = k \mid C_1 = c_1, C_2 = c_2)
      = \frac{P_{\vec{\xi}_0}(K = k, C_1 = c_1, C_2 = c_2)}{P_{\vec{\xi}_0}(C_1 = c_1)\, P_{\vec{\xi}_0}(C_2 = c_2)},    (C.1)
where we used ξ⃗0 to indicate this parameter setting under the condition H0. Following assumption A2, we can write the probability for a specific realization as the product of the probabilities of obtaining or not obtaining a spike event in each of the corresponding bins, respectively. Thus, the probability P_{ξ⃗0}(K = k, C1 = c1, C2 = c2) of making an observation with c1 spike events from neuron 1, c2 spike events from neuron 2, and k coincident events is given by the sum over the probabilities of all possible arrangements of this count configuration. For M = {1, 2, ..., n}, where n is the number of bins in the analysis window, we define the set M_{c1} as the collection of all subsets of M with c1 elements. Further, for any μ ∈ M_{c1}, we let the set M^{μ,k}_{c2} denote the collection of all subsets of M with c2 elements in total and k elements in common with μ. Using this notation, we have

    P_{\vec{\xi}_0}(K = k, C_1 = c_1, C_2 = c_2)
      = \sum_{\mu \in M_{c_1}} \sum_{\lambda \in M^{\mu,k}_{c_2}}
        \prod_{i \in \mu} p_{1,i} \prod_{j \in M \setminus \mu} (1 - p_{1,j})
        \prod_{l \in \lambda} p_{2,l} \prod_{m \in M \setminus \lambda} (1 - p_{2,m}).    (C.2)

Using the same notation, the probability P_{ξ⃗0}(C1 = c1) is given by

    P_{\vec{\xi}_0}(C_1 = c_1)
      = \sum_{\mu \in M_{c_1}} \prod_{i \in \mu} p_{1,i} \prod_{j \in M \setminus \mu} (1 - p_{1,j}).    (C.3)
The probability P_{ξ⃗0}(C2 = c2) can be expressed analogously. Thus, we can rewrite equation C.1 as

    P_{\vec{\xi}_0}(K = k \mid C_1 = c_1, C_2 = c_2)
      = \frac{\sum_{\mu \in M_{c_1}} \sum_{\lambda \in M^{\mu,k}_{c_2}}
              \prod_{i \in \mu} p_{1,i} \prod_{j \in M \setminus \mu} (1 - p_{1,j})
              \prod_{l \in \lambda} p_{2,l} \prod_{m \in M \setminus \lambda} (1 - p_{2,m})}
             {\left[ \sum_{\gamma \in M_{c_1}} \prod_{i \in \gamma} p_{1,i} \prod_{j \in M \setminus \gamma} (1 - p_{1,j}) \right]
              \left[ \sum_{\kappa \in M_{c_2}} \prod_{i \in \kappa} p_{2,i} \prod_{j \in M \setminus \kappa} (1 - p_{2,j}) \right]}.    (C.4)
This equation describes the conditional probability that, given the individual spike event counts c1 and c2, one will observe k coincident events from two stochastically independent neurons with spike event probabilities according to ξ⃗0. Assuming a stationary rate for neuron 2 (i.e., p_{2,i} = p2 for all i), equation C.4 reduces to

    P_{\vec{\xi}_0}(K = k \mid C_1 = c_1, C_2 = c_2)
      = \frac{\left[ \sum_{\mu \in M_{c_1}} \prod_{i \in \mu} p_{1,i} \prod_{j \in M \setminus \mu} (1 - p_{1,j}) \right]
              \binom{c_1}{k} \binom{n - c_1}{c_2 - k}\, p_2^{c_2} (1 - p_2)^{n - c_2}}
             {\left[ \sum_{\gamma \in M_{c_1}} \prod_{i \in \gamma} p_{1,i} \prod_{j \in M \setminus \gamma} (1 - p_{1,j}) \right]
              \binom{n}{c_2} p_2^{c_2} (1 - p_2)^{n - c_2}}
      = \frac{\binom{c_1}{k} \binom{n - c_1}{c_2 - k}}{\binom{n}{c_2}},    (C.5)

which is the hypergeometric distribution as given in equation 2.2, independent of whether the firing rate of neuron 1 is stationary over the observation interval. Obviously, by interchanging the roles of the two neurons in the definitions of M_{c1} and M^{μ,k}_{c2}, the same result can be obtained for a stationary rate in neuron 1, that is, for p_{1,i} = p1 for all i. In other words, stationarity of only one of the two neurons (either one) suffices to obtain the result in equation C.5.

Appendix D: Approximative Evaluation of the Expectation Values
The number of terms in the sums underlying the computation of the expectation values for the count-dependent test power and significance level with respect to the joint count distribution (cf. equations 4.4, 4.5, and 5.1–5.3) grows with n². Thus, the direct evaluation of the sums for realistic values of n, which can reach up to thousands, is rather impractical. However, by using the fact that the mass of the joint count distribution is mainly concentrated at relatively few count combinations, it is straightforward to
calculate an approximate value of these quantities up to arbitrary precision δ, for example, the approximate test power p^{approx}_{ξ,δ}. Selecting B_{ξ,δ} ⊂ {0, 1, ..., n} × {0, 1, ..., n} such that

    \sum_{B_{\xi,\delta}} P_\xi(C_1 = c_1, C_2 = c_2) \geq 1 - \delta,    (D.1)

we find the approximate test power p^{approx}_{ξ,δ} given by

    p^{approx}_{\xi,\delta}
      = \sum_{(c_1, c_2) \in B_{\xi,\delta}} P_\xi(C_1 = c_1, C_2 = c_2)\, P_\xi(K \geq k_{crit}(c_1, c_2) \mid c_1, c_2).    (D.2)
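A simplified sketch of this approximation scheme (our own; it selects count pairs by globally sorting their probabilities, which is merely one admissible choice of B, since equation D.1 does not determine the set uniquely) illustrates how strongly the joint count mass concentrates for independent neurons:

```python
from math import comb

def binom_pmf(c, n, p):
    return comb(n, c) * p**c * (1 - p)**(n - c)

def count_support(n, p1, p2, delta):
    # greedily collect the most probable (c1, c2) pairs until their
    # joint mass reaches 1 - delta
    pmf = {(c1, c2): binom_pmf(c1, n, p1) * binom_pmf(c2, n, p2)
           for c1 in range(n + 1) for c2 in range(n + 1)}
    support, mass = [], 0.0
    for pair, q in sorted(pmf.items(), key=lambda kv: kv[1], reverse=True):
        support.append(pair)
        mass += q
        if mass >= 1 - delta:
            break
    return support, mass

B, mass = count_support(100, 0.05, 0.05, 1e-4)
print(len(B), (100 + 1) ** 2, mass)
```

Even with δ as small as 10⁻⁴, the retained set B covers only a small fraction of the (n + 1)² count combinations, which is what makes the approximate evaluation tractable.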
Since all probabilities are smaller than unity, it is clear from equation D.1 that p_ξ − p^{approx}_{ξ,δ} ≤ δ. When computing the effective significance levels α_{ξ0} and ε_{ξ0}, the expectation value is formed over quantities smaller than α. Thus, the precision of the approximation of these significance levels improves to αδ. Note that B_{ξ,δ} is not uniquely defined through equation D.1. In our calculations, B_{ξ,δ} was determined by successively adding up the mass of individual spike event count combinations until the cutoff value 1 − δ was reached. To keep the number of individual spike count combinations entering B_{ξ,δ} reasonably low, we started with the central term (cf. Feller, 1968) of the joint count distribution of independent neurons and iteratively added those count combinations from the surrounding of B_{ξ,δ} that contributed most. Although not required for this procedure, this approach was motivated by the fact that the joint count distribution falls off monotonically with increasing distance from its central term.

Appendix E: Randomized Tests
The basic idea behind randomized tests (Mood et al., 1974) is to define a test through a critical function ψ that specifies rejection probabilities for given empirical observations, rather than through a fixed critical region of rejection R. Through the incorporation of an additional independent random variable, it becomes possible to adjust the effective significance level of the test to match its nominal significance level α precisely, regardless of any discreteness of its test statistic. Applying this concept to the test based on the hypergeometric coincidence count distribution, this means that the decision of the test will no longer be based on the critical region R_{c1,c2} = {k : k ≥ k^{hyp}_{crit}} of critical coincident spike event counts k but rather on the critical function

    \psi_{c_1,c_2}(k) :=
      \begin{cases}
        1 & k \geq k^{hyp}_{crit}(c_1, c_2) \\
        \phi_{c_1,c_2} & k = k^{hyp}_{crit}(c_1, c_2) - 1 \\
        0 & k < k^{hyp}_{crit}(c_1, c_2) - 1,
      \end{cases}    (E.1)
which for every element ω of the sample space Ω sets the probability of rejecting the null hypothesis. Thus, while the null hypothesis will always be rejected if k ≥ k^{hyp}_{crit}(c1, c2) and never be rejected if k < k^{hyp}_{crit}(c1, c2) − 1, the parameter φ_{c1,c2} controls the rejection probability for all ω with k = k^{hyp}_{crit}(c1, c2) − 1. In order to adjust the effective significance level of the test to its nominal significance level α, we define φ_{c1,c2} such that

    \phi_{c_1,c_2} := \frac{\alpha - P_{\xi_0}\left( K \geq k^{hyp}_{crit}(c_1, c_2) \right)}{P_{\xi_0}\left( K = k^{hyp}_{crit}(c_1, c_2) - 1 \right)}.    (E.2)

From here it is straightforward to see that the probability of falsely rejecting the null hypothesis of stochastically independent neurons reduces to

    P_{\xi_0}\left( K \geq k^{hyp}_{crit}(c_1, c_2) \right) + \phi_{c_1,c_2}\, P_{\xi_0}\left( K = k^{hyp}_{crit}(c_1, c_2) - 1 \right) = \alpha,    (E.3)

and thus for all count combinations precisely corresponds to the nominal level of significance. Note that this procedure adjusts the probability of falsely rejecting the null hypothesis by introducing rejections of the null hypothesis with probability φ_{c1,c2} for all count constellations with k = k^{hyp}_{crit}(c1, c2) − 1. While this raises the probability of a false rejection to the nominal α-level of the test, and correspondingly increases its test power, the outcome of the test for a given set of data becomes a random variable. Thus, repeated applications of the test to the same data will in general lead to different findings. This indeterminacy of randomized tests is the reason for their restricted use in applied statistics (Mood et al., 1974).

By straightforward extension of equation 4.5, we find the power p^{hyp}_{ξ,Rnd} of the randomized version of the test to be given by

    p^{hyp}_{\xi,Rnd}
      = \sum_{c_1, c_2 = 0}^{n} P_\xi(C_1 = c_1, C_2 = c_2)
        \left[ P_\xi\left( K \geq k^{hyp}_{crit}(c_1, c_2) \mid c_1, c_2 \right)
             + \phi_{c_1,c_2}\, P_\xi\left( K = k^{hyp}_{crit}(c_1, c_2) - 1 \mid c_1, c_2 \right) \right].    (E.4)
Comparison of the power curves for the randomized versus the nonrandomized test in the same parameter regime as used in Figure 3 demonstrates that, as expected, the power curves of the randomized version of the test lie above the values reached by the nonrandomized version. The maximum increase in test power ranges from approximately 0.05 for α = 0.01 (r = 0.1) to approximately 0.07 for α = 0.1 (r = 0.05). Thus, the effect of randomization becomes larger for more permissive significance levels α. Finally, we note
that the randomized version of the test based on the hypergeometric coincidence count distribution, that is, of Fisher's exact test, is uniformly most powerful unbiased for testing independence in a 2×2 table (Lehmann, 1997).

Acknowledgments
We thank Hans Rudolf Lerche for valuable discussions and for calling our attention to Fisher's exact test. We thank Sonja Grün for helpful discussions and Arup Roy for kindly providing us with his manuscript prior to publication. We gratefully acknowledge Alexandre Kuhn for helpful comments on the manuscript. This work was supported in part by the Studienstiftung des deutschen Volkes, the German–Israeli Foundation for Scientific Research and Development (GIF), the Deutsche Forschungsgemeinschaft (DFG), and the Institut für Grenzgebiete der Psychologie, Freiburg.

References

Abeles, M. (1982). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70(4), 1629–1638.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol., 60(3), 909–924.
Aertsen, A., Bonhoeffer, T., & Krüger, J. (1987). Coherent activity in neuronal populations: Analysis and interpretation. In E. R. Caianiello (Ed.), Physics of cognitive processes (pp. 1–34). Singapore: World Scientific Publishing.
Aertsen, A., & Gerstein, G. L. (1985). Evaluation of neuronal connectivity: Sensitivity of cross-correlation. Brain Research, 340, 341–354.
Aertsen, A., & Gerstein, G. L. (1991). Dynamic aspects of neuronal cooperativity: Fast stimulus-locked modulations of effective connectivity. In J. Krüger (Ed.), Neuronal cooperativity (pp. 52–67). Stuttgart: Springer-Verlag.
Aertsen, A., Gerstein, G., Habib, M., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Feller, W. (1968). An introduction to probability theory and its applications (3rd ed.). New York: Wiley.
Galassi, M., Davies, J., Theiler, J., Gough, B., Priedhorsky, R., Jungman, G., & Booth, M. (1998). GNU Scientific Library—Reference manual (0.4after ed.). Cambridge, MA: Free Software Foundation. Available online at: http://sources.redhat.com/gsl.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Trans. Biomed. Eng., 36, 4–14.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, significance, and interpretation. Thun: Verlag Harri Deutsch.
Grün, S., Diesmann, M., & Aertsen, A. (in press-a). "Unitary events" in multiple single neuron spiking activity. I. Detection and significance. Neural Computation.
Grün, S., Diesmann, M., & Aertsen, A. (in press-b). "Unitary events" in multiple single neuron spiking activity. II. Non-stationary data. Neural Computation.
Grün, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events without discretization of time. J. Neurosci. Meth., 94, 67–79.
Gütig, R., Rotter, S., Grün, S., & Aertsen, A. (2000). Significance of coincident spikes: Count-based versus rate-based statistics. In Ninth Annual Computational Neuroscience Meeting CNS*2000 (p. 69). Brussels.
Kreiter, A. K., & Singer, W. (1996). On the role of neural synchrony in the primate visual cortex. In A. Aertsen & V. Braitenberg (Eds.), Brain theory—Biological basis and computational principles (pp. 201–227). Amsterdam: Elsevier.
Lehmann, E. (1997). Testing statistical hypotheses (2nd ed.). New York: Springer-Verlag.
Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the theory of statistics (3rd ed.). Tokyo: McGraw-Hill.
Nawrot, M., Aertsen, A., & Rotter, S. (1999). Single-trial estimation of neuronal firing rates. J. Neurosci. Meth., 94, 81–91.
Nicolelis, M. A. L. (Ed.). (1998). Methods for neural ensemble recordings. Boca Raton, FL: CRC Press.
Palm, G., Aertsen, A., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biol. Cybern., 59, 1–11.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of instantaneous discharge probability, with application to unitary joint-event analysis. Neural Comp., 12, 647–669.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Hamutal, S., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79(6), 2857–2874.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Roy, A., Steinmetz, P. N., & Niebur, E. (2000). Rate limitations of unitary event analysis. Neural Comp., 12, 2063–2082.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Singer, W. (1999). Time as coding space. Curr. Op. Neurobiol., 9(2), 189–194.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.

Received August 1, 2000; accepted March 19, 2001.
LETTER
Communicated by Peter Dayan
Representational Accuracy of Stochastic Neural Populations

Stefan D. Wilke
[email protected]
Christian W. Eurich
[email protected]
Institut für Theoretische Physik, Universität Bremen, 28334 Bremen, Germany

Fisher information is used to analyze the accuracy with which a neural population encodes D stimulus features. It turns out that the form of response variability has a major impact on the encoding capacity and therefore plays an important role in the selection of an appropriate neural model. In particular, in the presence of baseline firing, the reconstruction error rapidly increases with D in the case of Poissonian noise but not for additive noise. The existence of limited-range correlations of the type found in cortical tissue yields a saturation of the Fisher information content as a function of the population size only for an additive noise model. We also show that random variability in the correlation coefficient within a neural population, as found empirically, considerably improves the average encoding quality. Finally, the representational accuracy of populations with inhomogeneous tuning properties, either with variability in the tuning widths or fragmented into specialized subpopulations, is superior to the case of identical and radially symmetric tuning curves usually considered in the literature.
Neural Computation 14, 155–189 (2001)
© 2001 Massachusetts Institute of Technology

1 Introduction

Despite impressive progress in the neurosciences, the code by which neural systems transmit and represent stimulus information has remained largely enigmatic (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; deCharms & Zador, 2000). A common way to achieve an understanding of this code is to study the encoding properties of sensory or motor neural systems under biological constraints such as receptive field sizes, correlations in the neuronal firing, or energy consumption. A measure of the neural encoding ability is the stimulus reconstruction error—the "difference" between the actual stimulus and a stimulus estimate obtained solely from measured neural responses. Theoretically, the calculation of reconstruction errors is closely related to estimating an unknown parameter from a set of samples of a random variable, which is a standard problem of statistical estimation theory (Cover & Thomas, 1991; Kay, 1993). Solving this problem can provide an answer
to the question of which coding strategy is favorable under given circumstances. Many works of this kind have concentrated, for instance, on tuning properties of neurons within a population, trying to derive under which circumstances broad or narrow tuning is favorable (Hinton, McClelland, & Rumelhart, 1986; Baldi & Heiligenberg, 1988; Snippe & Koenderink, 1992a; Eurich & Schwegler, 1997; Zhang & Sejnowski, 1999), and, more recently, on the effects of correlated variability in a population (Snippe & Koenderink, 1992b; Abbott & Dayan, 1999; Yoon & Sompolinsky, 1999; Karbowski, 2000). Most of these studies have given little attention to what is in fact one of the most important features of neural systems in this context: the response variability, which makes the reconstruction problem interesting in the first place. While most studies have simply assumed Poissonian or additive gaussian noise on the responses, we will show in this article that the choice of a variability model has a major, nontrivial impact on the encoding properties of the neural population. The immense variability of individual response parameters, such as tuning widths or correlation coefficients, has also been neglected in most previous work. Although these parameter variations are always found in empirical data, they were considered functionally insignificant, and hence theoretical studies have almost always assumed uniform parameters throughout the population. We will show here that this uniform case is unfavorable in the sense that the introduction of parameter variability improves the encoding performance. In this study, neural population codes are analyzed using the Fisher information measure, which has turned out to be a powerful tool for studying questions of neural coding (Paradiso, 1988; Seung & Sompolinsky, 1993; Brunel & Nadal, 1998; Zhang & Sejnowski, 1999; Pouget, Deneve, Ducom, & Latham, 1999; Karbowski, 2000; Johnson, Gruner, Baggerly, & Seshagiri, 2001).
As compared to mutual information and Bayesian techniques, it is especially suitable if the stimulus itself is not a random variable. Here, Fisher information provides a unified framework in which multiple aspects of population responses—response variability models, noise correlations, and tuning widths—can be discussed. In the following section, we review the concept of Fisher information and specify the neural population model to be studied. In sections 3, 4, and 5, we discuss the representational accuracy of the population for different response variability models, noise correlation schemes, and tuning width distributions, respectively. Section 6 provides a summary of the main results. The major equations of the present study are derived in two mathematical appendixes.
2 Theoretical Framework

2.1 Fisher Information and the Cramér-Rao Bound. Consider a population of $N$ neurons encoding a single stimulus $\mathbf{x} = (x_1, \ldots, x_D)$ in a $D$-dimensional space, the so-called implicit space (Zemel & Hinton, 1995). The
stimulus components $x_i$ will be referred to as features; they correspond to the different physical properties of the stimulus to which the neurons are sensitive, such as position, velocity, brightness, and contrast, in the case of visual neurons. The neurons fire their action potentials stochastically, so that the spike count vector $\mathbf{n} := (n_1, \ldots, n_N)$ measured in a single trial of observation time $\tau$ is a discrete random variable. The spike counts are assumed to carry all stimulus information, so that the population's coding scheme is characterized by the joint probability $P(\mathbf{n} \mid \mathbf{x})$ for observing the spike count vector $\mathbf{n}$ given the stimulus value $\mathbf{x}$. Neural systems are believed to use codes other than the spike count to transmit information (Eckhorn, Grüsser, Kröller, Pellnitz, & Pöpel, 1976; Rieke et al., 1997; Gautrais & Thorpe, 1998). However, in many cases a large amount of sensory information is encoded in the mean spike count (Tovée, Rolls, Treves, & Bellis, 1993; Heller, Hertz, Kjaer, & Richmond, 1995). The representational capacity of the neural population can be quantified by how well the stimulus $\mathbf{x}$ can be recovered from an observation $\mathbf{n}$. In the framework of classical estimation theory, a rule to construct an estimate $\hat{\mathbf{x}}(\mathbf{n})$ of the actual stimulus $\mathbf{x}$ from an observation $\mathbf{n}$ is called an estimator. A given estimator $\hat{\mathbf{x}}$ is usually characterized by two quantities: its bias $b_{\hat{\mathbf{x}}}(\mathbf{x}) := E[\hat{\mathbf{x}}] - \mathbf{x}$ and its variance $v_{i,\hat{\mathbf{x}}}(\mathbf{x}) := E[\hat{x}_i^2] - E^2[\hat{x}_i]$ for $i = 1, \ldots, D$, where $E[\cdot]$ denotes the expectation value with respect to $P(\mathbf{n} \mid \mathbf{x})$ at fixed $\mathbf{x}$. While the bias is a measure of systematic deviations, the variance describes the scatter around the mean estimate $E[\hat{\mathbf{x}}]$. It is a common measure of encoding quality since it can be interpreted as a mean-squared reconstruction error.
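These two quantities can be estimated empirically. The following sketch is our own construction, not the paper's (all parameter values and names are ours): it measures bias and variance of the sample-mean estimator for a one-dimensional stimulus corrupted by additive gaussian noise.

```python
import numpy as np

# Our own illustration of estimator bias and variance: the sample-mean
# estimator for a stimulus x encoded as the mean of additive gaussian
# responses.  b(x) = E[x_hat] - x and v(x) = E[x_hat^2] - E^2[x_hat]
# are estimated over many simulated trials.

rng = np.random.default_rng(0)
x_true, sigma, n_obs, n_trials = 2.0, 1.0, 50, 20000
samples = rng.normal(x_true, sigma, size=(n_trials, n_obs))
x_hat = samples.mean(axis=1)              # one estimate per trial
bias = x_hat.mean() - x_true
variance = x_hat.var()
# bias is close to 0; variance is close to sigma^2 / n_obs = 0.02,
# which for this gaussian model is also the Cramer-Rao bound: the
# sample mean is an efficient estimator here.
```

For this simple model the estimator is unbiased and attains the lower bound introduced next; for general population codes, neither property is guaranteed.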
There is a convenient formula for a lower bound on the variance of any unbiased estimator, $b_{\hat{\mathbf{x}}}(\mathbf{x}) = 0$: the famous Cramér-Rao bound (CRB) (Rao, 1945; Cramér, 1946):
$$v_{i,\hat{\mathbf{x}}}(\mathbf{x}) \geq [J(\mathbf{x})^{-1}]_{ii} \quad \text{for all } \mathbf{x} \text{ and } i = 1, \ldots, D. \tag{2.1}$$
It states that unbiased estimates of the stimulus $\mathbf{x}$ have a variance that is at least the corresponding matrix element of the inverse of the Fisher information matrix $J(\mathbf{x})$, which is calculated from the probability distribution $P(\mathbf{n} \mid \mathbf{x})$ via
$$J_{ij}(\mathbf{x}) := -E\left[\frac{\partial^2}{\partial x_i \partial x_j} \ln P(\mathbf{n} \mid \mathbf{x})\right], \quad i, j = 1, \ldots, D. \tag{2.2}$$
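Equation 2.2 can be checked numerically in the simplest case of a single Poisson neuron, where the expectation is an explicit sum over spike counts and can be compared with the well-known closed form $J(x) = \tau f'(x)^2/f(x)$. The sketch below is our own illustration; the tuning curve and all function names are hypothetical.

```python
import numpy as np
from math import lgamma

# Our own check of eq. 2.2 for one Poisson neuron: evaluate the
# expectation of -d^2 ln P(n|x) / dx^2 as an explicit sum over spike
# counts, and compare with the closed form J(x) = tau * f'(x)^2 / f(x).

def rate(x):                                   # hypothetical tuning curve
    return 10.0 + 90.0 * np.exp(-x ** 2 / 2.0)

def fisher_by_definition(x, tau=1.0, h=1e-4, n_max=400):
    lam = lambda y: tau * rate(y)              # Poisson mean count
    n = np.arange(n_max)
    log_nfact = np.array([lgamma(k + 1) for k in n])
    loglik = lambda y: n * np.log(lam(y)) - lam(y) - log_nfact
    # second derivative of ln P(n|x) by central differences, for every n
    d2 = (loglik(x + h) - 2 * loglik(x) + loglik(x - h)) / h ** 2
    p = np.exp(loglik(x))                      # probabilities P(n|x)
    return -np.sum(p * d2)

def fisher_closed_form(x, tau=1.0, h=1e-6):
    fp = (rate(x + h) - rate(x - h)) / (2 * h)
    return tau * fp ** 2 / rate(x)

# the two evaluations agree closely, e.g. at x = 1
```

Agreement of the two routes is a useful sanity check before trusting the analytic large-$N$ expressions derived later in the paper.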
Fisher information is independent of the estimation mechanism employed and can therefore serve as an objective measure of coding quality of the encoding scheme $P(\mathbf{n} \mid \mathbf{x})$ of the neural population. The formalism of Fisher information is not suited to taking into account prior knowledge about the stimulus, which is known to improve the quality of reconstruction (Kay, 1993). Moreover, there are situations in which biased estimators can perform better than the CRB (Cover & Thomas, 1991). In
the case of a dense, uniform population of neurons considered below, this is not a problem since translational invariance implies that all reasonable estimators—those that respect this invariance—are either equally biased for all stimulus values or unbiased. In the former case, subtracting the bias yields an unbiased estimator with better performance. Another nontrivial question is whether there is an efficient estimator in the first place. For instance, the flat tuning curves generally used in coarse coding schemes (Hinton et al., 1986; Eurich & Schwegler, 1997) do not allow for unbiased estimates. Recently, Bethge, Rotermund, & Pawelzik (2000) showed that efficient estimation can also become impossible if insufficient time is available for decoding. However, the time available for decoding may vary from one system and task to the other, and there is no simple way to calculate how much time efficient estimation takes in a given system. Thus, it is important to know what encoding accuracy can be expected if no restrictions on timing are imposed; this is achieved by calculating Fisher information (Bethge et al., 2000).

2.2 Tuning Functions. Neural responses are conventionally characterized in terms of tuning functions describing the mean firing rate as a function of the stimulus value. In the present framework, the tuning functions are therefore given by $\mathbf{f}(\mathbf{x}) := E[\mathbf{n}]/\tau$. Typically, $f_k(\mathbf{x})$ is a unimodal function centered around some preferred stimulus $\mathbf{c}^{(k)}$ that depends on the neuron $k$. For a more concrete discussion of the encoding properties of the model population, we specify the neuronal tuning characteristics as follows. For neuron $k$, we assume
$$f_k(\mathbf{x}) = F\varphi(\xi^{(k)2}) = F\zeta + F(1-\zeta)\exp(-\xi^{(k)2}/2), \tag{2.3}$$
where $\xi^{(k)2}$ is a rescaled distance between the center of the tuning curve and the actual stimulus, $\xi^{(k)2} := \sum_{i=1}^{D} \xi_i^{(k)2}$, $\xi_i^{(k)} := (x_i - c_i^{(k)})/\sigma_i^{(k)}$, $F$ is the maximum firing rate, and $\sigma_i^{(k)}$ is the neuron's tuning width for feature $i$ (Zhang & Sejnowski, 1999; Wilke & Eurich, 1999; Eurich & Wilke, 2000). The parameter $\zeta$ measures the level of stimulus-unrelated background activity at a given maximum firing rate $F$; it ranges from zero-baseline firing ($\zeta = 0$) to completely stimulus-independent firing ($\zeta = 1$). The tuning curve is depicted in Figure 1. For simplicity, a gaussian tuning curve profile was assumed, but the results do not depend critically on this choice.

[Figure 1 appeared here: mean response (in Hz) versus stimulus (in units of $\sigma$), with additive (flat), proportional, and multiplicative noise bands.]

Figure 1: Example of tuning curve and variability of a model neuron. Mean response (solid) ± standard deviation. The mean follows a gaussian centered at the preferred stimulus with $F = 100$ Hz and a baseline firing level of 10 Hz ($\zeta = 0.1$). The shape of the response variance depends on the noise model. Arrows indicate stimuli for which the neuron's Fisher information is maximal.

In order to assess the redundancy of the neural population code, we introduce the local density of tuning curves in stimulus space, $\eta(\mathbf{x}) := \sum_{k=1}^{N} \delta(\mathbf{c}^{(k)} - \mathbf{x})$. In the following, it will often be assumed that the tuning curves uniformly cover the stimulus space (see section A.2). This case will be referred to as the continuous limit. Nonuniform tuning curve distributions can optimize performance if the stimuli are not evenly distributed (Brunel & Nadal, 1998), but sparse tuning curves lead to unrepresented areas in stimulus space, the effects of which are discussed in section 5 (see also Eurich & Wilke, 2000). Finally, we introduce a distribution of tuning widths, $P_\sigma$, from which the tuning widths of all neurons are assumed to be drawn (Eurich, Wilke, & Schwegler, 2000),
$$P(\sigma_1^{(1)}, \ldots, \sigma_D^{(N)}) = \prod_{k=1}^{N} P_\sigma(\sigma_1^{(k)}, \ldots, \sigma_D^{(k)}). \tag{2.4}$$
This allows for a discussion of the effect of tuning properties of the neural population in section 5. Note that stimulus features for which the neurons are not sensitive ($\sigma_i^{(k)} = \infty$) must be excluded from the analysis.

2.3 Spike Count Statistics. For the calculations to follow, a gaussian probability density will be assumed for the spike count. In this framework, correlations between the neurons of the population are described by the covariance matrix,
$$Q(\mathbf{x}) := E\left[(\mathbf{n} - \tau\mathbf{f}(\mathbf{x}))(\mathbf{n} - \tau\mathbf{f}(\mathbf{x}))^T\right]. \tag{2.5}$$
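For concreteness, the tuning function of equation 2.3 can be transcribed directly. This is our own sketch; the function name is ours, and the parameter defaults follow the values quoted for Figure 1 ($F = 100$ Hz, $\zeta = 0.1$).

```python
import numpy as np

# A direct transcription (ours) of the tuning function, eq. 2.3, for one
# neuron in D dimensions.  Defaults follow Figure 1 (F = 100 Hz,
# baseline level zeta = 0.1).

def tuning_rate(x, c, s, F=100.0, zeta=0.1):
    """f_k(x) = F*zeta + F*(1 - zeta)*exp(-xi2/2), where
    xi2 = sum_i ((x_i - c_i)/s_i)^2 is the rescaled distance."""
    x, c, s = (np.asarray(v, dtype=float) for v in (x, c, s))
    xi2 = np.sum(((x - c) / s) ** 2)
    return F * zeta + F * (1.0 - zeta) * np.exp(-xi2 / 2.0)

# At the preferred stimulus the rate is the maximum F; far away it
# settles at the baseline F*zeta (10 Hz for the Figure 1 values).
```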
For simplicity of notation, we will not write out the $\mathbf{x}$-dependence of $\mathbf{f}$ and $Q$ in the following. Assuming that $\mathbf{n}$ can be treated as a continuous random variable, one has a probability density function
$$p(\mathbf{n} \mid \mathbf{x}) = \frac{1}{\sqrt{(2\pi)^N \det Q}} \exp\left[-\frac{1}{2}(\mathbf{n} - \tau\mathbf{f})^T Q^{-1}(\mathbf{n} - \tau\mathbf{f})\right]. \tag{2.6}$$
This approximation is expected to fail if the firing rate of any of the neurons becomes small and the discrete nature of the spike count variables becomes important. Inserting equation 2.6 into the definition of Fisher information, equation 2.2, one finds the explicit form (Kay, 1993),
$$J_{ij}(\mathbf{x}) = \tau^2 \frac{\partial\mathbf{f}^T}{\partial x_i} Q^{-1} \frac{\partial\mathbf{f}}{\partial x_j} + \frac{1}{2}\operatorname{Tr}\left[Q^{-1}\frac{\partial Q}{\partial x_i} Q^{-1} \frac{\partial Q}{\partial x_j}\right]. \tag{2.7}$$
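Equation 2.7 lends itself to direct numerical evaluation once $\mathbf{f}$ and $Q$ are specified. The sketch below is our own illustration for $D = 1$ (all helper names are ours; derivatives are taken by central differences); in the additive-noise example the covariance term vanishes and $J$ reduces to $\tau^2 \sum_k f_k'(x)^2/a$.

```python
import numpy as np

# Our own numerical sketch of eq. 2.7 for D = 1.  f(x) returns the N
# mean rates, Q(x) the N x N covariance matrix; derivatives are taken
# by central differences.

def fisher_gaussian(f, Q, x, tau=1.0, h=1e-5):
    df = (f(x + h) - f(x - h)) / (2 * h)      # d f / d x
    dQ = (Q(x + h) - Q(x - h)) / (2 * h)      # d Q / d x
    Qinv = np.linalg.inv(Q(x))
    mean_term = tau ** 2 * df @ Qinv @ df
    cov_term = 0.5 * np.trace(Qinv @ dQ @ Qinv @ dQ)
    return mean_term + cov_term

# Additive (flat) noise example: Q = a*I, so the covariance term
# vanishes and J = tau^2 * sum_k f_k'(x)^2 / a.
centers = np.array([-1.0, 0.0, 1.0])
f = lambda x: 100.0 * np.exp(-(x - centers) ** 2 / 2.0)
Q = lambda x: 4.0 * np.eye(len(centers))
J = fisher_gaussian(f, Q, x=0.3)
```

For a stimulus-dependent covariance such as the power-law noise of equation 3.1, the trace term contributes as well; it is the origin of the second term in equation 3.2 below.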
Because of the unspecified covariance matrix $Q$, equation 2.7 is too general for a discussion of the encoding properties of the neural population. Thus, the next task is to specify $Q$ on the basis of empirical findings. This will be performed in two steps. First, independent spiking is assumed, which reduces the task to specifying the diagonal elements of $Q$ (section 3). In section 4, the independence assumption is dropped, and correlations are considered.

3 Effect of Neuronal Noise Models on the Encoding Accuracy
The implications of the variability of neuronal responses have been debated with some controversy (Softky & Koch, 1993; Mainen & Sejnowski, 1995; Reich, Victor, Knight, Ozaki, & Kaplan, 1997; Gur, Beylin, & Snodderly, 1997; Bair & O'Keefe, 1998; Pouget et al., 1999). In this section we focus on the deviations from the mean spike count in individual stimulus presentations. An empirical mean-variance relation involving two free parameters will be used to discuss the effect of different variance models on the encoding accuracy of neural populations.

3.1 Mean-Variance Scaling. Throughout this section, it is assumed that there are no correlations between neurons, rendering the covariance matrix 2.5 diagonal. The diagonal elements $Q_{kk}(\mathbf{x})$ to be specified now are the variances of the spike counts. Several empirical studies (Dean, 1981; Tolhurst, Movshon, & Thompson, 1981; Tolhurst, Movshon, & Dean, 1983; van Kan, Scobey, & Gabor, 1985; Teich & Khanna, 1985; Vogels, Spileers, & Orban, 1989; Gershon, Wiener, Latham, & Richmond, 1998; Lee, Port, Kruse, & Georgopoulos, 1998; Maynard et al., 1999) have yielded the result that the logarithm of the spike count variance is a linear function of the logarithm of the mean spike count. Theoretically, additive (flat) noise (Abbott & Dayan, 1999; Yoon & Sompolinsky, 1999) and multiplicative noise (Abbott &
Dayan, 1999) have been employed next to Poissonian noise to describe neural spiking statistics. Both the empirical results and the theoretical models are captured by the mean-variance relation of fractional power law noise,
$$Q_{kk}(\mathbf{x}) = a f_k(\mathbf{x})^{2\alpha}, \quad k = 1, \ldots, N, \tag{3.1}$$
with two (possibly $\tau$-dependent) parameters $a$ and $\alpha$ yet to be specified. The choice $\alpha = 1/2$ and $a = \tau$ yields the mean-variance relation of the Poisson distribution (proportional noise), and the cases $\alpha = 0$ and $\alpha = 1$ are referred to as additive and multiplicative noise, respectively. Note that equation 3.1 is valid only for large mean spike counts, whereas for small spike counts, the spike count probability must approach proportional noise (Panzeri, Biella, Rolls, Skaggs, & Treves, 1996; Panzeri, Schultz, Treves, & Rolls, 1999; Panzeri, Treves, Schultz, & Rolls, 1999). Although Poissonian spike count statistics are frequently used to describe neural responses, empirical data often show substantial deviations from the Poissonian case, especially at high spike counts (Teich, Johnson, Kumar, & Turcott, 1990; Softky & Koch, 1993; Gershon et al., 1998; Rieke et al., 1997; Lee et al., 1998). Hence, there are aspects of neuronal noise that are not captured by a simple Poissonian spike count distribution. However, there are no clear ideas about what effect this may have on the encoding properties of a neural population. Sections 3.2 and 3.3 will therefore discuss the dependence of Fisher information on the mean-variance relation exponent $\alpha$. It is shown in section A.1 that by inserting equation 3.1 into the formula for the Fisher information in the gaussian noise case, equation 2.7, one finds in the limit of large $N$,
$$J_{ij}(\mathbf{x}) = \frac{\tau^2}{a} F_{ij}(\mathbf{x}; \alpha) + 2\alpha^2 F_{ij}(\mathbf{x}; 1), \tag{3.2}$$
with the abbreviation $F_{ij}(\mathbf{x}; \alpha)$ from equation A.5. In the continuous limit (see section A.2), equation 3.2 can be written as
$$J_{ij}(\mathbf{x}) = \delta_{ij}\,\eta\left\langle\frac{\prod_{l=1}^{D}\sigma_l}{\sigma_i\sigma_j}\right\rangle \frac{4\pi^{D/2}}{\Gamma(1 + \frac{D}{2})}\left[\frac{\tau^2 F^{2-2\alpha}}{a}\, A_\varphi(\alpha, D) + 2\alpha^2 A_\varphi(1, D)\right], \tag{3.3}$$
where $A_\varphi(\beta, D) := \int_0^\infty d\xi\, \xi^{D+1}\, \varphi'(\xi^2)^2\, \varphi(\xi^2)^{-2\beta}$ is an integral that depends on the exponent $\alpha$, the number of features $D$, and, through the shape of the tuning curve $\varphi$, the baseline activity level $\zeta$. The tuning widths enter the Fisher information only through the factor $\langle\sigma_i^{-1}\sigma_j^{-1}\prod_{l=1}^{D}\sigma_l\rangle$, which denotes an expectation value with respect to the tuning width distribution, equation 2.4. Equation 3.3 shows that the Fisher information matrix is diagonal
for the population considered here and that its elements increase linearly with the density of neurons in stimulus space, regardless of all other parameters.

3.2 Encoding Accuracy and Maximum Firing Rate. Generating action potentials requires relatively large amounts of energy (Laughlin, de Ruyter van Steveninck, & Anderson, 1998), and it has been proposed that this energy constitutes a major constraint on the neural code (Levy & Baxter, 1996). Thus, it is an interesting question how sensory resolution depends on the maximum firing rate. The $F$-dependence of Fisher information for the model population studied here is directly obtained from equation 3.3. Since the term $2\alpha^2 A_\varphi(1, D)$ can be neglected for high maximum spike counts, $\tau^2 F^{2-2\alpha} \gg 2a$, Fisher information exhibits the simple scaling behavior $J_{ij} \propto F^{2-2\alpha}$. Hence, the mean-variance relation, characterized by the exponent $\alpha$, has a major impact on the $F$-dependence of the encoding accuracy. The scaling law $J_{ij} \propto F^{2-2\alpha}$ yields a clear advantage for smaller values of $\alpha$. While for additive noise, Fisher information increases quadratically with the maximum firing rate, multiplicative noise results in constant Fisher information—a fixed minimal encoding error that cannot be improved by higher firing frequencies. These results are independent of the number of encoded features, $D$, and of the level of background activity, $\zeta$, as long as the latter is nonzero. The reason for the great increase of Fisher information with $F$ for small $\alpha$ lies in the mean-variance relation shown in Figure 1. At fixed $\tau$, a constant variance for all firing rates loses relative importance if the mean firing rate increases. Hence, increasing $F$ yields better performance. In the multiplicative noise case, however, any increase of firing rate is accompanied by an overproportional increase of variance, which compensates for the higher number of counted spikes.
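The scaling law can be checked numerically from the leading single-neuron Fisher-information term under the power-law noise of equation 3.1. This is our own sketch; all names and parameter values are hypothetical.

```python
import numpy as np

# Our own check of the scaling J ∝ F^(2 - 2*alpha).  For power-law
# noise, eq. 3.1, the leading single-neuron Fisher-information term is
# tau^2 f'(x)^2 / (a f(x)^(2*alpha)); doubling F should change it by
# the factor 2^(2 - 2*alpha).

def leading_term(F, alpha, x=1.0, tau=1.0, a=1.0, h=1e-6):
    f = lambda y: F * np.exp(-y ** 2 / 2.0)   # zero-baseline tuning curve
    fp = (f(x + h) - f(x - h)) / (2 * h)
    return tau ** 2 * fp ** 2 / (a * f(x) ** (2 * alpha))

ratios = {alpha: leading_term(2.0, alpha) / leading_term(1.0, alpha)
          for alpha in (0.0, 0.5, 1.0)}
# additive (alpha = 0): factor 4; Poissonian (alpha = 1/2): factor 2;
# multiplicative (alpha = 1): factor 1, i.e., no gain from higher rates
```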
How Fisher information changes as a function of observation time $\tau$ remains an open question, since there are no empirical results on the $\tau$-dependence of the parameters $a$ and $\alpha$ in the mean-variance relation, equation 3.1.

3.3 Encoding Accuracy and Baseline Activity. A desirable property for a neural population code is that the representational capacity not be overly sensitive to stimulus-unrelated activity, which naturally arises due to intrinsic noise (White, Rubinstein, & Kay, 2000) or ongoing network dynamics. In the following, it will be shown that the impact of background activity critically depends on $\alpha$, and thus on the mean-variance relation. The analysis of the $\zeta$-dependence in the current model is complicated by a problem with the gaussian approximation, equation 2.6, in the zero-baseline firing limit $\zeta \to 0$. Neurons that are not activated by the stimulus will cease to fire in this case, so that one leaves the range of validity of equation 2.6. In fact, for $\zeta = 0$ and gaussian tuning curves, one has $A_\varphi(\beta, D) = \Gamma(1 + \frac{D}{2})/[8(1-\beta)^{1+D/2}]$, which obviously diverges for $\beta = 1$.
[Figure 2 appeared here: normalized Fisher information $J/J(\zeta = 0)$ versus baseline activity level $\zeta$, for additive noise and for Poissonian noise with $D$ = 1, 2, 3, 10.]

Figure 2: Normalized Fisher information as a function of baseline firing level $\zeta$ for additive gaussian noise (dashed) and Poissonian noise (solid) for different numbers of encoded features. The additive noise result, equation 3.4, does not depend on $D$, while encoding accuracy rapidly decreases with $\zeta$ in the Poissonian noise case for higher $D$.
Hence, unless $\alpha = 0$, the second term $\propto \alpha^2 A_\varphi(1, D)$ in the approximation 3.3 goes to infinity as the noiseless limit $\zeta = 0$ is approached. This result is related to the fact that the gaussian approximation, equation 2.6, for the spike count statistics is invalid if there is no baseline firing activity. The problem is absent if additive ($\alpha = 0$) gaussian noise is assumed. In this case, equation 3.3 becomes
$$J_{ij}(\mathbf{x}) = \delta_{ij}\,\eta\left\langle\frac{\prod_{l=1}^{D}\sigma_l}{\sigma_i\sigma_j}\right\rangle \frac{(\tau F)^2\, \pi^{D/2}\, (1-\zeta)^2}{2a}, \tag{3.4}$$
as shown in section A.2. Therefore, there are only two situations that can be analyzed analytically: equation 3.4 for additive gaussian spike count noise and the case of a Poissonian spike count distribution (Wilke & Eurich, 2001). These two cases are compared with respect to their robustness to background noise in Figure 2. As expected, Fisher information is a monotonically decreasing function of the relative level of baseline activity $\zeta$. The plot shows that the susceptibility to background noise is qualitatively different in the two models. While the decrease does not depend on $D$ at all in the additive noise case, it becomes progressively stronger in the Poissonian
noise case as the number of encoded features increases. Thus, a population with additive gaussian spike count noise is relatively robust with respect to baseline firing, while a Poissonian spike count distribution implies a coding accuracy that quickly declines with $\zeta$, especially if many features are to be encoded. The observation that higher values of $\alpha$ lead to higher sensitivity to background noise is confirmed by numerical calculations on populations of neurons. Using Poissonian behavior for spike counts lower than a certain threshold to avoid the divergence problems for $\alpha > 0$, it turns out that additive noise, $\alpha = 0$, is the only case in which the decrease of Fisher information does not depend on $D$, while for $\alpha > 0$, the susceptibility to background noise at higher $D$ gradually increases with $\alpha$ (data not shown). This result can be understood intuitively as follows. Fisher information for single neurons is equal to the square of the derivative of the tuning function divided by the variance (Seung & Sompolinsky, 1993). This implies that the neurons with maximum Fisher information are at distances from the stimulus that increase with $\alpha$ (see the arrows in Figure 1). On the other hand, neurons activated by a stimulus far away from their tuning curve center are most vulnerable to background noise. Their relative loss of Fisher information, $(dJ/d\zeta)/J$, with increasing $\zeta$ is larger than for neurons closer to the tuning curve center. This explains why Poissonian noise (corresponding to $\alpha = 1/2$) is more affected by baseline firing than additive noise ($\alpha = 0$). The $D$-dependence in Figure 2 can be understood from the following argument. As the dimension of the stimulus space, $D$, increases, a receptive field defined by the tuning curve in this space has an increasing percentage of its volume close to its surface (i.e., far away from its center). This mathematical phenomenon is also known from statistical physics (Callen, 1985).
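The geometric statement is easy to verify directly; the following one-line check is our own illustration.

```python
# Our own check of the geometric claim: the fraction of a D-dimensional
# ball's volume that lies in the outer 10% of its radius is
# 1 - 0.9**D, which approaches 1 rapidly as D grows.
shell_fraction = lambda D, eps=0.1: 1.0 - (1.0 - eps) ** D
# shell_fraction(1) = 0.1, while shell_fraction(10) is already about 0.65
```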
It implies that for large $D$, a stimulus has a marginal position in most receptive fields, which results in a mean spike count that is small compared to $\tau F$ and therefore most vulnerable to background noise. This is especially disadvantageous for large $\alpha$ and Poissonian noise, where most of the Fisher information is carried by neurons far away from the tuning curve center. Thus, the higher $D$, the more neurons with small mean spike count contribute to the Fisher information, making the code more and more susceptible to baseline firing. In addition to the exponentially growing number of neurons required to cover a high-dimensional stimulus space, this could be another reason that neurons in biological systems are usually specialized in only a few stimulus features.

4 Effect of Noise Correlations on the Encoding Accuracy
Correlations in neural systems include noise correlations and signal correlations. While signal correlations, which are equivalent to tuning curve overlap, are captured by the parameter $\eta$ in the present framework, noise correlations (correlated variability) can be quantified by the nondiagonal elements of the covariance matrix $Q$ or, equivalently, by the normalized
(Pearson) correlation coefficient defined by $q_{kl}(\mathbf{x}) := Q_{kl}(\mathbf{x})/\sqrt{Q_{kk}(\mathbf{x})\,Q_{ll}(\mathbf{x})}$ for $k, l = 1, \ldots, N$. Several studies have found noise correlations in a wide variety of neural systems—for example, in monkey areas IT (Gawne & Richmond, 1993), MT (Zohary, Shadlen, & Newsome, 1994), motor cortex (Lee et al., 1998; Hatsopoulos, Ojakangas, Paninski, & Donoghue, 1998; Maynard et al., 1999), and parietal areas (Lee et al., 1998), as well as in cat visual cortex (Ghose, Ohzawa, & Freeman, 1994; DeAngelis, Ghose, Ohzawa, & Freeman, 1999) and LGN (Dan, Alonso, Usrey, & Reid, 1998). While some experiments have directly or indirectly shown that correlations carry stimulus information, the origin, the precise functional role, and the readout mechanisms (e.g., Feng & Brown, 2000) of these correlations are unclear. As a further complication, they certainly depend on the species and the brain area, but may also depend on the state of the animal (Cardoso de Oliveira, Thiele, & Hoffmann, 1997; Steinmetz et al., 2000). Theoretically, this situation is particularly unsatisfying, since the exact type of correlation assumed greatly influences the theoretical predictions. In consequence, there has been a controversial discussion of the role that correlated variability may play in population coding. If the readout consists of a simple averaging over the population output, positive correlations pose severe limits on the encoding accuracy. It was therefore suggested that correlations in neural systems are harmful in general (Zohary et al., 1994; Shadlen & Newsome, 1998). However, the situation is more complicated if a distributed population code is considered, that is, a code in which the neurons have different preferred stimuli. In this case, the effect of correlations critically depends on the structure of the covariance matrix.
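A small numerical experiment makes this dependence concrete. The sketch below is our own construction, using the mean term of equation 2.7 for $D = 1$: for a symmetric population, adding uniform positive correlations of strength $q$ raises the Fisher information, here by the factor $1/(1-q)$.

```python
import numpy as np

# Our own illustration: Fisher information (mean term of eq. 2.7,
# D = 1) for independent neurons, Q = a*I, versus uniformly and
# positively correlated neurons, Q = a*[(1 - q)*I + q*11^T].

centers = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # symmetric population
a, q, tau, x = 4.0, 0.5, 1.0, 0.0

def df(x):
    """Derivatives f_k'(x) of gaussian tuning curves with peak rate 100."""
    return -100.0 * (x - centers) * np.exp(-(x - centers) ** 2 / 2.0)

N = len(centers)
Q_indep = a * np.eye(N)
Q_unif = a * ((1 - q) * np.eye(N) + q * np.ones((N, N)))

J_indep = tau ** 2 * df(x) @ np.linalg.inv(Q_indep) @ df(x)
J_unif = tau ** 2 * df(x) @ np.linalg.inv(Q_unif) @ df(x)
# Here sum_k f_k'(x) = 0 by symmetry, so J_unif / J_indep equals the
# factor 1/(1 - q) = 2 that also appears in eq. 4.2 below.
```

Intuitively, the common noise component can be subtracted out when the correlations are strong, which is the argument attributed to Abbott and Dayan (1999) in the text.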
For example, uniform positive correlations across the population increase the achievable encoding accuracy (Abbott & Dayan, 1999; Zhang & Sejnowski, 1999), while correlations of limited range can have the opposite effect (Johnson, 1980; Snippe & Koenderink, 1992b; Abbott & Dayan, 1999; Yoon & Sompolinsky, 1999). In the following, we attempt to clarify this issue by using a covariance matrix that incorporates both uniform and limited-range noise correlations and therefore interpolates between the cases discussed in the literature.

4.1 Model of Noise Correlations. Uniform and limited-range noise correlations have been shown empirically to coexist. For example, a recent study of the primary motor cortex of primates reports both fixed-strength correlations and correlations that depend on the similarity of the preferred stimuli (Maynard et al., 1999). Thus, we choose the covariance matrix
$$Q_{kl}(\mathbf{x}) = \left\{\delta_{kl} + (1 - \delta_{kl})\left[q + b\exp\left(-\frac{|\mathbf{c}^{(k)} - \mathbf{c}^{(l)}|}{L}\right)\right]\right\} \times a\, f_k(\mathbf{x})^{\alpha} f_l(\mathbf{x})^{\alpha}, \tag{4.1}$$
for $k, l = 1, \ldots, N$, where the part in curly brackets is the normalized correlation coefficient $q_{kl}(\mathbf{x})$ defined above. The parameters $0 \leq q < 1$ and $b$ govern
Stefan D. Wilke and Christian W. Eurich
the strengths of uniform and limited-range correlations in the population, and $L$ determines the length scale in stimulus space over which neurons are correlated. The covariance matrix, equation 4.1, interpolates between the additive ($\alpha = 0$, $b = 0$) and multiplicative ($\alpha = 1$, $b = 0$) noise models considered by Abbott and Dayan (1999) and also incorporates the limited-range correlations treated by Snippe and Koenderink (1992b) and Abbott and Dayan (1999) as a special case, if equidistant tuning curves in $D = 1$ dimensions are assumed. For $b = 0$, it can be viewed as a special case of equations 4.2 and 4.3 of Zhang and Sejnowski (1999).

4.2 Uniform Correlations. In two special cases, equation 4.1 allows for an analytical solution for the Fisher information. First, consider a uniform correlation strength $q \ge 0$ without limited-range contributions ($b = 0$), so that each neuron is positively correlated with all others. As shown in section A.1, one finds, in the large-$N$ limit,
$$J_{ij}(\mathbf{x}) = \frac{\tau^2}{a(1-q)} \left[ F_{ij}(\mathbf{x};\alpha) - G_{ij}(\mathbf{x};\alpha) \right] + \frac{\alpha^2}{1-q} \left[ (2-q) F_{ij}(\mathbf{x};1) - q\, G_{ij}(\mathbf{x};1) \right] \quad (4.2)$$
in this case, with the abbreviations $F_{ij}$ and $G_{ij}$ from equations A.5 and A.6. In the continuous case, where $G_{ij} = 0$, the only correction for $q > 0$ (as compared to $q = 0$) consists of two monotonically increasing factors, $1/(1-q)$ and $(2-q)/(1-q)$. Thus, uniform positive correlations always increase the coding accuracy regardless of the values of the other system parameters. Intuitively, this result can be understood by considering the possibility of subtracting the noise if the correlations are strong (Abbott & Dayan, 1999). Hence, for uniform correlation strength, the mean-variance exponent $\alpha$ does not lead to new effects as compared to independent neurons. This is not the case for limited-range correlations, as we will show.

4.3 Limited-Range Correlations. Pure limited-range correlations, obtained by setting $q = 0$ and $b = 1$ in equation 4.1, may also be treated analytically if $D = 1$ and equidistant tuning curves of density $g$ are assumed. In this case, one finds for the Fisher information (see section A.3)
$$J_{11}(x) = \frac{\tau^2}{a} \left[ \frac{1-r}{1+r} F_{11}(x;\alpha) + \frac{r}{1-r^2} H(x;\alpha) \right] + \alpha^2 \left[ 2 F_{11}(x;1) + \frac{r^2}{1-r^2} H(x;1) \right], \quad (4.3)$$
where $H(x;\alpha)$ is defined in equation A.16, and $r := \exp[-1/(Lg)]$. In the case of additive noise ($\alpha = 0$), the second term in equation 4.3 vanishes,
Representational Accuracy of Stochastic Neural Populations
and equation 4.10 of Abbott and Dayan (1999) is recovered. Regardless of $\alpha$, their conclusion that limited-range correlations decrease the Fisher information remains valid. For large $N$, the $F_{11}$-terms of equation 4.3 dominate. Because of the factor $1-r$, Fisher information decreases with increasing $r$ and, equivalently, with increasing correlation length $L$. This behavior is reversed for large $L$, where the $H$-terms $\propto 1/(1-r^2)$ constitute the dominating contribution to the Fisher information. Apart from this decrease with $L$, limited-range correlations of the type described by equation 4.1 can have another negative effect on the encoding properties of a population. For uniform correlations, Fisher information increased linearly with the number of neurons (see equations 3.2 and 4.2). Thus, an increase in the number of neurons or, equivalently, an increase in the density of tuning curves $g$, resulted in a proportional increase of Fisher information. In equation 4.3, increasing $g$ implies $r = \exp[-1/(Lg)] \approx 1 - 1/(Lg)$ at $g \gg L^{-1}$, which leads to the asymptotic form of equation 4.3:

$$J_{11}(x) = \frac{\tau^2}{a} \left[ \frac{1}{2Lg} F_{11}(x;\alpha) + \frac{Lg}{2} H(x;\alpha) \right] + \alpha^2 \left[ 2 F_{11}(x;1) + \frac{Lg}{2} H(x;1) \right]. \quad (4.4)$$

The fact that $F_{11} \propto g$ and $H \propto g^{-1}$ in this limit leads to the following conclusions: For $\alpha = 0$, Fisher information fails to scale linearly with $g$ but saturates at a finite, $g$-independent limiting value (Yoon & Sompolinsky, 1999; Abbott & Dayan, 1999). However, the $F_{11}(x;1)$-term increases linearly with $g$. Thus, the statement that $J_{11}$ becomes independent of $g$ for large populations is correct only for additive noise, $\alpha = 0$ (see Figure 3). Hence, for noise models other than additive noise, Fisher information increases linearly with the number of neurons, even in the presence of limited-range correlations.
Since the empirically found values for $\alpha$ are in the vicinity of $1/2$, this result casts doubt on the conclusion that limited-range correlations generally impose an upper bound on the encoding accuracy for large populations. However, the increase of Fisher information associated with the $F_{11}(x;1)$-term in equation 4.4 may be hard to exploit in a biologically plausible network (Deneve, Latham, & Pouget, 1999).

4.4 Interaction of Uniform and Limited-Range Correlations. A new issue that can be studied with the general covariance matrix, equation 4.1, concerns the interactions between limited-range and uniform correlations. A qualitative prediction in this context was given by Oram, Földiák, Perrett, and Sengpiel (1998). They argued that for neurons with positive signal correlation, that is, overlapping tuning curves, negative noise correlation should be favorable because this would allow for a better averaging. At the same time, neurons with negative signal correlation should be positively correlated to avoid reconstruction errors. Hence, a situation that should be
Figure 3: Effect of limited-range correlations on the encoding accuracy of a neural population. The plot shows Fisher information $J/J(N = 100)$ as a function of the number of neurons $N$ for different $\alpha$: additive noise ($\alpha = 0$), Poissonian noise ($\alpha = 1/2$), and multiplicative noise ($\alpha = 1$) ($g = N/(10\sigma)$, $D = 1$, $L = \sigma$, $\tau F = 100$, $f = 0.1$). The population is limited to about 100 degrees of freedom, that is, $\lim_{N\to\infty} J(N) \approx J(N = 100)$, only in the additive noise case.
especially disadvantageous is the simultaneous presence of positive uniform and limited-range correlations, yielding high noise correlations for the neurons with positive signal correlation. On the other hand, negative limited-range correlations paired with positive uniform correlations should increase the encoding accuracy of the population. To test this prediction, the population Fisher information was calculated for a fixed positive uniform correlation level ($q = 0.5$) as a function of sign and strength of an additional limited-range correlation contribution, as specified by the parameter $b$. The result is shown in the inset of Figure 4. For positive short-range correlations, the encoding becomes worse, while it improves for negative short-range correlations. Analytically, this effect can be illustrated by the expansion of Fisher information for small $b$ and $r$ (the latter measuring the correlation range) carried out in section A.4. In leading order, one finds

$$J_{11}(x) = J_{11}^0(x) - \frac{2br}{(1-q)^2} \left\{ \frac{\tau^2}{a} \left[ F_{11}(x;\alpha) - G_{11}(x;\alpha) \right] + \alpha^2 q \left[ F_{11}(x;1) - G_{11}(x;1) \right] \right\}, \quad (4.5)$$
Figure 4: Uniform correlation strength $q$ required to compensate the loss of Fisher information due to limited-range correlations as a function of $r = \exp(-1/(Lg))$. Parameters were $\alpha = 1/2$, $g = N/(10\sigma)$, $\tau F = 100$, $f = 0.1$, $D = 1$, $b = 1 - q$ ($N = 50$ and $N = 100$ shown). Inset: Fisher information $J/J(b = 0)$ for a population with both uniform and limited-range correlations as a function of the coefficient of the limited-range contribution, $b$, for fixed $q = 1/2$. For $b > 0$, limited-range correlations add to the uniform part, while for $b < 0$, limited-range correlations are negative and weaken the correlations locally. Parameters were $N = 100$, $g = 10/\sigma$, $D = 1$, $\alpha = 1/2$, $L = 0.03\sigma$, $\tau F = 100$, $f = 0.1$.
where $J_{11}^0$ is the Fisher information for purely uniform correlations, $b = 0$. Since $F_{11} \ge G_{11}$ by the Cauchy-Schwarz inequality, the term in curly brackets is always positive. Hence, positive limited-range correlations decrease the coding accuracy, whereas negative limited-range correlations increase the coding accuracy, in agreement with Oram et al. (1998). While the expansion formula, equation 4.5, is valid only for $r \ll 1$, the trend that uniform and limited-range correlations counteract continues to hold for larger $r$. Figure 4 demonstrates this result. The plot shows the level $q$ of uniform correlations that is necessary to compensate the negative effect of limited-range correlations of given range $r$. This quantity was determined by choosing a fixed $r$ and increasing the level of uniform correlations $q$, while simultaneously decreasing the strength of limited-range correlations ($b = 1 - q$), until the Fisher information reached the value obtained in the uncorrelated case ($q = 0$, $b = 0$). The initial increase with $r$ reflects the growing loss of Fisher information with $r$ that has to be compensated.
Eventually limited-range correlations also increase the Fisher information, since the case of uniform correlations is approached, so that the curve approaches zero again for large $r$. The right part of the plot depends on $N$, since limited-range correlations include terms that do not scale linearly with the number of neurons (see equation 4.3).

4.5 Uniform Positive Correlations with Jitter. An especially conspicuous feature of experimentally obtained correlation coefficients is their large and seemingly random variability. For example, Maynard et al. (1999) find correlation coefficients ranging from $-0.9$ to $0.9$ in a population of MI neurons, and Zohary et al. (1994) report a similarly large span of values in monkey MT. Despite the fact that this large variability appears to be a general finding in studies of correlations in neural systems, there have been no attempts to relate it to the population's encoding performance. In the following, the effect of nonuniform correlations is studied by using a covariance matrix with entries that contain a stochastic component. It will turn out that, on average, jitter in the correlation coefficient increases the Fisher information. The idea that random covariance matrices could lead to a better encoding performance is based on equation 2.7, which shows that the smallest eigenvalues of $Q$ yield the largest contribution to Fisher information. As the smallest eigenvalue, $\lambda_{\min}$, approaches zero, $J_{ij}(\mathbf{x})$ increases overproportionally and obviously diverges at $\lambda_{\min} = 0$. Random matrices have eigenvalues that are distributed around the eigenvalues of the corresponding "mean" matrix. Thus, it is expected that eigenvalues smaller than those of the corresponding covariance matrix without jitter occur and lead to an increase of average Fisher information. To illustrate this argument, we consider a population in which the correlations between pairs of neurons are normally distributed with mean $q$. The corresponding covariance matrix reads
$$Q_{ij}(\mathbf{x}) = q_{ij} \, a f_i(\mathbf{x})^\alpha f_j(\mathbf{x})^\alpha, \quad i, j = 1, \ldots, N, \quad (4.6)$$
where $q_{ii} = 1$ for $i = 1, \ldots, N$. The nondiagonal elements $q_{ij}$ for $i < j$ are drawn from a gaussian distribution with mean $q$ and standard deviation $s$, while the symmetry of $Q_{ij}$ requires $q_{ji} = q_{ij}$. For small $s$, the covariance matrix, equation 4.6, can be regarded as the uniform-correlation case from section 4.2 (see equation 4.1, with $b = 0$) with a small perturbation of order $s^2$. Expanding the inverse of the covariance matrix in terms of $s^2$ and neglecting terms of higher order leads to an expansion formula for the Fisher information derived in section A.5,

$$J_{ij}(\mathbf{x}) = J_{ij}^0(\mathbf{x}) + \frac{s^2 N}{(1-q)^3} \left\{ \frac{\tau^2}{a} \left[ F_{ij}(\mathbf{x};\alpha) - G_{ij}(\mathbf{x};\alpha) \right] + \alpha^2 \left[ F_{ij}(\mathbf{x};1) - G_{ij}(\mathbf{x};1) \right] \right\}. \quad (4.7)$$
Figure 5: Variability in the correlation strength increases coding accuracy. Mean Fisher information $J/J(s = 0)$ (solid) and analytical formula 4.7 (dashed) as a function of the standard deviation (SD) $s$ of the correlation coefficient distribution. Parameters: $N = 50$, $g = 5/\sigma$, $D = 1$, $\alpha = 1/2$, $q = 1/2$, $\tau F = 100$, $f = 0.1$, 10,000 samples.
For $D = 1$, the correction term is always positive. Hence, for a given mean correlation coefficient $q$, a population of neurons with noisy correlation coefficients performs better on average than one with uniform correlations. This effect is demonstrated by a numerical simulation in Figure 5, where the mean Fisher information of 10,000 covariance matrices is plotted as a function of the standard deviation of the correlation coefficient. Figure 5 also shows that the range of validity of the expansion 4.7 is quite small. This is due to the $N$-dependence of the smallest eigenvalue, which is given by $\lambda_{\min} = 1 - q - 2\sqrt{N}s$ in this case (see appendix A.5). For $N = 50$ and $q = 0.5$, this implies that the mean Fisher information must diverge around $s_{\max} = (1-q)/\sqrt{4N} \approx 0.035$, explaining why the quadratic approximation, equation 4.7, becomes invalid already around $s \approx 0.015$. The numerical calculations indicate that the increase in Fisher information is even more pronounced than predicted by the expansion formula. The result that correlations with jitter increase the Fisher information can be interpreted as follows. A zero eigenvalue in the covariance matrix indicates the existence of a linear combination of single-neuron activities that is completely noiseless. This situation of course cannot be achieved in realistic systems. Thus, there must be limits to the choice of the covariance matrix in biological systems. However, the form of these restrictions remains unclear and has not been subject to empirical investigations yet.
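The jitter effect can be reproduced with a short Monte Carlo experiment. The following sketch is not the authors' code and simplifies the model: it keeps only the quadratic-form part of the Fisher information and uses an arbitrary derivative vector. Parameters are chosen loosely after Figure 5, with the jitter level kept small enough that all sampled matrices stay positive definite.

```python
import numpy as np

def fisher_quadratic_form(R, b):
    """Quadratic-form part of the Fisher information, b' R^{-1} b."""
    return b @ np.linalg.solve(R, b)

def mean_fisher_with_jitter(N=50, q=0.5, s=0.02, samples=2000, seed=0):
    """Mean Fisher information with jittered correlations, relative to
    the uniform-correlation value (illustrative parameter choices)."""
    rng = np.random.default_rng(seed)
    b = np.sin(2 * np.pi * np.arange(N) / N)    # stand-in for the derivative vector
    R0 = np.full((N, N), q)                     # uniform correlations ...
    np.fill_diagonal(R0, 1.0)                   # ... with unit diagonal
    J0 = fisher_quadratic_form(R0, b)
    Js = []
    for _ in range(samples):
        E = rng.normal(0.0, s, size=(N, N))
        E = np.triu(E, 1)
        E = E + E.T                             # symmetric jitter, zero diagonal
        Js.append(fisher_quadratic_form(R0 + E, b))
    return np.mean(Js) / J0

ratio = mean_fisher_with_jitter()
```

On average, the jittered matrices yield a Fisher information a few percent above the uniform-correlation baseline, in line with the positive correction term of equation 4.7.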
This lack of knowledge makes a systematic optimization of the covariance matrix by theoretical analysis impossible. One may therefore view the introduction of jitter in the correlation coefficients as a means of randomly exploring the space of covariance matrices, which, according to the above, yields better average encoding accuracy as compared to the case of uniform correlation strength. The range of validity of this argument is limited by the fact that the matrices eventually leave the admissible set. If there is a biological cost associated with $q$, for example, if the correlations are attained through (expensive) lateral connections, equation 4.7 states that random correlations lead to a better encoding performance than uniform correlations at a given biological cost. From this point of view, it appears natural for neural populations to introduce variability in the correlations between neurons. Theoretically, this result shows that uniform correlations are disadvantageous in an estimation-theoretical sense: the existence of noise in the correlation coefficients is likely to improve the representational capacity of the population.

5 Effect of Tuning Widths on the Encoding Accuracy
In the analysis of neural coding strategies, one major question has always been whether narrow tuning or broad tuning is advantageous for the representation of a set of stimulus features. Both cases are encountered empirically. For example, in the human retina, small receptive fields of diameters less than 1 degree visual angle have been found (Kuffler, 1953), while the early visual system of tongue-projecting salamanders contains neurons with receptive field diameters of up to 180 degrees (Wiggers, Roth, Eurich, & Straub, 1995). Further examples for broadly tuned neurons can be found in the auditory system of the barn owl (Knudsen & Konishi, 1978) and in the motor cortex of primates (Georgopoulos, Schwartz, & Kettner, 1986). On the theoretical side, the situation is also inconclusive. Arguments have been put forward in support of small (Lettvin, Maturana, McCulloch, & Pitts, 1959; Barlow, 1972) as well as large (Hinton et al., 1986; Baldi & Heiligenberg, 1988; Snippe & Koenderink, 1992a; Eurich & Schwegler, 1997) receptive fields. It turned out that the number of encoded stimulus features, $D$, plays an important role in this context. For populations of binary neurons, large receptive fields are advantageous for $D > 1$ (Hinton et al., 1986; Eurich & Schwegler, 1997). Stochastic neurons with graded responses were found to yield optimal performance with small receptive fields for $D = 1$, and with large receptive fields for $D > 2$ (Zhang & Sejnowski, 1999). However, all previous studies have considered only the case that all neurons have identical receptive field shapes, with the same tuning width for all stimulus features. The situation may be different for neural populations that do not exhibit these properties. Equation 3.3 demonstrates that for the model population considered in this article, Fisher information depends
on the tuning widths $\sigma_i$, $i = 1, \ldots, D$, via the simple scaling rule $J_{ii} \propto \langle \sigma_i^{-2} \prod_{l=1}^{D} \sigma_l \rangle$, where $\langle \ldots \rangle$ denotes the average over the tuning width distribution $P_\sigma$ defined in equation 2.4. Thus, in the current context, the analysis of more complex strategies of tuning widths as specified by $P_\sigma$ can be carried out separately from the other parameters. For instance, the conclusions drawn in the following are independent of the level of background noise $f$ and the maximum firing rate $F$ and are also unaffected by the existence of noise correlations (of any type) within the population. While arbitrary $P_\sigma$ could be considered in principle to optimize the population's representational capacity, there is no clear picture on the biological restrictions on this function. Thus, the following section deals with three special cases that illustrate the expected effects.

5.1 Identical Tuning Widths Across the Population. A particularly simple choice for the tuning widths is to choose one width $\bar{\sigma}_i$ for each of the $D$ dimensions, that is,
$$P_\sigma(\sigma_1, \ldots, \sigma_D) = \prod_{i=1}^{D} \delta(\sigma_i - \bar{\sigma}_i). \quad (5.1)$$
This choice is schematically shown in Figure 6b for $D = 2$. In this case, one finds for the average population Fisher information

$$J_{ii} \propto \frac{\prod_{l=1}^{D} \bar{\sigma}_l}{\bar{\sigma}_i^2}. \quad (5.2)$$
From equation 5.2, it becomes immediately clear that the representational accuracy of the $i$th stimulus feature depends on $i$; the accuracies for the stimulus features need not be equal. Thus, the population can be "specialized" by choosing the tuning widths $\bar{\sigma}_i$ such that a particularly high accuracy for only a subset of the represented features results (Eurich & Wilke, 2000). Note that equation 5.2 includes the situation of identical tuning widths for all features as a special case (see Figure 6a). For $\bar{\sigma}_1 = \cdots = \bar{\sigma}_D = \bar{\sigma}$, one has the scaling behavior $J_{ii} \propto \bar{\sigma}^{D-2}$, a result already obtained by Zhang and Sejnowski (1999). Thus, in this case, narrow tuning curves are favorable for $D = 1$, and broad tuning curves improve the encoding accuracy for $D > 2$. For $D = 2$, Fisher information does not depend on the tuning width at all. An important limitation to this simple behavior exists for extremely narrow tuning. Assuming a fixed number of neurons, the tuning curves will eventually become too narrow to cover the stimulus space without gaps as $\bar{\sigma} \to 0$. The unrepresented areas appearing in the stimulus space are disastrous for the stimulus-averaged estimation error. Thus, the $\bar{\sigma}^{D-2}$ scaling behavior of Fisher information will, for $\bar{\sigma} \to 0$, eventually cross over
Figure 6: Visualization of tuning width distributions with (c, d) and without (a, b) intrapopulation variability for the case $D = 2$. (a) Uniform tuning width $\bar{\sigma}$ for all features. (b) Separate tuning width $\bar{\sigma}_i$ for feature $i$, $\bar{\sigma}_1 \neq \bar{\sigma}_2$. (c) Same as a, but $\bar{\sigma}$ may vary from neuron to neuron. Here, $\bar{\sigma}$ is uniformly distributed within a range $b$. (d) Tuning widths uniformly distributed in a (hyper-)cuboid of edge lengths $b_1, \ldots, b_D$ around the mean $\bar{\sigma}_1, \ldots, \bar{\sigma}_D$.
into a sharp decrease. This implies the existence of an optimal value for $\bar{\sigma}$ in the case $D = 1$ and, by the same argument, of an optimal tuning width $\bar{\sigma}_i$ if only this tuning width is varied and $D > 1$ (Eurich & Wilke, 2000). Note that there are no optimal tuning widths for $D \ge 2$ when all widths are scaled together, where Fisher information is monotonic for all values of $\bar{\sigma}$.

5.2 Tuning Width Variability Across the Population. Experimentally measured tuning widths often exhibit a large variability across the population (see, e.g., Wiggers et al., 1995). In order to study the encoding with nonuniform tuning widths as compared to identical tuning widths as in the previous cases, the distribution
$$P_\sigma(\sigma_1, \ldots, \sigma_D) = \prod_{i=1}^{D} \frac{1}{b_i} \, H\!\left(\sigma_i - \left(\bar{\sigma}_i - \frac{b_i}{2}\right)\right) H\!\left(\left(\bar{\sigma}_i + \frac{b_i}{2}\right) - \sigma_i\right) \quad (5.3)$$
is considered, where $H$ denotes the Heaviside step function. Equation 5.3 describes a uniform distribution in a $D$-dimensional cuboid of size $b_1, \ldots, b_D$ around the mean $(\bar{\sigma}_1, \ldots, \bar{\sigma}_D)$ (see Figure 6d for $D = 2$). Up to third order in $b_i$, the Fisher information for a population with this kind of variability in the tuning widths is given by the formula in Figure 6d (Eurich et al., 2000). Comparing this result to the encoding with the mean receptive field
sizes given by equation 5.2, one finds that variability in the tuning width $b_i$ increases the encoding accuracy for feature $i$, while leaving the Fisher information of the other features unaffected. This result can also be understood by noting that $J_{ii}$ is linear in $\sigma_j$ (for $j \neq i$) but a convex function of $\sigma_i$. Consequently, for feature $i$, one gains more from the smaller tuning widths in the $\sigma_i$-distribution than one loses from the larger ones, implying an increase of Fisher information $J_{ii}$ with $b_i$. The two effects cancel each other for the other tuning widths, so that $J_{ii}$ does not depend on $b_j$ ($j \neq i$). In contrast to equation 5.3, Zhang and Sejnowski (1999) consider the situation of a correlated variability of tuning widths. In their model, the tuning curves are always assumed to be radially symmetric—the tuning widths in all dimensions are identical ($\sigma_i = \bar{\sigma}$; see Figure 6c). For a distribution of tuning widths with this restriction, one finds an average population Fisher information $\propto \langle \bar{\sigma}^{D-2} \rangle$, so that variability in $\bar{\sigma}$ does not improve the encoding for $D = 2$ or $D = 3$. In summary, at a given mean tuning width, a neural population with variability in the tuning widths yields a more precise representation than a population with uniform tuning widths. This statement holds for an arbitrary number of encoded features $D$. It fails, however, at the optimal tuning width introduced in the previous section. Here, any deviations from the optimal value reduce the encoding accuracy. However, we argue that biologically plausible neural populations do not use this mathematically optimal tuning width, since it is associated with an extremely small overlap of tuning functions. The failure of a single neuron would leave an unrepresented gap in the stimulus space. In the case of larger redundancy, the above calculation is valid, and jitter in the tuning widths improves the encoding accuracy.
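The convexity argument can be made concrete in a few lines (a sketch; the mean width and box size are arbitrary choices): averaging $J_{ii} \propto \sigma_i^{-2} \prod_l \sigma_l$ over a uniform $\sigma_i$-distribution beats the value at the mean width, while averaging over $\sigma_j$ ($j \neq i$) leaves $J_{ii}$ unchanged.

```python
import numpy as np

def jii(sig_i, sig_j):
    """J_11 up to a constant for D = 2 (equation 5.2): sigma_2 / sigma_1**2."""
    return sig_j / sig_i ** 2

mean_sig, b = 0.5, 0.3
s = np.linspace(mean_sig - b / 2, mean_sig + b / 2, 100001)  # uniform box

# Variability in sigma_1 increases J_11 (convexity of 1/sigma_1**2) ...
J_var_own = np.mean(jii(s, mean_sig))
# ... while variability in sigma_2 leaves J_11 unchanged (linearity).
J_var_other = np.mean(jii(mean_sig, s))
J_uniform = jii(mean_sig, mean_sig)
```

This is just Jensen's inequality applied to the two arguments of $J_{11}$ separately.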
Hence, our result gives a theoretical argument for the large tuning width variability found in biological systems.

5.3 Specialized Subpopulations. Finally, a more complicated choice of tuning widths is discussed to demonstrate that uniform tuning properties across the population are not optimal. The population is split up into $D > 1$ subpopulations that differ by their tuning properties. Starting from a uniform tuning width $\bar{\sigma}$ for all neurons and features, the $i$th tuning width in the $i$th subpopulation is set to $\lambda\bar{\sigma}$, with a parameter $\lambda > 0$. The other tuning widths of the subpopulation are adjusted such that the receptive field volume in stimulus space $\propto \prod_{k=1}^{D} \sigma_k$ remains constant. Mathematically, this formation of subpopulations corresponds to the tuning width distribution
$$P_\sigma(\sigma_1, \ldots, \sigma_D) = \frac{1}{D} \sum_{i=1}^{D} \left[ \delta(\sigma_i - \lambda\bar{\sigma}) \prod_{j \neq i} \delta\!\left(\sigma_j - \lambda^{1/(1-D)}\bar{\sigma}\right) \right]. \quad (5.4)$$
Note that for $\lambda = 1$, the population is uniform: all neurons have tuning width $\bar{\sigma}$ for all stimulus dimensions. For $\lambda \neq 1$, the population is split up
Figure 7: Fragmentation into $D$ subpopulations. Fisher information increase $J/J(\lambda = 1)$ due to splitting of the population into $D$ subpopulations from equation 5.5, as a function of the splitting parameter $\lambda$, for $D = 2, 3, 4, 10$.
into $D$ subpopulations; in subpopulation $i$, $\sigma_i$ is different from the other tuning widths $\sigma_j$. The Fisher information for this population is given by

$$J_{ii} \propto \frac{(D-1)\,\lambda^{2D/(D-1)} + 1}{D\,\lambda^2}. \quad (5.5)$$
It does not depend on $i$ because of the symmetry in the subpopulations. The $\lambda$-dependent factor that quantifies the change of encoding accuracy as compared to the uniform population is plotted in Figure 7 for different values of $D$. It turns out that the uniform case $\lambda = 1$ is a local minimum of Fisher information: uniform tuning properties yield the worst encoding performance within the model class. Any other value of $\lambda$—any asymmetric receptive field shape for the subpopulations—leads to a more precise stimulus encoding. The increase of Fisher information for smaller $\lambda$ reflects the better encoding with specialized subpopulations for smaller $\sigma_i$, as discussed in section 5.1. In this limit, the $i$th subpopulation exclusively encodes feature $i$. For $\lambda \gg 1$, on the other hand, the $i$th subpopulation encodes all features but $x_i$. The Fisher information loss that results from decreasing the corresponding tuning widths is compensated by the increase of $\sigma_i$. The fact that the approximation 5.5 diverges for $\lambda \to 0$ and for $\lambda \to \infty$ is due to the idealization of assuming an infinite population size and an infinite stimulus space. Under these assumptions, the tuning curves will be
strongly deformed if $\lambda$ approaches extreme values, but the number of tuning curves that cover a given point in stimulus space always remains constant. In a finite neural population, on the other hand, the coverage decreases for large or small $\lambda$ for two reasons. First, for some features, the tuning curves become narrower, so that neighboring neurons in these directions cease to encode the stimulus. Second, in the other directions, where the tuning curves become broader, no new tuning curves reach the stimulus position, since the corresponding tuning curve centers would lie outside the boundaries of the population. Consequently, Fisher information eventually decreases as $\lambda \to 0$ or $\lambda \to \infty$ if the number of neurons is finite. The value of $\lambda$ at which this effect sets in depends on the initial tuning width $\bar{\sigma}$. The larger $\bar{\sigma}$, the sooner a considerable fraction of receptive fields will lie outside the stimulus space. A second effect that is expected is that values for $\lambda$ that are too large or too small will lead to unrepresented gaps in stimulus space, thus yielding a breakdown of encoding quality by the mechanism described in section 5.1. In contrast to the first effect, this breakdown occurs sooner if $\bar{\sigma}$ is smaller, since at given $g$, this implies a less dense coverage of stimulus space with tuning curves. These arguments are illustrated by an example in Figure 8. The initial deviation from the analytical curve can be attributed to the first effect; it is strongest for large $\bar{\sigma}$. The drop of Fisher information of the finite population at large or small $\lambda$ is due to unrepresented areas in stimulus space; it is most severe for the smallest value of $\bar{\sigma}$. Thus, our analysis leads to the prediction that for a finite-size neural population and a bounded stimulus space, there is an optimal level of specialization in terms of tuning curve nonuniformity.

6 Conclusion
We have presented a Fisher information analysis of the representational accuracy achieved by a population of stochastically spiking neurons that encode a stimulus with their spike counts during some fixed time interval. It turned out that the type of neuronal noise—the response variability—has a strong influence on the population Fisher information. It not only determines the increase of representational accuracy with increasing maximum spike count, but also governs how vulnerable the encoding scheme is to a high baseline firing level. Additive gaussian noise is very resistant, while Poissonian noise is disadvantageous, especially if many stimulus features are encoded simultaneously. The effect of noise correlations was also shown to depend on the response variability. The result that limited-range correlations can severely limit the effective population size was found to be a special feature of additive gaussian noise. In addition, we analyzed the influence of random variations in the correlation coefficient. This showed that configurations with uniform correlation strength are disadvantageous in the sense that at fixed mean correlation, variations improve the encoding on average. Finally, we discussed the effects of the tuning widths on the
Figure 8: Formation of subpopulations in a finite-size neural population. Analytical approximation (see equation 5.5) (solid) and mean Fisher information $J/J(\lambda = 1)$ of $N = 1000$ neurons uniformly distributed in the stimulus space $[0, 1]^2$ for $\bar{\sigma} = 0.05$ (dashed), $\bar{\sigma} = 0.075$ (dot-dashed), and $\bar{\sigma} = 0.1$ (dotted). Fisher information was calculated at the middle of stimulus space and averaged over $5 \times 10^5$ choices of the tuning curve positions. The apparent symmetry is due to the fact that in equation 5.4, $\lambda$ is equivalent to $\lambda^{-1}$ for $D = 2$. Note the logarithmic scale for $\lambda$ in this figure.
encoding accuracy. The case of radially symmetric receptive fields, which attracts most attention in the literature, yielded a worse encoding accuracy than all other cases we studied. Both variations of tuning widths within the population and the fragmentation of the population into subpopulations (with respect to tuning properties) improved the representational capacity. The latter case may be the reason for the existence of neural subpopulations specializing in certain sensory features, such as in the visual system (Livingstone & Hubel, 1988). Taken together, these results lead to three main conclusions. First, the structure of neuronal noise can substantially modify the encoding properties of neural systems. Empirically, most studies have reported that the spike count variance follows a power law of the mean spike count. We showed that the theoretical representational capacity strongly depends on the power law exponent in this case. This shows that choosing the correct response model can be critical for theoretical analysis. Second, the considerations on the parameter variability lead to the hypothesis that the great variability may not simply be a by-product of neuronal diversity but could be exploited by
the neural system to achieve better encoding performance. Finally, neural populations can choose from a wide variety of strategies to optimize their tuning properties. This may even occur in a dynamical or task-dependent way (Rolls, Treves, & Tovee, 1997; Wörgötter et al., 1998). Hence, the question of optimal tuning properties may not be reduced to a simple "broad or narrow" dichotomy.

Appendix A: Fisher Information for Neural Populations with Gaussian Spike Count Statistics
In this appendix, we present the calculation of Fisher information for several variants of gaussian spike count statistics. The starting point is equation 2.7, into which the covariance matrix, equation 4.1, is inserted. An analytical form for $J_{ij}(\mathbf{x})$ is derived for two special cases only: uniform correlations (see section A.1) and limited-range correlations for $D = 1$ (see section A.3). On the basis of these explicit formulas, expansions can be formed (see sections A.4 and A.5). For simplicity, the $\mathbf{x}$-dependence of the tuning curves $f_k(\mathbf{x})$ and of the covariance matrix $Q(\mathbf{x})$ is not noted explicitly in the following.

A.1 Uniform Correlations. For uniform correlations, $b = 0$, the inverse and the derivative of the covariance matrix, equation 4.1, read
$$(Q^{-1})_{kl} = \frac{1}{a f_k^a f_l^a}\,\frac{\delta_{kl}\,(Nq+1-q) - q}{(1-q)(Nq+1-q)},\tag{A.1}$$

$$\left(\frac{\partial Q}{\partial x_j}\right)_{kl} = a\left(g_k^{(j)} + g_l^{(j)}\right)Q_{kl},\tag{A.2}$$
where $g_k^{(j)} := \frac{1}{f_k}\frac{\partial f_k}{\partial x_j}$. Inserting this into equation 2.7, the first term of the Fisher information $J_{ij}(\mathbf{x})$ becomes

$$\frac{t^2}{a(1-q)}\left[\sum_{k=1}^{N} h_k^{(i)} h_k^{(j)} - \frac{q}{Nq+1-q}\sum_{k,l=1}^{N} h_k^{(i)} h_l^{(j)}\right]\tag{A.3}$$

$$= \frac{t^2}{a(1-q)}\left[F_{ij}(\mathbf{x};a) - \frac{qN}{Nq+1-q}\,G_{ij}(\mathbf{x};a)\right]\tag{A.4}$$
with the abbreviations

$$h_k^{(j)} := \frac{1}{f_k^a}\,\frac{\partial f_k}{\partial x_j},$$

$$F_{ij}(\mathbf{x};a) := \sum_{k=1}^{N}\frac{\partial f_k}{\partial x_i}\,\frac{1}{f_k^{2a}}\,\frac{\partial f_k}{\partial x_j} \;\longrightarrow\; \eta\times\text{const.},\tag{A.5}$$

$$G_{ij}(\mathbf{x};a) := \frac{1}{N}\sum_{k,l=1}^{N}\frac{\partial f_k}{\partial x_i}\,\frac{1}{f_k^a f_l^a}\,\frac{\partial f_l}{\partial x_j} \;\longrightarrow\; 0.\tag{A.6}$$
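The closed-form inverse (A.1) is easy to verify by direct numerical inversion. A minimal sketch (the parameter values and the tuning values $f_k$ are arbitrary choices of ours), which also checks the identity for $Q_{kl}(Q^{-1})_{lk}$ used later in this section:

```python
import numpy as np

# Check of equation A.1: inverse of the uniform-correlation covariance
# Q_kl = [delta_kl + q(1 - delta_kl)] * a * f_k^a * f_l^a  (equation 4.1 with b = 0).
N, q, a = 6, 0.3, 0.9
rng = np.random.default_rng(0)
f = rng.uniform(0.5, 2.0, N)            # hypothetical tuning-curve values f_k
fa = f ** a

Q = a * np.outer(fa, fa) * ((1 - q) * np.eye(N) + q * np.ones((N, N)))

# Closed form (A.1)
delta = np.eye(N)
Qinv_formula = (delta * (N * q + 1 - q) - q) / (
    a * np.outer(fa, fa) * (1 - q) * (N * q + 1 - q))
assert np.allclose(Qinv_formula, np.linalg.inv(Q))

# Identity used for the second trace term:
# Q_kl (Q^-1)_lk = [delta_kl (Nq + 1 - 2q + q^2) - q^2] / ((1 - q)(Nq + 1 - q))
lhs = Q * Qinv_formula.T                # element-wise product gives Q_kl (Q^-1)_lk
rhs = (delta * (N * q + 1 - 2 * q + q ** 2) - q ** 2) / ((1 - q) * (N * q + 1 - q))
assert np.allclose(lhs, rhs)
print("A.1 and the trace identity verified")
```

Note that the tuning-curve factors $f_k^a f_l^a$ cancel in the product $Q_{kl}(Q^{-1})_{lk}$, which is why the identity is independent of the tuning values.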
180
Stefan D. Wilke and Christian W. Eurich
The expressions on the right indicate the behavior of $F_{ij}(\mathbf{x};a)$ and $G_{ij}(\mathbf{x};a)$ in the continuous limit (see section A.2). Note that the Cauchy-Schwarz inequality implies $F_{ii}(\mathbf{x};a) \ge G_{ii}(\mathbf{x};a)$. The second term in equation 2.7 is given by

$$\frac{a^2}{2}\sum_{k=1}^{N}\sum_{l=1}^{N}\sum_{m=1}^{N}\sum_{n=1}^{N} Q_{kl}\left(g_k^{(j)} + g_l^{(j)}\right)(Q^{-1})_{lm}\,Q_{mn}\left(g_m^{(i)} + g_n^{(i)}\right)(Q^{-1})_{nk}.\tag{A.7}$$
Expanding the sums and canceling one of the inverse matrices, this becomes

$$a^2\sum_{k,l=1}^{N} Q_{kl}\,g_k^{(i)} g_k^{(j)}\,(Q^{-1})_{lk} + a^2\sum_{k,l=1}^{N} Q_{kl}\,g_k^{(i)} g_l^{(j)}\,(Q^{-1})_{lk}.\tag{A.8}$$
The first sum yields $a^2\sum_{k=1}^{N} g_k^{(i)} g_k^{(j)} = a^2 F_{ij}(\mathbf{x};1)$, and the second is simplified using the identity $Q_{kl}(Q^{-1})_{lk} = [\delta_{kl}(Nq+1-2q+q^2) - q^2]\,(1-q)^{-1}(Nq+1-q)^{-1}$, which results in

$$a^2 F_{ij}(\mathbf{x};1) + a^2\,\frac{Nq+1-2q+q^2}{(1-q)(Nq+1-q)}\,F_{ij}(\mathbf{x};1) - a^2\,\frac{q^2 N\,G_{ij}(\mathbf{x};1)}{(1-q)(Nq+1-q)}\tag{A.9}$$

$$= a^2\,\frac{qN\left[(2-q)\,F_{ij}(\mathbf{x};1) - q\,G_{ij}(\mathbf{x};1)\right] + 2(1-q)^2 F_{ij}(\mathbf{x};1)}{(1-q)(Nq+1-q)}\tag{A.10}$$
for the trace term of the Fisher information. Combining equations A.4 and A.10 and assuming $N \gg 1$, one arrives at equation 4.2. As a special case of equations A.4 and A.10, one finds for large uncorrelated ($q = 0$) neural populations equation 3.2, where the limit $N \gg 1$ must be taken after setting $q = 0$.

A.2 Continuous Limit. The analysis of the equations derived for the Fisher information in section A.1 is simplified considerably in the continuous limit, where it is assumed that
$$F_{ij}(\mathbf{x};a) = \int d^D c\;\eta(\mathbf{c})\,\frac{\partial f_c}{\partial x_i}\,\frac{1}{f_c^{2a}}\,\frac{\partial f_c}{\partial x_j}\tag{A.11}$$

$$\approx \delta_{ij}\,\eta\left\langle\frac{\prod_{r=1}^{D} s_r}{s_i s_j}\right\rangle\frac{4\pi^{D/2} F^{2-2a}}{\Gamma\!\left(1+\frac{D}{2}\right)}\int_0^\infty d\xi\;\xi^{D+1}\,\frac{w'(\xi^2)^2}{w(\xi^2)^{2a}}.\tag{A.12}$$
In equation A.12, $\langle\ldots\rangle$ denotes the average over the distribution of tuning widths, $P_s$, and $f_c$ is the tuning curve of a neuron with receptive field
center $\mathbf{c}$. In the derivation of equation A.12, we have used the identity $\int d^D\xi\,g(\xi^2)\,\xi_j^2 = \pi^{D/2}\,\Gamma\!\left(1+\frac{D}{2}\right)^{-1}\int_0^\infty d\xi\,\xi^{D+1}\,g(\xi^2)$ for $g(z) = w'(z)^2/w(z)^{2a}$. By a similar calculation and the fact that the tuning functions are symmetrical, one finds that $G_{ij}(\mathbf{x};a) = 0$ in this limit. The continuous limit is reached if the distribution of neurons in stimulus space, $\eta$, is sufficiently dense and uniform (Zhang & Sejnowski, 1999). The result, equation 3.3, is identical to equation A.12, except that the integral is abbreviated as $A_w(a, D)$. For additive noise ($a = 0$) and gaussian tuning curves, $w(z) = f + (1-f)\exp(-z)$, it becomes $A_w(a=0, D) = (1-f)^2\,\Gamma\!\left(1+\frac{D}{2}\right)/8$, so that

$$F_{ij}(\mathbf{x};a=0) = \delta_{ij}\,\eta\left\langle\frac{\prod_{r=1}^{D} s_r}{s_i s_j}\right\rangle\frac{\pi^{D/2} F^2 (1-f)^2}{2},\tag{A.13}$$
and equation 3.4 follows immediately.

A.3 Limited-Range Correlations. For limited-range correlations in one stimulus dimension, that is, $D = 1$, $q = 0$, and $b = 1$, the covariance matrix, equation 4.1, can be written as $Q_{kl} = \rho^{|k-l|}\,a f_k^a f_l^a$ if the tuning functions are equidistant and $\rho := \exp[-1/(L\eta)]$. In this case, its inverse is given by

$$(Q^{-1})_{kl} = \frac{1}{a f_k^a f_l^a}\,\frac{1}{1-\rho^2}\left[(1+\rho^2)\,\delta_{kl} - \rho\,(\delta_{k+1,l} + \delta_{k-1,l})\right],\tag{A.14}$$
and its derivative with respect to $x_1 = x$ is again of the form of equation A.2. Inserting this into equation 2.7, the first term of the Fisher information $J_{11}(\mathbf{x})$ becomes

$$\frac{t^2(1+\rho^2)}{a(1-\rho^2)}\,F(\mathbf{x};a) - \frac{2t^2\rho}{a(1-\rho^2)}\sum_{k=1}^{N-1} h_k^{(1)} h_{k+1}^{(1)},\tag{A.15}$$
where $F(\mathbf{x};a) := F_{11}(\mathbf{x};a)$. The sum in the second term of equation A.15 is equal to $F(\mathbf{x};a) - H(\mathbf{x};a)/2$, where

$$H(\mathbf{x};a) := \sum_{k=1}^{N-1}\left(\frac{1}{f_k^a}\,\frac{d f_k}{dx} - \frac{1}{f_{k+1}^a}\,\frac{d f_{k+1}}{dx}\right)^2 \;\longrightarrow\; \eta^{-1}\times\text{const.}\tag{A.16}$$
Again, the expression on the right indicates the asymptotic behavior for large $N/\eta$. Hence, equation A.15 can be written as

$$\frac{t^2(1-\rho)}{a(1+\rho)}\,F(\mathbf{x};a) + \frac{t^2\rho\,H(\mathbf{x};a)}{a(1-\rho^2)}.\tag{A.17}$$
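The tridiagonal inverse (A.14), on which equations A.15 through A.17 rest, can be checked by direct numerical inversion. A minimal sketch (arbitrary parameters of our own); away from the two corner diagonal elements, where A.14 is the large-$N$ form, the agreement is exact:

```python
import numpy as np

# Check of equation A.14: inverse of Q_kl = rho^{|k-l|} a f_k^a f_l^a.
N, a, rho = 8, 0.9, 0.6
rng = np.random.default_rng(1)
f = rng.uniform(0.5, 2.0, N)            # hypothetical tuning-curve values f_k
fa = f ** a

k = np.arange(N)
Q = rho ** np.abs(k[:, None] - k[None, :]) * a * np.outer(fa, fa)

delta = np.eye(N)
off = np.eye(N, k=1) + np.eye(N, k=-1)  # delta_{k+1,l} + delta_{k-1,l}
Qinv_formula = ((1 + rho ** 2) * delta - rho * off) / (
    (1 - rho ** 2) * a * np.outer(fa, fa))

Qinv = np.linalg.inv(Q)
# Exact agreement everywhere except the (0, 0) and (N-1, N-1) entries,
# which carry the boundary correction neglected in A.14.
mask = np.ones((N, N), dtype=bool)
mask[0, 0] = mask[-1, -1] = False
assert np.allclose(Qinv[mask], Qinv_formula[mask])
print("A.14 verified up to the two boundary diagonal elements")
```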
In the second term of equation 2.7, the calculation proceeds as in the uniform correlation case (cf. equations A.7 and A.8) with $i = j = 1$, only that $Q_{kl}(Q^{-1})_{lk} = \delta_{kl}(1+\rho^2)/(1-\rho^2) - \rho^2(\delta_{k+1,l}+\delta_{k-1,l})/(1-\rho^2)$ yields

$$a^2 F(\mathbf{x};1) + a^2\,\frac{1+\rho^2}{1-\rho^2}\,F(\mathbf{x};1) - a^2\,\frac{2\rho^2}{1-\rho^2}\sum_{k=1}^{N-1} g_k^{(1)} g_{k+1}^{(1)} = a^2\left(2F(\mathbf{x};1) + \frac{\rho^2 H(\mathbf{x};1)}{1-\rho^2}\right)\tag{A.18}$$
instead of equation A.10. Combining the two terms, A.17 and A.18, one finds the result, equation 4.3.

A.4 Uniform and Limited-Range Correlations. For $b \ll 1$ and $\rho \ll 1$, the covariance matrix, equation 4.1, can be written as $Q = R + \varepsilon S$, where $R_{kl} = [\delta_{kl} + q(1-\delta_{kl})]\,a f_k^a f_l^a$, $S_{kl} = (\delta_{k+1,l} + \delta_{k-1,l})\,a f_k^a f_l^a$, and $\varepsilon := b\rho \ll 1$. To obtain an explicit expression for $Q^{-1}$, one iteratively applies the matrix relation $Q^{-1} = R^{-1} - \varepsilon R^{-1}SQ^{-1}$ and finds the general expansion formula,

$$(R + \varepsilon S)^{-1} = R^{-1} - \varepsilon R^{-1}SR^{-1} + \varepsilon^2 R^{-1}SR^{-1}SR^{-1} + O(\varepsilon^3).\tag{A.19}$$
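The expansion (A.19) can be sanity-checked numerically: the residual after the $\varepsilon^2$ term should scale as $\varepsilon^3$, so halving $\varepsilon$ should shrink it by roughly a factor of eight. A sketch with arbitrary, well-conditioned test matrices of our own (not the model matrices themselves):

```python
import numpy as np

# Check the O(eps^3) remainder of the Neumann-type expansion (A.19).
N = 6
R = np.eye(N) + 0.3 * np.ones((N, N))   # uniform-correlation-like, invertible
S = np.eye(N, k=1) + np.eye(N, k=-1)    # nearest-neighbour structure

Rinv = np.linalg.inv(R)

def residual(eps):
    approx = Rinv - eps * Rinv @ S @ Rinv + eps ** 2 * Rinv @ S @ Rinv @ S @ Rinv
    return np.linalg.norm(np.linalg.inv(R + eps * S) - approx)

r1, r2 = residual(1e-2), residual(5e-3)
assert 6.0 < r1 / r2 < 10.0             # ~ 2^3 = 8 for an O(eps^3) remainder
print("A.19 remainder ratio (expect ~8):", r1 / r2)
```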
Since the inverse of $R$ is explicitly known (see equation A.1), equation A.19 can be used to obtain the Fisher information $J_{11}(\mathbf{x})$ from equation 2.7 to arbitrary order in $b\rho$. The matrix $R^{-1}SR^{-1}$ can be calculated explicitly; it yields a correction to the first term of equation 2.7 that approaches

$$-b\rho\,t^2\,\frac{d\mathbf{f}^T}{dx}\,R^{-1}SR^{-1}\,\frac{d\mathbf{f}}{dx} = -b\rho\,\frac{2t^2\left[F(\mathbf{x},a) - G(\mathbf{x},a)\right]}{a(1-q)^2}\tag{A.20}$$
for large $N$. The correction to the second term requires some more work. First, note that the derivative of $Q$ is given by equation A.2. Inserting the expansion A.19 into the trace and exploiting its cyclic invariance, one finds the two corrections,

$$-b\rho\,\mathrm{Tr}\!\left[R^{-1}SR^{-1}R'R^{-1}R'\right] + b\rho\,\mathrm{Tr}\!\left[R^{-1}S'R^{-1}R'\right],\tag{A.21}$$

in leading order. Evaluating the matrix $R^{-1}SR^{-1}R'R^{-1}R'$ explicitly and taking the trace, one has for large $N$,

$$-b\rho\,\mathrm{Tr}\!\left[R^{-1}SR^{-1}R'R^{-1}R'\right] = -b\rho\,\frac{2a^2 q\left[F(\mathbf{x};1) - G(\mathbf{x};1)\right]}{(1-q)^2},\tag{A.22}$$

which grows linearly with $N$. The second trace term in equation A.21 yields

$$-b\rho\,\mathrm{Tr}\!\left[R^{-1}S'R^{-1}R'\right] = -b\rho\,\frac{-8a^2 F(\mathbf{x};1)}{(1-q)N}\tag{A.23}$$
for large $N$, which approaches a constant and can therefore be neglected. Combining the equations A.20 and A.22 with the ($b = 0$)-result, equation 4.3, derived in section A.3, one arrives at equation 4.5.

A.5 Uniform Correlations with Jitter. In order to analyze noisy uniform correlations of strength $q$, one assumes for the covariance matrix $Q = R + sS$, where $R$ is as above, $S_{kl} = s_{kl}\,a f_k^a f_l^a$, and the matrix elements $s_{kl}$ for $k < l$ are independent and identically distributed gaussian random variables with zero mean and unit standard deviation, while $s_{kk} = 0$. Choosing $s_{kl} = s_{lk}$ for $k > l$ ensures that $Q$ is symmetric. This yields the relations

$$\langle s_{kl}\rangle = 0, \qquad \langle s_{kl}\,s_{mn}\rangle = (1 - \delta_{kl}\delta_{mn})\,(\delta_{km}\delta_{ln} + \delta_{kn}\delta_{lm}).\tag{A.24}$$
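This jitter ensemble is easy to simulate directly. The sketch below (parameter values are our own choices; the factor $a f_k^a f_l^a$ is dropped, as in appendix B) diagonalizes $R + sS$ and checks the minimum bulk eigenvalue $\lambda_{\min} = 1 - q - 2\sqrt{N}s$ derived in appendix B:

```python
import numpy as np

# Monte Carlo check of the spectrum of R + sS:
# R_kl = delta_kl + q(1 - delta_kl), and S is a symmetric jitter matrix with
# independent N(0,1) entries above the diagonal and zeros on the diagonal.
N, q, s = 400, 0.3, 0.005               # s small compared to q, as assumed here
rng = np.random.default_rng(3)

R = (1 - q) * np.eye(N) + q * np.ones((N, N))
G = rng.standard_normal((N, N))
S = np.triu(G, k=1)
S = S + S.T                             # symmetric, zero diagonal

lam = np.linalg.eigvalsh(R + s * S)
radius = 2 * np.sqrt(N) * s             # semicircle radius around 1 - q

# One large eigenvalue near (N-1)q + 1; the bulk fills the semicircle.
assert abs(lam[-1] - ((N - 1) * q + 1)) < 1.0
assert abs(lam[0] - (1 - q - radius)) < 0.05 * radius + 0.02
print("lambda_min =", lam[0], "predicted:", 1 - q - radius)
```

With these parameters the bulk of the spectrum fills the interval $[1-q-2\sqrt{N}s,\ 1-q+2\sqrt{N}s]$ predicted by the semicircle law of appendix B, plus one isolated eigenvalue near $(N-1)q+1$.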
The matrix expansion formula, A.19, is now employed to find the inverse of $Q$. If the resulting standard deviation $s$ is small compared to $q$, one has for the first term of the Fisher information, equation 2.7, the correction

$$t^2 s^2\,\frac{\partial \mathbf{f}^T}{\partial x_i}\,R^{-1}\langle SR^{-1}S\rangle R^{-1}\,\frac{\partial \mathbf{f}}{\partial x_j} = t^2 s^2\,\frac{N\left[F_{ij}(\mathbf{x};a) - G_{ij}(\mathbf{x};a)\right]}{a(1-q)^3},\tag{A.25}$$
which is computed using equation A.24 and the explicit form of $R$. Here $\langle\ldots\rangle$ denotes the average over the joint probability distribution of the matrix elements of $S$. Inserting equation A.19 into the trace term in equation 2.7 and sorting by orders of $s$, the leading-order correction is found to be

$$\frac{s^2}{2}\left\langle\mathrm{Tr}\left[2R'R^{-1}R'R^{-1}SR^{-1}SR^{-1} + R'R^{-1}SR^{-1}R'R^{-1}SR^{-1} - 2R'R^{-1}S'R^{-1}SR^{-1} - 2R'R^{-1}SR^{-1}S'R^{-1} + S'R^{-1}S'R^{-1}\right]\right\rangle,\tag{A.26}$$

where the cyclic invariance of the trace was used. Inserting $R$ explicitly and making use of the relations A.24, the five trace terms become

$$2s^2a^2N^2\left[\frac{(4-3q)\,F_{ij}(\mathbf{x};1) - q\,G_{ij}(\mathbf{x};1)}{(1-q)^3} + \frac{2G_{ij}(\mathbf{x};1)}{(1-q)^2} + (-2-2+1)\,\frac{F_{ij}(\mathbf{x};1) + G_{ij}(\mathbf{x};1)}{(1-q)^2}\right]\tag{A.27}$$
for large $N$. Combining this result with the first correction, equation A.25, one arrives at the expansion 4.7.

Appendix B: Eigenvalue Density of a Covariance Matrix with Jitter

In this section we derive the eigenvalue density of the covariance matrix $R + sS$ discussed in section A.5. The additional factor $a f_k^a f_l^a$ that is dropped
here does not influence the appearance of a zero eigenvalue, so that the conclusion drawn in section 4.5 will hold for the full covariance matrix. The distribution of the diagonal elements of $S$ is irrelevant for the eigenvalue density in the limit of large $N$ (e.g., Mehta, 1991). Thus, the eigenvalue density of the matrix ensemble $R + sS$ remains unaltered if the diagonal elements $S_{kk}$ are turned into independent gaussian random variables with standard deviation $\sqrt{2}s$. In the latter case, $R + sS$ represents a deformed gaussian orthogonal ensemble (Brody et al., 1981). Its eigenvalue density, $\rho(\lambda)$, can be determined via the relations (Pastur, 1972)

$$\rho(\lambda) = -\frac{1}{\pi}\,\mathrm{Im}[g(\lambda)], \qquad g(\lambda) = \frac{1}{N}\,\mathrm{Tr}\left[\frac{1}{\lambda - R - s^2 N g(\lambda)}\right].\tag{B.1}$$

In the example considered here, the eigenvalues of the matrix $R$ are $(N-1)q + 1$, once, and $1 - q$, $N - 1$ times. Thus, the second equation in B.1 becomes

$$g(\lambda) = \frac{1}{N}\,\frac{1}{\lambda - [(N-1)q + 1] - s^2 N g(\lambda)} + \frac{N-1}{N}\,\frac{1}{\lambda - (1-q) - s^2 N g(\lambda)},\tag{B.2}$$

which yields the solution

$$g(\lambda) = \frac{1}{2Ns^2}\left[\lambda - (1-q) \pm \sqrt{[\lambda - (1-q)]^2 - 4Ns^2}\right]\tag{B.3}$$

in the limit of large $N$, the minus sign resulting in a semicircular mean eigenvalue density (Wigner, 1957),

$$\rho(\lambda) = \begin{cases}\dfrac{1}{2\pi N s^2}\sqrt{4Ns^2 - [\lambda - (1-q)]^2} & \text{for } |\lambda - (1-q)| \le 2\sqrt{N}s,\\[4pt] 0 & \text{otherwise.}\end{cases}\tag{B.4}$$

The correction to $\rho(\lambda)$ associated with the single eigenvalue $(N-1)q + 1$ is of order $1/N$ and concerns larger eigenvalues $\gg 1-q$ only (except for the degenerate case $q = 0$, where equation B.3 follows from B.2 without further approximations). The minimum eigenvalue is at the left end of the semicircle described by equation B.4, that is, $\lambda_{\min} = 1 - q - 2\sqrt{N}s$.

Acknowledgments
We thank M. Bethge, K. Pawelzik, H. Schwegler, and M. Wezstein for helpful discussions and the anonymous reviewers for insightful comments on an earlier version of the manuscript for this article. This work was supported by Deutsche Forschungsgemeinschaft, SFB 517.
References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101.
Bair, W., & O'Keefe, L. P. (1998). The influence of fixational eye movements on the response of neurons in area MT of the macaque. Visual Neuroscience, 15, 779–786.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.
Bethge, M., Rotermund, D., & Pawelzik, K. (2000). Optimal short-term population coding: When Fisher information fails. Manuscript submitted for publication.
Brody, T. A., Flores, J., French, J. B., Mello, P. A., Pandey, A., & Wong, S. S. M. (1981). Random-matrix physics: Spectrum and strength fluctuations. Reviews of Modern Physics, 53(3), 385–479.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.
Callen, H. B. (1985). Thermodynamics and an introduction to thermostatistics (2nd ed.). New York: Wiley.
Cardoso de Oliveira, S., Thiele, A., & Hoffmann, K.-P. (1997). Synchronization of neuronal activity during stimulus expectation in a direction discrimination task. Journal of Neuroscience, 17(23), 9248–9260.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cramér, H. (1946). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Dan, Y., Alonso, J.-M., Usrey, W. M., & Reid, R. C. (1998). Coding of visual information by precisely correlated spikes in the lateral geniculate nucleus. Nature Neuroscience, 1(6), 501–507.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Experimental Brain Research, 44, 437–440.
DeAngelis, G. C., Ghose, G. M., Ohzawa, I., & Freeman, R. D. (1999).
Functional micro-organization of primary visual cortex: Receptive field analysis of nearby neurons. Journal of Neuroscience, 19(9), 4046–4064.
deCharms, R. C., & Zador, A. (2000). Neural representation and the cortical code. Annual Reviews of Neuroscience, 23, 613–647.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2(8), 740–745.
Eckhorn, R., Grüsser, O.-J., Kröller, J., Pellnitz, K., & Pöpel, B. (1976). Efficiency of different neural codes: Information transfer calculations for three different neuronal systems. Biological Cybernetics, 22, 49–60.
Eurich, C. W., & Schwegler, H. (1997). Coarse coding: Calculation of the resolution achieved by a population of large receptive field neurons. Biological Cybernetics, 76, 357–363.
Eurich, C. W., & Wilke, S. D. (2000). Multi-dimensional encoding strategy of spiking neurons. Neural Computation, 12(7), 1519–1529.
Eurich, C. W., Wilke, S. D., & Schwegler, H. (2000). Neural representation of multidimensional stimuli. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 115–121). Cambridge, MA: MIT Press.
Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Computation, 12, 671–692.
Gautrais, J., & Thorpe, S. (1998). Rate coding versus temporal order coding: A theoretical approach. BioSystems, 48, 57–65.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13(7), 2758–2771.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Gershon, E. D., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortices. Journal of Neurophysiology, 79, 1135–1144.
Ghose, G. M., Ohzawa, I., & Freeman, R. D. (1994). Receptive-field maps of correlated discharge between pairs of neurons in the cat's visual cortex. Journal of Neurophysiology, 71(1), 330–346.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. Journal of Neuroscience, 17(8), 2914–2920.
Hatsopoulos, N. G., Ojakangas, C. L., Paninski, L., & Donoghue, J. P. (1998). Information about movement direction obtained from synchronous activity of motor cortical neurons. Proceedings of the National Academy of Sciences USA, 95, 15706–15711.
Heller, J., Hertz, J. A., Kjaer, T. W., & Richmond, B. J. (1995). Information flow and temporal coding in primate pattern vision. Journal of Computational Neuroscience, 2, 175–193.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986).
Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Johnson, D. H., Gruner, C. M., Baggerly, K., & Seshagiri, C. (2001). Information-theoretic analysis of neural coding. Journal of Computational Neuroscience, 10, 47–69.
Johnson, K. O. (1980). Sensory discrimination: Neural processes preceding discrimination decision. Journal of Neurophysiology, 43(6), 1793–1815.
Karbowski, J. (2000). Fisher information and temporal correlations for spiking neurons with stochastic dynamics. Physical Review E, 61(4), 4235–4252.
Kay, S. M. (1993). Fundamentals of statistical signal processing: Estimation theory. Englewood Cliffs, NJ: Prentice Hall.
Knudsen, E. I., & Konishi, M. (1978). A neural map of auditory space in the owl. Science, 200, 795–797.
Kuffler, S. W. (1953). Discharge patterns and functional organization of the mammalian retina. Journal of Neurophysiology, 16, 37–68.
Laughlin, S. B., de Ruyter van Steveninck, R., & Anderson, J. C. (1998). The metabolic cost of neural information. Nature Neuroscience, 1(1), 36–41.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. Journal of Neuroscience, 18(3), 1161–1170.
Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frog's eye tells the frog's brain. Proceedings of the Institution of Radio Engineers, 47, 1940–1951.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Computation, 8, 531–543.
Livingstone, M. S., & Hubel, D. H. (1988). Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science, 240, 740.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Maynard, E. M., Hatsopoulos, N. G., Ojakangas, C. L., Acuna, B. D., Sanes, J. N., Normann, R. A., & Donoghue, J. P. (1999). Neuronal interactions improve cortical population coding of movement direction. Journal of Neuroscience, 19(18), 8083–8093.
Mehta, M. L. (1991). Random matrices (2nd ed.). San Diego, CA: Academic Press.
Oram, M. W., Földiák, P., Perrett, D. I., & Sengpiel, F. (1998). The "ideal homunculus": Decoding neural population signals. Trends in Neurosciences, 21, 259–265.
Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network: Computation in Neural Systems, 7, 365–370.
Panzeri, S., Schultz, S. R., Treves, A., & Rolls, E. T. (1999a). Correlations and the encoding of information in the nervous system. Proceedings of the Royal Society of London B, 266, 1001–1012.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. T. (1999b). On decoding the responses of a population of neurons from short time windows. Neural Computation, 11, 1553–1577.
Paradiso, M. A.
(1988). A theory of the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pastur, L. A. (1972). On the spectrum of random matrices. Theoretical and Mathematical Physics, 10, 67–74. (Russian original: Teoreticheskaya i Matematicheskaya Fizika, 10(1), 102–112, 1972.)
Pouget, A., Deneve, S., Ducom, J.-C., & Latham, P. E. (1999). Narrow vs. wide tuning curves: What's best for a population code? Neural Computation, 11, 85–90.
Rao, C. R. (1945). Information and accuracy obtainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91.
Reich, D. S., Victor, J. D., Knight, B. W., Ozaki, T., & Kaplan, E. (1997). Response variability and timing precision of neuronal spike trains in vivo. Journal of Neurophysiology, 77, 2836–2841.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in primate temporal visual cortex. Experimental Brain Research, 114, 149–162.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences USA, 90, 10749–10753.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18(10), 3870–3896.
Snippe, H. P., & Koenderink, J. J. (1992a). Discrimination thresholds for channel-coded systems. Biological Cybernetics, 66, 543–551.
Snippe, H. P., & Koenderink, J. J. (1992b). Information in channel-coded systems: Correlated receivers. Biological Cybernetics, 67, 183–190.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Teich, M. C., Johnson, D. H., Kumar, A. R., & Turcott, R. G. (1990). Rate fluctuations and fractional power law noise recorded from cells in the lower auditory pathway of the cat. Hearing Research, 46, 41–52.
Teich, M. C., & Khanna, S. M. (1985). Pulse number distribution for the neural spike train in the cat's auditory nerve. Journal of the Acoustical Society of America, 77, 1110–1128.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Research, 23, 775–785.
Tolhurst, D. J., Movshon, J. A., & Thompson, I. D. (1981). The dependence of response amplitude and variance of cat visual cortical neurones on stimulus contrast.
Experimental Brain Research, 41, 414–419.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. Journal of Neurophysiology, 70(2), 640–654.
van Kan, P. L. E., Scobey, R. P., & Gabor, A. J. (1985). Response covariance in cat visual cortex. Experimental Brain Research, 60, 559–563.
Vogels, R., Spileers, W., & Orban, G. A. (1989). The response variability of striate cortical neurons in the behaving monkey. Experimental Brain Research, 77, 432–436.
White, J. A., Rubinstein, J. T., & Kay, A. R. (2000). Channel noise in neurons. Trends in Neurosciences, 23(3), 131–137.
Wiggers, W., Roth, G., Eurich, C., & Straub, A. (1995). Binocular depth perception mechanisms in tongue-projecting salamanders. Journal of Comparative Physiology A, 176, 365–377.
Wigner, E. P. (1957). Characteristic vectors of bordered matrices with infinite dimensions. II. Annals of Mathematics, 65, 203–207.
Wilke, S. D., & Eurich, C. W. (1999). What does a neuron talk about? In M. Verleysen (Ed.), Proceedings of the European Symposium on Artificial Neural Networks 1999 (pp. 435–440). Brussels: D Facto.
Wilke, S. D., & Eurich, C. W. (2001). Neural spike statistics modify the impact of background noise. Neurocomputing, 38–40, 445–450.
Wörgötter, F., Suder, K., Zhao, Y., Kerscher, N., Eysel, U. T., & Funke, K. (1998). State-dependent receptive-field restructuring in the visual cortex. Nature, 396, 165–168.
Yoon, H., & Sompolinsky, H. (1999). The effect of correlations on the Fisher information of population codes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length. Neural Computation, 7, 549–564.
Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received September 13, 2000; accepted April 2, 2001.
LETTER
Communicated by Suzanna Becker
Supervised Dimension Reduction of Intrinsically Low-Dimensional Data

Nikos Vlassis
[email protected]
RWCP, Autonomous Learning Functions SNN, University of Amsterdam, The Netherlands

Yoichi Motomura
[email protected]
Electrotechnical Laboratory, Tsukuba Ibaraki 305-8568, Umezono 1-1-4, Japan

Ben Kröse
[email protected]
RWCP, Autonomous Learning Functions SNN, University of Amsterdam, The Netherlands

High-dimensional data generated by a system with limited degrees of freedom are often constrained to low-dimensional manifolds in the original space. In this article, we investigate dimension-reduction methods for such intrinsically low-dimensional data through linear projections that preserve the manifold structure of the data. For intrinsically one-dimensional data, this implies projecting to a curve on the plane with as few intersections as possible. We propose a supervised projection pursuit method that can be regarded as an extension of the single-index model for nonparametric regression. We show results from a toy and two robotic applications.

1 Introduction
Suppose one is given the data shown in Figure 1a. These are 200 three-dimensional points sampled from a parameterized conic-like spiral,

$$f(s) = [(12\pi - 2s)\cos s,\; (12\pi - 2s)\sin s,\; (s - 2\pi)^2]^T,\tag{1.1}$$

plus gaussian noise of zero mean and unit variance. The parameter $s$ is evenly spaced in $[0, 5\pi]$, while the quadratic form of the third component makes the curve fold onto itself after $s = 2\pi$. Trying a random orthogonal projection of this data set onto the plane, one observes that in most cases the resulting curve is nonsimple: it intersects itself. A curve that does not intersect itself is called a simple curve

Neural Computation 14, 191–215 (2001) © 2001 Massachusetts Institute of Technology
Figure 1: (a) Three-dimensional spiral data. (b) Projection on the first two principal components. (c) Projection to a simple curve with the proposed model, $h_y = 0.22$ (const), $h_s = 0.36$ (const).
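The construction behind Figure 1 is straightforward to reproduce. A minimal sketch (the random seed and implementation details are our own choices): it generates the spiral of equation 1.1 with unit-variance noise and projects it onto the first two principal components, as in Figure 1b:

```python
import numpy as np

# Spiral data of equation 1.1 plus zero-mean, unit-variance gaussian noise,
# projected onto the first two principal components.
rng = np.random.default_rng(0)
n = 200
s = np.linspace(0.0, 5 * np.pi, n)
X = np.stack([(12 * np.pi - 2 * s) * np.cos(s),
              (12 * np.pi - 2 * s) * np.sin(s),
              (s - 2 * np.pi) ** 2], axis=1)
X += rng.standard_normal(X.shape)

Xc = X - X.mean(axis=0)
# Principal directions from the SVD of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T                    # 3 x 2 projection matrix, orthonormal columns
Y = Xc @ W                      # projected curve, generally not simple

assert Y.shape == (n, 2)
assert np.allclose(W.T @ W, np.eye(2))
```

Any $d \times 2$ matrix with orthonormal columns can be swapped in for `W` to explore other projections of the same data.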
(Spivak, 1979). Even projections that are optimal according to criteria like maximum variance of the projected data can lead to nonsimple curves. In Figure 1b we plot the resulting curve after projection on the first two principal components of the data, where we note an intersection point. However, there can exist projections that lead to simple curves; one such solution is shown in Figure 1c. Although this setting might seem artificial, in fact it can be very realistic. It is often the case with real-life applications that a system generates data that lie in a high-dimensional space but the system itself has limited degrees of freedom. One example is a robot that moves along a corridor with fixed orientation and observes its environment with sensors. Although the observations are high-dimensional, they are generated by a system with a single degree of freedom (translation of the robot) and thus can have much lower intrinsic dimensionality. In particular, assuming a smooth environment, the data must form a smooth one-dimensional manifold in the original space.
The embedding of such a nonlinear structure in a high-dimensional Euclidean space is, at least from a conceptual point of view, redundant, and methods that extract relevant features from the data are in order. The simplest and most interpretable feature extraction method is a linear projection, and this should be done in such a way that the topological structure of the data is preserved during the projection. Besides, there are strong statistical arguments that necessitate the extraction of features from the data. It is known that for reasonably sized data sets, any nonparametric statistical inference in high-dimensional spaces fails due to the "curse of dimensionality," meaning the sparsity of points in any neighborhood of interest. An illustrative numerical example of this behavior is given in Huber (1985). Extracting linear features from the data prior to modeling is the central theme of all projection pursuit techniques. In this article, we study the problem of dimension reduction (through linear projections) of high-dimensional data that are intrinsically low-dimensional. For clarity we will limit our exposition to intrinsically one-dimensional data, but our method can be directly applied to higher intrinsic dimensionalities. For our definition of intrinsic dimensionality, we will adopt the concept of an inverse regression curve (Li, 1991), while our main focus will be on projections onto the plane. Our projection pursuit method is supervised in the sense that the manifold parameterization of the data is known, and thus it can be regarded as a model-free, nonadditive, projection pursuit regression method (see Ripley, 1996, for a review). However, a supervised dimension-reduction method like the one we propose can be more than just regression.
We rephrase (Li, 1991): During the early stages of data analysis, when one does not have a fixed objective in mind, the need for estimating quantities like conditional averages (regression) is not as pressing as that for finding ways to simplify the data. After projecting the data to a good feature space, we are in a better position to identify what should be pursued further, for example, model building, response surface estimation, or cluster analysis. The proposed method can be regarded as an extension of the single-index model for nonparametric regression (Härdle, Hall, & Ichimura, 1993). This is a fully nonparametric regression model for predicting a scalar variable from a high-dimensional vector. We argue, and show with a real-life example, that for intrinsically low-dimensional data, the single-index model can be inadequate, and an extension of it is required. In the next section, we formalize the problem, and in section 3 we describe the proposed method and compare it with the single-index model. The important issues of optimal smoothing and optimization are considered in section 4. In section 5 we demonstrate the proposed method on two robotic applications, and in section 6 we discuss related and further research issues.

2 Intrinsic Dimensionality and Linear Projections
We start with a definition of intrinsic dimensionality.
Definition. We will say that a random vector $\mathbf{x} \in \mathbb{R}^d$ is intrinsically one-dimensional with manifold variable $s$ when the conditional average $E[\mathbf{x}\,|\,s]$ defines a smooth curve $f(s)$ in $\mathbb{R}^d$ and the conditional variance $\mathrm{Var}[\mathbf{x}\,|\,s]$ is bounded. The curve $f(s)$ is called the inverse regression curve (Li, 1991).
We now assume a data set $\mathbf{x}_i$, $1 \le i \le n$, of independent and identically distributed (i.i.d.) observations of an intrinsically one-dimensional random vector $\mathbf{x} \in \mathbb{R}^d$, with corresponding manifold values $s_i$. According to the definition above, the $d$-dimensional data $\mathbf{x}_i$ are constrained to live in a one-dimensional smooth manifold embedded in $\mathbb{R}^d$, plus noise. One such example is the spiral data shown in Figure 1a, with $d = 3$, where the inverse regression curve $f(s)$ that generated the data is also plotted. We are interested in reducing the dimensionality of such data by linearly projecting them to a subspace $\mathbb{R}^q$, $1 < q < d$, multiplying them with a $d \times q$ matrix $W$ with orthonormal columns,

$$\mathbf{y}_i = W^T \mathbf{x}_i, \qquad 1 \le i \le n, \qquad W^T W = I_q,\tag{2.1}$$
where $I_q$ stands for the $q$-dimensional identity matrix. Clearly, the above mapping, being continuous, will project the original inverse regression curve $f(s)$ to a new curve $g(s)$, the inverse regression curve of the projected variable $\mathbf{y}$ given the manifold variable $s$. We argued in section 1 that feature extraction from such data is imperative. Moreover, it is often the case that for relatively large $d$, there is a linear projection that maps the original curve $f(s)$ to a plane curve $g(s)$ that preserves most of the topological structure of the original data. One way to find such a projection is by forcing the projected curve to be as near to a simple curve as possible, in other words, by minimizing the number of self-intersections. A formal way to compute the number of self-intersections of $g(s)$ is by monitoring the average modality of the conditional density $p(s\,|\,\mathbf{y})$. Clearly, if $g(s)$ is simple, $p(s\,|\,\mathbf{y})$ will exhibit a single mode for all values of $\mathbf{y}$; a single intersection of $g(s)$ at $\mathbf{y} = \mathbf{y}_o$ will make $p(s\,|\,\mathbf{y} = \mathbf{y}_o)$ bimodal, and so forth. This observation makes the task of projecting to a simple curve easier because there are several methods for assessing the degree of modality of a distribution. Moreover, since this will be part of an optimization procedure, the measure of multimodality of $p(s\,|\,\mathbf{y})$ must be computed quickly. The above discussion implies that we need a nonparametric estimate of $p(s\,|\,\mathbf{y})$. For an appropriate sequence of weights $\lambda_j(\mathbf{y})$, $1 \le j \le n$, such an estimate $\hat p(s\,|\,\mathbf{y})$ is (Stone, 1977)

$$\hat p(s\,|\,\mathbf{y}) = \sum_{j=1}^{n} \lambda_j(\mathbf{y})\, w_{h_s}(s - s_j),\tag{2.2}$$
Dimension Reduction of Intrinsically Low-Dimensional Data
195
where

$$w_{h_s}(s) = \frac{1}{\sqrt{2\pi}\, h_s} \exp\left(-\frac{s^2}{2 h_s^2}\right) \tag{2.3}$$

is the univariate gaussian kernel with bandwidth $h_s$, defining a local smoothing region around $s$. A weight function $\lambda_j(y)$, which satisfies the conditions in Stone (1977) and makes the above estimate a smooth function of the projection matrix $W$, is

$$\lambda_j(y) = \frac{w_{h_y}(y - y_j)}{\sum_{k=1}^{n} w_{h_y}(y - y_k)}, \tag{2.4}$$

where

$$w_{h_y}(y) = \frac{1}{(2\pi)^{q/2}\, h_y^q} \exp\left(-\frac{\|y\|^2}{2 h_y^2}\right) \tag{2.5}$$
is the $q$-dimensional spherical gaussian kernel with bandwidth $h_y$. The two kernel bandwidths $h_y$ and $h_s$ are the only free parameters of the model $\hat{p}(s \mid y)$, and their values affect the resulting projections. We discuss possible choices later.

3 Nonparametric Measures of Multimodality
The above discussion has made it clear that we need a way to assess the degree of modality of $p(s \mid y)$ using the nonparametric estimate $\hat{p}(s \mid y)$ from equation 2.2. This estimate must be computed quickly, a requirement that renders approximations more favorable than exact methods.

3.1 The Proposed Model. The proposed measure is based on the simple observation that for a given point $x_i$, which is projected through equation 2.1 to $y_i$, the density $p(s \mid y = y_i)$ will always exhibit a mode at $s = s_i$. Moreover, if $g(s)$ is a simple curve, this will be the only mode of $p(s \mid y = y_i)$. Thus, an approximate measure of multimodality is the Kullback-Leibler distance between $p(s \mid y = y_i)$ and a unimodal density sharply peaked at $s = s_i$, giving the approximate estimate $-\log p(s_i \mid y = y_i)$ plus a constant.¹ Averaging over all points $y_i$, we have to minimize the risk
$$R_K = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(s_i \mid y = y_i), \tag{3.1}$$
¹ As one reviewer pointed out, it would be more accurate to call this a measure of unimodality, since for a given peak of $p(s \mid y = y_i)$ at $s = s_i$ and any number of additional modes, this measure would not discriminate between these distributions. We come back to this point in section 6.4.
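To make the estimator concrete, equations 2.2 through 2.5 and the risk of equation 3.1 can be sketched in a few lines of numpy. This is an illustrative sketch of ours, not the authors' Matlab implementation; the function name and the synthetic spiral in the usage example are assumptions for illustration.

```python
import numpy as np

def risk_rk(X, s, W, hy, hs):
    """Negative average log-likelihood R_K of equation 3.1.

    X : (n, d) sphered data; s : (n,) manifold values;
    W : (d, q) matrix with orthonormal columns; hy, hs : kernel bandwidths.
    """
    Y = X @ W                                              # y_i = W^T x_i (eq. 2.1)
    D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)    # ||y_i - y_j||^2
    Ky = np.exp(-D2 / (2 * hy ** 2))                       # w_hy up to a constant that cancels below
    lam = Ky / Ky.sum(axis=1, keepdims=True)               # lambda_j(y_i), eq. 2.4
    Ks = np.exp(-(s[:, None] - s[None, :]) ** 2 / (2 * hs ** 2)) \
         / (np.sqrt(2 * np.pi) * hs)                       # w_hs(s_i - s_j), eq. 2.3
    p = (lam * Ks).sum(axis=1)                             # p_hat(s_i | y = y_i), eq. 2.2
    return -np.log(p).mean()                               # eq. 3.1
```

Evaluating `risk_rk` over candidate orthonormal matrices `W` (e.g., from QR decompositions of random matrices) and keeping the minimizer mimics the search over projections described below.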
N. Vlassis, Y. Motomura, and B. Kröse
with $\hat{p}(s \mid y = y_i)$ the nonparametric estimate of $p(s \mid y = y_i)$ computed from equation 2.2. This risk can be regarded as the negative average log-likelihood of the data given a model defined by the kernel bandwidths $h_y$, $h_s$, and the projection matrix $W$. Applying Bayes' rule to $p(s_i \mid y = y_i)$ gives an alternative interpretation of the above risk,

$$R_K = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid s = s_i) + \frac{1}{n} \sum_{i=1}^{n} \log p(y_i) + \text{const}, \tag{3.2}$$
which shows that the optimal projection tries to "stretch" the projected points $y_i$ as far from each other as possible by minimizing the (logarithm of the) density of the $y$ vector (the second term), while at the same time reducing the "noise" of $y$ conditional on $s$, expressed by the negative log-likelihood of $p(y_i \mid s = s_i)$ (the first term).

3.2 The Double-Index Model. If a projection to a simple curve $g(s)$ does exist, then $p(s \mid y)$ is strictly unimodal and can be approximated by a gaussian with constant variance. Then the above risk becomes proportional to the mean squared error,

$$R_D = \frac{1}{n} \sum_{i=1}^{n} \{s_i - \hat{m}(y_i)\}^2, \tag{3.3}$$
with $\hat{m}(y_i)$ the nonparametric estimate of the expectation of $p(s \mid y = y_i)$ computed with the Nadaraya-Watson formula for nonparametric regression,

$$\hat{m}(y_i) = \sum_{j=1}^{n} \lambda_j(y_i)\, s_j, \tag{3.4}$$
with $\lambda_j(y_i)$ from equation 2.4. Minimization of equation 3.3 with respect to the projection matrix $W$ leads to a nonparametric regression model known as the single-index model for $q = 1$ and the multiple-index model for $q \ge 2$ (Härdle et al., 1993). For projections to the plane, we adopt hereafter the term double-index model (DIM). In this model, as we see from equation 3.3, the squared distance of $s_i$ to its nonparametric conditional mean $\hat{m}(y_i)$ acts as a diagnostic of the multimodality of $p(s \mid y = y_i)$. This, in turn, implies a gaussian approximation to the "true" $p(s \mid y = y_i)$. At first glance, this approximation seems too naive, since it is continuously violated during optimization (at least during the first steps), as most random projections will yield nonsimple curves. However, if a projection to a simple curve exists, the global minimum of $R_D$ must correspond to this projection. In fact, it is an individual projection that makes $p(s \mid y = y_i)$ multimodal, and assuming a gaussian
model for it does not imply a violation of its real parametric shape but is rather a constraint to be satisfied by the optimization algorithm.

3.3 Limitations of the Double-Index Model. The above discussion reveals two limitations of the DIM. The first is related to optimization. Although a projection to a simple curve may exist, the mean squared error $R_D$ can exhibit very sharp local minima. One way to see this is by noticing how poor an indicator of multimodality the squared distance to the mean can be: densities with the same variance can have varying higher moments and thus can exhibit different modalities. The Givens parameterization (see section 4.3) allows the visualization in Figure 2 of the landscapes of the risks $R_K$ and $R_D$ for the spiral data, where the sharp minima of $R_D$ are evident. Such a local minimum solution of the DIM is shown in Figure 3c, where for the points $y_i$ around $[-0.7, 0.9]^T$, the conditional density $p(s \mid y = y_i)$ is
Figure 2: The landscapes of the (a) negative log-likelihood $R_K$ and (b) mean squared error $R_D$. Although the global minimum occurs for the same parameter vector, the landscape of $R_D$ exhibits sharper local minima.
Figure 3: Projections of the spiral data. (a) Double-index model, $h_y = 0.13$. (b) Proposed method with optimized bandwidths, $h_y = 0.03$, $h_s = 0.02$. (c) Local minimum solution of the DIM, $h_y = 0.04$.
approximately symmetric and trimodal, making the algorithm get stuck in a local minimum. The most severe limitation of the DIM, however, is that it can be used only when the original curve $f(s)$ does not intersect itself in $R^d$. Clearly, a multimodality of the conditional density $p(s \mid x)$ in the original space $R^d$ will lead to an equivalent multimodality for $p(s \mid y)$, and this cannot be remedied by any projection. A typical situation is when the curve $f(s)$ constitutes a closed manifold, in which case the respective density $p(s \mid x)$ will exhibit bimodality. For any parameterization of the curve, a single point on it will be the image of two different manifold values: $\exists\, s_1, s_2 : s_1 \neq s_2,\ f(s_1) = f(s_2)$. In this case, the failure of the DIM to compute the correct projections is a direct consequence of the fact that the risk $R_D$ assumes unimodal $p(s \mid y)$. The proposed risk $R_K$ in equation 3.1 does not suffer from the above two limitations and can be used even for data lying on closed manifolds. This will be illustrated in section 5.1 with a real-life application.
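For comparison, the double-index risk of equations 3.3 and 3.4 admits an equally short sketch. As before, this is our illustrative numpy code, reusing the weights of equation 2.4; the function name is an assumption.

```python
import numpy as np

def risk_rd(X, s, W, hy):
    """Mean squared error R_D of equation 3.3 (double-index model)."""
    Y = X @ W                                              # project, eq. 2.1
    D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)    # ||y_i - y_j||^2
    Ky = np.exp(-D2 / (2 * hy ** 2))                       # unnormalized w_hy weights
    lam = Ky / Ky.sum(axis=1, keepdims=True)               # lambda_j(y_i), eq. 2.4
    m_hat = lam @ s                                        # Nadaraya-Watson estimate, eq. 3.4
    return np.mean((s - m_hat) ** 2)                       # eq. 3.3
```

Because $R_D$ compares $s_i$ only with a conditional mean, two well-separated modes of $p(s \mid y = y_i)$ can average out, which is exactly the failure mode discussed above.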
4 Model Selection and Optimization

4.1 Kernel Smoothing. We discuss here the selection of the kernel bandwidths $h_y$ and $h_s$ in equations 2.5 and 2.3, respectively. For the DIM, this problem has been solved by the cogent proof in Härdle et al. (1993) that a cross-validation estimate of the Nadaraya-Watson conditional mean allows the simultaneous optimization of both the projection directions (columns of $W$) and the kernel bandwidth $h_y$ in equation 2.5 (note that the DIM requires no kernel smoothing on $s$). The cross-validation estimate of $m(y_i)$ is given by

$$\tilde{m}(y_i) = \frac{\sum_{j \neq i} w_{h_y}(y_i - y_j)\, s_j}{\sum_{j \neq i} w_{h_y}(y_i - y_j)}, \tag{4.1}$$
as in equation 3.4 but with the point $(y_i, s_i)$ excluded from the summations in the numerator and the denominator. Application of the DIM to the spiral data results in the curve shown in Figure 3a. For our model we do not have such a proof, but the same methodology seems to apply as well. Specifically, using the cross-validation estimate of $p(s \mid y = y_i)$,

$$\tilde{p}(s \mid y = y_i) = \frac{\sum_{j \neq i} w_{h_y}(y_i - y_j)\, w_{h_s}(s - s_j)}{\sum_{j \neq i} w_{h_y}(y_i - y_j)}, \tag{4.2}$$

and plugging it into the risk, equation 3.1, we were able to optimize simultaneously over the matrix $W$ and the two kernel bandwidths $h_y$ and $h_s$. The resulting projection of the spiral data is shown in Figure 3b. However, the lack of a proof makes this procedure questionable, so we investigated the possibility of dropping the cross-validation estimate and assigning constant values to $h_y$ and $h_s$ during optimization. From Hall (1989), we know that in projection pursuit regression, the mean integrated squared error (MISE) optimal bandwidth for gaussian data, $h_y = O(n^{-1/5})$, gives direction estimates with error $O(n^{-2/5})$, while under assumptions, this error can be further dropped to $O(n^{-1/2})$ by choosing a bandwidth between $O(n^{-1/3})$ and $O(n^{-1/4})$. The solution we adopted for the bandwidth $h_y$ and for projections to the plane was $h_y = n^{-2/7}$, which can be kept fixed during optimization after sphering the data (see below). This value satisfies the above bounds and gives good results in practice. For the $s$-bandwidth we chose the gaussian MISE optimal value $h_s = (3n/4)^{-1/5}$ (Wand & Jones, 1995). For these two values, the proposed method projected the spiral data as shown in Figure 1c.

4.2 Sphering. Sphering the data $x_i$ (normalizing them to zero mean and identity covariance matrix) makes the kernel bandwidth $h_y$ independent of the projection. Then $h_y$ can be kept constant during optimization, leading to considerable computational savings. Sphering means a rotation
of the data to their principal component analysis (PCA) directions and then standardization of the individual variances to one. To avoid modeling noise in the data, it is typical to ignore directions with small eigenvalues; a heuristic way to do this is by calculating the ratio of the cumulative variance (added eigenvalues) to the total variance. A reasonable threshold is 0.8. The numerically most accurate way to sphere the data is by singular value decomposition (Press, Teukolsky, Flannery, & Vetterling, 1992). Let $X$ be the $n \times d$ matrix whose rows are the data $x_i$ after they have been normalized to zero mean. For $n > d$, we compute the singular value decomposition $X = U \Lambda V^T$ of the matrix $X$ and form the matrix $A = \sqrt{n}\, V \Lambda^{-1}$. The points $XA$ are then sphered. For $n \le d$, the data $x_i$ lie in general in an $(n-1)$-dimensional Euclidean subspace of $R^d$. In this case, it is more convenient to compute the principal directions through eigenanalysis of $K = X X^T$, the inner products matrix of the zero-mean data. We compute its singular value decomposition $K = U \Lambda V^T$ and remove the last column of $V$ and the last column and row of $\Lambda$ (the last eigenvalue of $K$ will always be zero). Then we form the matrix $A = \sqrt{n}\, V \Lambda^{-1}$. The points $KA$ are $(n-1)$-dimensional and sphered. Moreover, all projections of sphered data $x_i$ in the form of equation 2.1 also give sphered data $y_i$ because

$$E[y y^T] = E[W^T x x^T W] = W^T E[x x^T] W = W^T W = I_q, \tag{4.3}$$
due to the constraint of orthonormal columns of $W$. This frees us from having to reestimate (co)variances of the projected data in each step of the optimization algorithm. In the following, we assume that the data $x_i$ have already been sphered and the manifold data $s_i$ have been normalized to zero mean and unit variance.

4.3 Optimization. The smooth form of the risk $R_K$ as a function of $W$, $h_y$, and $h_s$ allows its minimization with nonlinear optimization. For constrained optimization, we must compute the gradient of $R_K$ and the gradient of the constraint function $W^T W - I_q$ with respect to $W$, and then plug these estimates into a constrained nonlinear optimization routine (Gill, Murray, & Wright, 1981). The gradient of $R_K$ with respect to $W$ is given in section A.1. An alternative approach that avoids constrained nonlinear optimization, in a similar problem involving kernel smoothing for discriminant analysis, has recently been proposed by Torkkola and Campbell (2000). The idea is to parameterize the projection matrix $W$ by a product of Givens (Jacobi) rotation matrices (Press et al., 1992) and then optimize with respect to the angle parameters involved in each matrix. For projections from $R^d$ to $R^q$, this parameterization takes the form

$$W = \prod_{o=1}^{q} \prod_{u=q+1}^{d} G_{ou}, \tag{4.4}$$
where $G_{ou}$ is a Givens rotation matrix that equals $I_d$ except for the elements $g_{oo} = \cos\theta_{ou}$, $g_{ou} = \sin\theta_{ou}$, $g_{uo} = -\sin\theta_{ou}$, and $g_{uu} = \cos\theta_{ou}$ for an angle $\theta_{ou}$ that depends on $o$ and $u$. For simplicity, we let $g_{oo}, g_{ou}, \ldots$ in the above notation denote the $(o,o)$-th, $(o,u)$-th, $\ldots$ elements of the matrix $G_{ou}$, respectively. To ensure that $W$ is $d \times q$, only the first $q$ columns of the last matrix $G_{qd}$ in equation 4.4 are retained, while multiplications must be carried out from right to left to reduce the evaluation cost (see section A.3). Multiplication with a matrix $G_{ou}$ causes a rotation by $\theta_{ou}$ over the plane defined by the dimensions $o$ and $u$, while the range of indices in equation 4.4 ensures that all rotations take place over planes defined by at least one nonprojective direction, that is, one among the $d - q$ remaining dimensions. This fact also reduces the total number of parameters from $qd$ in the constrained optimization case (elements of the matrix $W$) to $q(d - q)$ here (angles $\theta_{ou}$). The required derivatives are given in section A.2. However, as we see in Figure 2, the mixture-like form of equation 2.2 and the additional trigonometric functions in equation 4.4 can make the landscape of the risk $R_K$ have numerous local minima. For this reason, combining a gradient-free optimization method like Nelder-Mead with nonlinear optimization is advisable. A projection (through sphering) to a low-dimensional Euclidean subspace prior to optimization can also significantly facilitate the search. In any case, the optimization algorithm must be applied many times, and the solution with the minimum risk must be retained. In section A.3 we derive some complexities and discuss some implementation issues.

5 Applications

5.1 Appearance Models for Visual Servoing.
One potential application of the proposed method is in robot visual servoing: the ability of a robot manipulator to position itself at a desired pose with respect to an object, using visual information from a camera mounted on the robot's end-effector. Usually, this task is carried out with a geometric representation of the object, which is used to estimate the displacement of the robot with respect to the object; this estimate is then fed to an appropriate controller. However, building such geometric models can be cumbersome when dealing with objects with complex shapes. Nayar, Nene, and Murase (1996) describe an alternative approach where the appearance of the object is modeled for different poses of the camera. This appearance model is learned by capturing a large set of object images while incrementally displacing the robot's end-effector. Because all images are from the same object, consecutive images are strongly correlated, given that no large variations in lighting conditions occur. The image set is compressed using PCA to obtain a low-dimensional subspace. In this low-dimensional space, the variations due to the robot displacement are represented in the form of a parameterized manifold. The manifold is then a continuous representation of the visual workspace of the servoing task.
Figure 4: A subset of the duck images. Closing the loop results in a closed manifold in the image space.
Visual positioning is achieved by first projecting a new image to the eigenspace and then determining the robot displacement to the desired position from the location of the projection on the parameterized manifold. In this application, it is obvious that the manifold should be nonintersecting; otherwise, two different robot poses will correspond to the same low-dimensional representation, leading to bad pose estimation. In Nayar et al. (1996), the number of eigenvectors was chosen high enough to avoid intersections. Here we illustrate that our supervised dimension-reduction method gives a better solution, in the sense that the dimensionality of the resulting subspace in which the parameterized manifold lives can be lower. This can be very important when nonparametric statistical inference in the reduced space is in order. We carried out an experiment involving images of an object (a duck) corresponding to consecutive views obtained every 10 degrees of rotation about a single axis of the object. The data set is supervised in the sense that the angle of view in each image is known. The last image coincides with the first image in the sequence, giving a total set of 37 images forming a closed manifold in the image space. A subset of these images is shown in Figure 4. We sphered the data by applying PCA using the inner products matrix of the zero-mean data (see section 4.2). It turned out that the first four eigenvectors (those with the largest eigenvalues) explained more than 80% of the total variance, so we discarded the other dimensions. From this four-dimensional space, projecting to the plane with the proposed method was easy. In Figure 5, we show the resulting projections using PCA, the proposed method with optimized and fixed bandwidths, and the double-index model. We observe that PCA and the DIM project to nonsimple curves, whereas the proposed method successfully finds a projection to a simple curve, verifying our theoretical claims.
If the resulting manifold is to be used for visual servoing, it is therefore advisable to use a supervised linear projection instead of PCA.

5.2 Mobile Robot Localization. A second application of the proposed method is in mobile robot localization, a term referring to the ability of a robot to estimate its position in its workspace at any moment. We show here the importance of having a good low-dimensional representation (features) of the robot observations. In contrast to the previous application, where visual servoing can be carried out by learning a direct
Figure 5: Projections of the duck image data from 4D space. (a) PCA projection. (b) Proposed method with optimized bandwidths, $h_y = 0.08$, $h_s = 0.1$. (c) Proposed method with fixed bandwidths, $h_y = 0.36$, $h_s = 0.52$. (d) Double-index model, $h_y = 0.47$. The dotted lines indicate where the manifold closes.
mapping from features to poses (e.g., using a neural network; see Nayar et al., 1996), here the necessity of propagating the location predictions over time, as we show next, renders direct mappings impossible. In particular, the inverse mapping from robot positions to observation features is needed for localization. Let us assume that at time $t$, the robot has a belief, a probability density function $b_t(s_t)$, of its position in the workspace. We also assume two probabilistic models: (1) a model of the robot kinematics in the form of the conditional density $p(s_t \mid u_{t-1}, s_{t-1})$, where $u_{t-1}$ is an action that takes the robot from its previous position $s_{t-1}$ to $s_t$, and (2) an observation model $p(y_t \mid s_t)$ that realizes the inverse mapping from robot positions to features $y$. Using the Markov assumption that the past has no effect on observations beyond the previous time step, the update rule for the robot belief is (Thrun, 2000)

$$b_t(s_t) \propto p(y_t \mid s_t) \int p(s_t \mid u_{t-1}, s_{t-1})\, b_{t-1}(s_{t-1})\, ds_{t-1}. \tag{5.1}$$
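A minimal particle-filter sketch of the belief update in equation 5.1 follows. This is our illustration, not the authors' implementation: a one-dimensional position, a gaussian kinematics model, and a generic observation likelihood are assumed.

```python
import numpy as np

def belief_update(particles, weights, u, obs_lik, motion_std, rng):
    """One step of equation 5.1 with a weighted particle approximation of b(s).

    u : commanded translation; obs_lik : callable s -> p(y_t | s),
    e.g., the kernel observation model of equation 5.2.
    """
    # Sample s_t from the kinematics model p(s_t | u, s_{t-1}).
    particles = particles + u + rng.normal(0.0, motion_std, size=particles.shape)
    # Reweight by the observation model p(y_t | s_t) and renormalize.
    weights = weights * obs_lik(particles)
    weights = weights / weights.sum()
    # Resample to keep the particle set from degenerating.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

Starting from a uniform belief (the kidnapped robot setting described below), repeated updates concentrate the particles around the position supported by the observations.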
Figure 6: The robot trajectory in our building.
Moreover, a convenient way to implement this density propagation scheme is through Monte Carlo sampling from the various densities in the above equation. This amounts to approximating the belief density $b_t(s_t)$ with a weighted set of "particle" points sampled from $b_t(s_t)$, in which case equation 5.1 enjoys a very simple implementation, while the method has been shown to give very satisfactory results in practice. (For details, we refer to Thrun, 2000.) We applied the above algorithm to data collected by a Nomad Scout robot following a predefined trajectory in our mobile robot lab and the adjoining hall, as shown in Figure 6. The omnidirectional imaging device mounted on top of the robot consists of a vertically mounted standard camera aimed upward, looking into a spherical mirror. The data set contains 104 omnidirectional images (320 × 240 pixels) captured every 25 centimeters along the robot path. Each image is transformed to a panoramic image (64 × 256), and these 104 panoramic images, together with the robot positions along the trajectory, constitute the training set of our algorithm. A typical panoramic image, shot at position A of the trajectory, is shown in Figure 7. In order to apply our supervised projection method, we first sphered the panoramic image data using the inner products matrix, as explained above, and kept the first 10 dimensions, explaining about 60% of the total variance. The robot positions were normalized to zero mean and unit variance. Then we applied our method, projecting the sphered data points from 10-D to 2-D. In Figure 8 we plot the resulting two-dimensional projections using PCA and our supervised projection method. We clearly see the advantage
Figure 7: Panoramic snapshot from position A in the robot trajectory.
of the proposed method over PCA. The risk is smaller, and from the shape of the projected manifold, we see that taking into account the robot position during projection can significantly improve the resulting features: there are fewer self-intersections of the projected manifold in our method than in PCA, which means better robot position estimation on average (smaller risk). In Figure 9 we show the localization performance of the robot when running the Monte Carlo localization method using both the PCA and the supervised features. For the robot kinematics, we assumed a simple gaussian model with standard deviation half the traveled distance of the robot in each action $u_{t-1}$ (translation), while the observation model $p(y \mid s)$ is given by equation 2.2 with the roles of $s$ and $y$ interchanged and the same (fixed) kernel bandwidths,

$$p(y \mid s) = \frac{\sum_{j=1}^{n} w_{h_s}(s - s_j)\, w_{h_y}(y - y_j)}{\sum_{j=1}^{n} w_{h_s}(s - s_j)}. \tag{5.2}$$
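The observation model of equation 5.2 is itself a one-line kernel estimate. A numpy sketch of ours for a single query pair $(y, s)$, with hypothetical names:

```python
import numpy as np

def obs_model(y, s, s_train, y_train, hs, hy):
    """p(y | s) of equation 5.2 from training pairs (s_j, y_j)."""
    ws = np.exp(-(s - s_train) ** 2 / (2 * hs ** 2))       # w_hs(s - s_j); constant cancels in the ratio
    q = y_train.shape[1]
    d2 = ((y - y_train) ** 2).sum(axis=1)                  # ||y - y_j||^2
    wy = np.exp(-d2 / (2 * hy ** 2)) / ((2 * np.pi) ** (q / 2) * hy ** q)  # eq. 2.5
    return (ws * wy).sum() / ws.sum()
```

Evaluating `obs_model` over a particle set gives the likelihood term needed by the Monte Carlo belief update of equation 5.1.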
We placed the robot at position B on the trajectory and planned its translation toward position C. The corresponding projections of the panoramic images in the traversed B→C part of the trajectory are shown with dashed lines in Figure 8. No prior knowledge about the initial position of the robot was provided (the kidnapped robot problem), in which case the initial belief $b_1(s_1)$ is a uniform density over the $s$ space. In Figure 9 we plot the estimated position of the robot as the average of $b_t(s_t)$ at any moment $t \ge 2$, together with $1\sigma$ error bars, when using Monte Carlo localization with the PCA projections of Figure 8a and the supervised projections of Figure 8b. We see that by using the supervised projections, the algorithm converges faster to the true position of the robot (dashed lines) and is more stable than with the PCA solution. This result clearly shows the advantage of using a supervised dimension-reduction method instead of an unsupervised one like PCA. In both applications, we tried both constrained nonlinear optimization using sequential quadratic programming (Gill et al., 1981) and unconstrained optimization using the Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Figure 8: Projection of the panoramic image data from 10-D space. (a) Projection on the first two principal components. (b) Supervised projection using the proposed method with fixed bandwidths. The part with the dashed lines corresponds to projections of the panoramic images captured by the robot between positions B and C of its trajectory.
algorithm with the Givens parameterization. The latter was more sensitive to local minima but many times faster than sequential quadratic programming, so it was our main choice. We repeated the optimization algorithm several times starting from random values for the Givens angles in
Figure 9: Localization performance of the robot (estimated position versus true position) using Monte Carlo localization with (a) the PCA features, and (b) the supervised features, for a translation from point B to point C along the trajectory.
$[-\pi/2, \pi/2]$ and small values for the bandwidths, in case the latter were also optimized. In order to make sure that the global minimum solution was reached, we ran the Nelder-Mead algorithm until convergence, and from this solution we started nonlinear optimization with BFGS, a common practice in similar optimization problems (see, e.g., Härdle et al., 1993). A gradient-free method like Nelder-Mead requires more evaluations of the objective function than BFGS, but this is compensated by the better global convergence properties of the former. The complexity analysis in section A.3 can be used for efficient evaluation of the objective function and its derivatives. A Matlab implementation is available from the first author.

6 Discussion
We have described a method for supervised dimension reduction (feature extraction) of intrinsically low-dimensional data and justified its use with some real-life applications. One issue that should be reemphasized is that the proposed method should not be regarded as a regression method (although it can also be used for regression) but rather as a feature extraction method that takes into account the supervised information of the data during optimization. The mobile robot application provided a clear motivation for the use of the method in this sense. The resulting features might need to be used for realizing the inverse mapping from label data (robot positions) to feature data, and not necessarily for predicting the label data from new observations. Besides, the extracted features can also be used for other purposes (Li, 1991). It is also this fact that renders the nonparametric models adopted in this article attractive: as long as we are interested in the projection matrix $W$ only, the model that we use during optimization must be succinct and employ as few parameters other than the elements of $W$ as possible. In our case, the kernel bandwidths are also model parameters to be estimated, but typically they will be used after feature extraction for kernel smoothing in the resulting $y$ and $s$ spaces. In the rest of this section, we discuss some additional topics related to the method and outline some related and future work.

6.1 Information-Theoretic Concepts. A nonparametric estimate of the modality of $p(s \mid y)$ must be computed quickly and thus be approximate. A more principled approach than the proposed Kullback-Leibler distance to a peaked density would be an information-theoretic one: an indicator of the number of peaks of $p(s \mid y = y_i)$ is the entropy of $p(s \mid y = y_i)$, which reaches its minimum value when the density is unimodal (note that $s$ conditioned on $y = y_i$ does not have constant variance).
Averaging over all $y_i$ corresponds to mutual information maximization between the projected variable $y$ and the manifold variable $s$, which, since $s$ is independent of the projection, corresponds to minimizing the risk (where we have used the empirical estimate of $p(y)$):

$$R_H = -\frac{1}{n} \sum_{i=1}^{n} \int p(s \mid y = y_i) \log p(s \mid y = y_i)\, ds; \tag{6.1}$$
however, the presence of the integral and the logarithm makes the evaluation of this quantity costly. One possibility is to replace the Shannon entropy in the above formula with the Renyi entropy of order two (Cover & Thomas, 1991),

$$h_2(s \mid y = y_i) = -\log \int p(s \mid y = y_i)^2\, ds, \tag{6.2}$$

and then use the nonparametric estimate of $p(s \mid y)$ from equation 2.2. This makes the integral in equation 6.1 analytical (Principe & Xu, 1999); however, unless some application-specific approximations are employed, the cost of the above estimate is $O(n^3)$. Finally, we note from equation 3.1 that the log-likelihood risk $R_K$ can be regarded as a rough approximation to the above integral, equation 6.1, assuming that $p(s \mid y = y_i)$ is sufficiently peaked about $s_i$. The gained speedup of one order of magnitude (or more; see the discussion about the fast Fourier transform below) can be a good justification for the approximation.

6.2 Dropping the Complexity with the Fast Fourier Transform. As shown in section A.3, the bottleneck of the method lies in the quadratic complexity of computing all the pairwise distances of the projected points $y_i$ and then smoothing with the gaussian kernel in equation 2.4. However, one property of the gaussian kernel is that smoothing can be efficiently carried out with the fast Fourier transform (Silverman, 1982). This involves a two-stage approach where the projected data are first binned (an approximate histogram of them is computed) and then kernel smoothing is carried out by discrete convolution with a discretized gaussian kernel (see Wand & Jones, 1995, for details). This approach drops the quadratic cost to $O(n \log_2 n)$, which is the cost of discrete convolution with the fast Fourier transform.

6.3 Optimization Issues. The efficiency of the optimizer in locating the global minimum solution depends on the original dimension $d$ of the data.
In our experiments, we noticed that for large values of $d$ (e.g., $d > 10$), the optimizer often had difficulties locating the global minimum solution. If $d$ is too large (e.g., images), then dimension reduction through sphering can help lower $d$ to some manageable number. Sphering to a dimension approximately equal to $\sqrt{n}$ is a reasonable choice, as justified by the complexity analysis in section A.3. However, if projections from large $d$ are in order, alternative optimization methods might be needed, for example, stochastic search or fixed-point iterations. In particular, the log-likelihood form of equation 3.1 suggests that optimization with the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) might be possible, although in its current form, the M-step is not analytical. An alternative definition of the weight function $\lambda_j(y)$ in equation 2.4 can make the M-step analytical, and we are investigating this possibility.
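Returning to the fast Fourier transform idea of section 6.2, binning followed by discrete convolution can be sketched as follows. This is a one-dimensional illustration of ours with a hypothetical function name; the article applies the same idea to the projected points $y_i$.

```python
import numpy as np

def binned_kde(x, grid, h):
    """Approximate gaussian kernel density on a regular grid via binning + FFT."""
    counts, _ = np.histogram(x, bins=len(grid), range=(grid[0], grid[-1]))
    dx = grid[1] - grid[0]
    # Discretized gaussian kernel, truncated at 4 bandwidths.
    half = int(4 * h / dx) + 1
    t = np.arange(-half, half + 1) * dx
    kern = np.exp(-t ** 2 / (2 * h ** 2)) / (np.sqrt(2 * np.pi) * h)
    # Linear convolution via FFT: O(m log m) instead of the O(n^2) pairwise sums.
    m = len(counts) + len(kern) - 1
    conv = np.fft.irfft(np.fft.rfft(counts, m) * np.fft.rfft(kern, m), m)
    return conv[half: half + len(grid)] / len(x)
```

The grid resolution controls the binning error; for bandwidths of the order used in this article, a few hundred bins per dimension are typically sufficient.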
Another interesting topic is the relaxation of the constraint of orthonormality of the columns of $W$, allowing more general features to be discovered. Such a step would require a better optimization strategy in order to avoid columns of $W$ converging to singular or degenerate solutions. Moreover, since equation 4.3 would no longer hold, the $h_y$ kernel bandwidth could not be kept fixed during optimization and should be either optimized using the cross-validation approach mentioned above or computed by estimating the (co)variances of the projected points in each optimization step.

6.4 An Alternative Measure of Multimodality. Another measure of multimodality of $p(s \mid y)$ in a supervised nonlinear feature extraction problem has been proposed by Thrun (1998). There, multimodality was measured by the average distance to the "correct" mode $s_i$, giving the risk

$$R_L = \frac{1}{n} \sum_{i=1}^{n} \int |s - s_i|\, p(s \mid y = y_i)\, ds, \tag{6.3}$$

which penalizes modes that appear far from $s_i$. In Thrun (1998) this risk was approximated from the sample with cost $O(n^3)$. However, if instead of the absolute loss function $|s - s_i|$ we take the square loss $(s - s_i)^2$ and use the nonparametric estimate of $p(s \mid y)$ from equation 2.2, the risk reads

$$R_L = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_j(y_i) \int (s - s_i)^2\, w_{h_s}(s - s_j)\, ds, \tag{6.4}$$

and the above integral can be analytically computed as $(s_i - s_j)^2 + h_s^2$, yielding the risk (plus a constant)

$$R_L = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_j(y_i)\, (s_i - s_j)^2. \tag{6.5}$$
This risk is $O(n^2)$ and can be regarded as an approximate Bayes risk with negative log-likelihood loss (Stone, 1977). Moreover, Jensen's inequality can relate this risk to both $R_K$ and $R_D$. For the latter, write

$$\{s_i - \hat{m}(y_i)\}^2 = \Big\{ \sum_j \lambda_j(y_i)(s_i - s_j) \Big\}^2, \tag{6.6}$$

and then apply $E[x]^2 \le E[x^2]$. Plots of the parameter landscape for this risk showed a large resemblance to the landscape of $R_D$. This can be explained by the fact that in both risks, no smoothing is carried out in the s space, in contrast to the proposed risk $R_K$, which essentially ignores modes of $p(s \mid y = y_i)$ that appear far from the true
position $s_i$. As we showed in the mobile robot localization application, positioning is achieved by propagating the belief estimates over time; therefore, far-away modes of the conditional density $p(s \mid y = y_i)$ are easier to handle by the localization mechanism, while modes close to $s_i$ are more likely to cause a problem. This fact also justifies the use of the Kullback-Leibler distance in the definition of $R_K$ in equation 3.1 as a measure of multimodality of $p(s \mid y = y_i)$.

6.5 Unsupervised Dimension Reduction. Finally, a challenging problem is the unsupervised feature extraction of intrinsically low-dimensional data, where "unsupervised" means that the manifold variable is unobserved in the sample. In this case, the inverse regression curve must be substituted by a principal curve (Hastie & Stuetzle, 1989), and projection direction estimation must be interleaved with curve fitting. We are not aware of any such work and believe that it deserves attention.

7 Conclusion
We proposed a supervised dimension-reduction method that can be used for data sets that are intrinsically low-dimensional. We discussed the case of one-dimensional intrinsic dimensionality and linear projections to the plane, but the method can be applied to other configurations as well. We compared with the single-index (double-index) model for nonparametric regression and showed that in some cases, the proposed method can provide more accurate solutions. We derived gradients to be used with either constrained or unconstrained nonlinear optimization and carried out a time complexity analysis of the associated estimates. We demonstrated the proposed method on two robotic applications and showed its potential over other methods.

Appendix

A.1 Derivatives of $R_K$: Constrained Optimization Case. We compute the derivatives of the risk $R_K$ in equation 3.1 assuming that $p(s \mid y)$ is given by equation 2.2. The results extend easily to the cross-validation case. It is not difficult to see that the derivative of $R_K$ with respect to a parameter $\rho$ (an element of W or an angle $\theta_{kl}$) is

$$\frac{\partial R_K}{\partial \rho} = \frac{-1}{2nh_y^2} \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij} \frac{\partial}{\partial \rho} \|y_i - y_j\|^2, \tag{A.1}$$
where

$$b_{ij} = \lambda_j(y_i) - \frac{w_{h_y}(y_i - y_j) \, w_{h_s}(s_i - s_j)}{\sum_{k=1}^{n} w_{h_y}(y_i - y_k) \, w_{h_s}(s_i - s_k)}. \tag{A.2}$$
In the constrained optimization case, we can directly compute the gradient

$$\nabla_W \|y_i - y_j\|^2 = \nabla_W (y_i - y_j)^T (y_i - y_j) = 2(x_i - x_j)(y_i - y_j)^T; \tag{A.3}$$

substituting in equation A.1 and expanding, we get

$$\nabla_W R_K = \frac{-1}{nh_y^2} \Bigg\{ \sum_i x_i y_i^T \sum_j b_{ij} - \sum_i \sum_j b_{ij} x_i y_j^T - \sum_i \sum_j b_{ij} x_j y_i^T + \sum_j x_j y_j^T \sum_i b_{ij} \Bigg\}. \tag{A.4}$$
The first term is zero by the definition of $b_{ij}$ in equation A.2, while the rest can be written in the convenient matrix notation,

$$\nabla_W R_K = \frac{1}{nh_y^2} X^T [B + B^T - \mathrm{diag}(\mathbf{1}^T B)] Y, \tag{A.5}$$
where X is the $n \times d$ matrix of the sphered data, Y is the $n \times q$ matrix of the projected data, B is the $n \times n$ matrix with elements $b_{ij}$, $\mathbf{1}$ is a column vector of all ones, and $\mathrm{diag}(\cdot)$ transforms a vector to a diagonal matrix. Multiplying from right to left leads to reduced evaluation cost (see section A.3).

A.2 Derivatives of $R_K$: Unconstrained Optimization Case. In the unconstrained optimization case, it is not difficult to see that it holds

$$\frac{\partial}{\partial \theta_{kl}} \|y_i - y_j\|^2 = 2(x_i - x_j)^T \Big( \frac{\partial W}{\partial \theta_{kl}} \Big) (y_i - y_j), \tag{A.6}$$
where the derivative of W with respect to $\theta_{kl}$ can be computed from equation 4.4 as

$$\frac{\partial W}{\partial \theta_{kl}} = \prod_{o=1}^{q} \prod_{u=q+1}^{d} \frac{\partial}{\partial \theta_{kl}} G_{ou}, \tag{A.7}$$
where

$$\frac{\partial}{\partial \theta_{kl}} G_{ou} = \begin{cases} G'_{ou} & \text{if } k = o \text{ and } l = u, \\ G_{ou} & \text{otherwise,} \end{cases} \tag{A.8}$$
and $G'_{ou}$ is the matrix $G_{ou}$ with the ones substituted by zeros and the trigonometric functions substituted by their derivatives. Expanding equation A.1
using equation A.6, we get the derivative of $R_K$ with respect to an angle $\theta_{kl}$ as

$$\frac{\partial R_K}{\partial \theta_{kl}} = \frac{1}{nh_y^2} \operatorname{trace}\Big\{ X \Big( \frac{\partial W}{\partial \theta_{kl}} \Big) Y^T [B + B^T - \mathrm{diag}(\mathbf{1}^T B)] \Big\}. \tag{A.9}$$

Using the permutation property of the trace and the gradient (see equation A.5), the above derivative can be simplified as

$$\frac{\partial R_K}{\partial \theta_{kl}} = \operatorname{trace}\Big\{ (\nabla_W R_K)^T \frac{\partial W}{\partial \theta_{kl}} \Big\}, \tag{A.10}$$
in accordance with the chain rule for derivatives. Note that the gradient $\nabla_W R_K$ needs to be computed only once.

A.3 Complexities. We derive here some costs for the evaluation of relevant quantities in the non-cross-validation case, with immediate impact on the algorithmic speed-up. The cost of a projection through equation 2.1 is $O(qdn)$, while a single evaluation of $R_K$ through equations 3.1 and 2.2 requires the computation of all pairwise distances between the projected points in equation 2.4, which is $O(qn^2)$. An approximation via the fast Fourier transform (FFT) (see section 6) can drop this cost to at most $O(qn \log_2 n)$, giving a total cost for evaluating $R_K$ of $O(qn \cdot \max\{d, \log_2 n\})$. If W is parameterized by Givens matrices, there is the additional cost of the multiplications in equation 4.4. Since the last matrix $G_{qd}$ in the sequence is $d \times q$, the multiplications must be carried out from right to left, giving total complexity for the product $O(q(d-q) \cdot qd^2)$. This makes a single evaluation of $R_K$ using the Givens parametrization and the FFT $O(q \cdot \max\{(d-q)qd^2, nd, n \log_2 n\})$. For the derivatives, we do not have results using FFT, so the following applies to the non-FFT case. Constrained optimization requires the evaluation of the gradient $\nabla_W R_K$ in equation A.5, which, for the common case $n > d$, has cost $O(qn^2)$. This is achieved if we carry out the multiplications in equation A.5 from right to left,
$$\nabla_W R_K = \frac{1}{nh_y^2} X^T \big( [B + B^T - \mathrm{diag}(\mathbf{1}^T B)] Y \big), \tag{A.11}$$
and since q (the number of columns of Y) < d (the number of columns of X). In the unconstrained optimization case, the cost of computing $\partial W / \partial \theta_{kl}$ in equation A.7 for a single angle equals the cost of evaluating W in equation 4.4, which is $O(q^2(d-q)d^2)$. Since the trace in equation A.10 has negligible cost $O(qd)$, and the cost of $\nabla_W R_K$ above is $O(qn^2)$, we conclude that the cost of computing all $q(d-q)$ derivatives in equation A.10 is $O(q \cdot \max\{n^2, q^2(d-q)^2 d^2\})$. This suggests that a sensible dimension to compress
the data with sphering is approximately $\sqrt{n}$. However, we should note that the above complexities are in fact lower when the algorithm is implemented in software specializing in matrix computations.

Acknowledgments
We are indebted to R. Bunschoten for providing the data sets and assisting in the experiments. We thank J. Portegies Zwart and J. J. Verbeek for inspiring discussions, K. Fukumizu for bringing Li (1991) to our attention, and the anonymous reviewers for their motivating comments. This work is supported by the Real World Computing program.

References

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B, 39, 1–38.
Gill, P. E., Murray, W., & Wright, M. (1981). Practical optimization. London: Academic Press.
Hall, P. (1989). On projection pursuit regression. Ann. Statist., 17(2), 573–588.
Härdle, W., Hall, P., & Ichimura, H. (1993). Optimal smoothing in single-index models. Ann. Statist., 21(1), 157–178.
Hastie, T., & Stuetzle, W. (1989). Principal curves. J. Amer. Statist. Assoc., 84, 502–516.
Huber, P. J. (1985). Projection pursuit (with discussion). Ann. Statist., 13, 435–525.
Li, K.-C. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc., 86(414), 316–342.
Nayar, S. K., Nene, S. A., & Murase, H. (1996). Subspace methods for robot vision. IEEE Trans. Robotics and Automation, 12(5), 750–758.
Press, W. H., Teukolsky, S. A., Flannery, B. P., & Vetterling, W. T. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Principe, J. C., & Xu, D. (1999). Information-theoretic learning using Rényi's quadratic entropy. In Int. Workshop on Independent Component Analysis and Blind Separation of Signals (pp. 407–412). Aussois, France.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Silverman, B. W. (1982). Kernel density estimation using the fast Fourier transform. Appl. Statist., 31, 93–99.
Spivak, M. (1979). A comprehensive introduction to differential geometry, Vol. II (2nd ed.). Berkeley, CA: Publish or Perish.
Stone, C. J. (1977). Consistent nonparametric regression (with discussion). Ann. Statist., 5, 595–645.
Thrun, S. (1998). Bayesian landmark learning for mobile robot localization. Machine Learning, 33(1).
Thrun, S. (2000). Probabilistic algorithms in robotics (Tech. Rep. No. CMU-CS-00-126). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
Torkkola, K., & Campbell, W. (2000). Mutual information in learning feature transformations. In Proc. Int. Conf. on Machine Learning. Stanford, CA.
Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. London: Chapman & Hall.
Received February 29, 2000; accepted April 3, 2001.
LETTER
Communicated by Naftali Tishby
Clustering Based on Conditional Distributions in an Auxiliary Space

Janne Sinkkonen
janne.sinkkonen@hut.fi
Samuel Kaski
samuel.kaski@hut.fi
Neural Networks Research Centre, Helsinki University of Technology, FIN-02015 HUT, Finland

Neural Computation 14, 217–239 (2001)
© 2001 Massachusetts Institute of Technology

We study the problem of learning groups or categories that are local in the continuous primary space but homogeneous by the distributions of an associated auxiliary random variable over a discrete auxiliary space. Assuming that variation in the auxiliary space is meaningful, categories will emphasize similarly meaningful aspects of the primary space. From a data set consisting of pairs of primary and auxiliary items, the categories are learned by minimizing a Kullback-Leibler divergence-based distortion between (implicitly estimated) distributions of the auxiliary data, conditioned on the primary data. Still, the categories are defined in terms of the primary space. An online algorithm resembling traditional Hebb-type competitive learning is introduced for learning the categories. Minimizing the distortion criterion turns out to be equivalent to maximizing the mutual information between the categories and the auxiliary data. In addition, connections to density estimation and to the distributional clustering paradigm are outlined. The method is demonstrated by clustering yeast gene expression data from DNA chips, with biological knowledge about the functional classes of the genes as the auxiliary data.

1 Introduction

Clustering algorithms and their goals vary, but it is common to aim at clusters that are relatively homogeneous while data in different clusters are dissimilar. The results depend totally on the criterion of similarity. The difficult problem of selecting a suitable criterion is commonly addressed by feature extraction and variable selection methods that define a metric in the data space. Recently, metrics have also been derived by fitting a generative model to the data and using information-geometric methods for extracting a metric from the model (Hofmann, 2000; Jaakkola & Haussler, 1999; Tipping, 1999). We study the related case in which additional useful information exists about the data items during the modeling process. The information is
available as auxiliary samples $c_k$; they form pairs $(x_k, c_k)$ with the primary samples $x_k$. In this article, $x_k \in \mathbb{R}^n$, and the $c_k$ are multinomial. The extra information may, for example, be labels of functional classes of genes, as in our case study. It is assumed that differences in the auxiliary data indicate what is important in the primary data space. More precisely, the difference between samples $x_k$ and $x_l$ is significant if the corresponding values $c_k$ and $c_l$ are different. The usefulness of this assumption depends, of course, on the choice of the auxiliary data. Since the relationship between the auxiliary data and the primary data is stochastic, we get a better description of the difference between values $x$ and $x'$ by measuring differences between the distributions of c, given $x$ and $x'$. The conditional densities $p(c \mid x)$ and $p(c \mid x')$ are not known, however. Only the set of sample pairs $\{(x_k, c_k)\}_k$ is available. Because our aim is to minimize within-cluster dissimilarities, the clusters should be homogeneous in terms of the (estimated) distributions $p(c \mid x)$. In order to retain the potentially useful structure of the primary space, we use the auxiliary data only to indicate importance and define the clusters in terms of localized basis functions within the primary space. Such a clustering can then be used later for new samples from the primary data space even when the corresponding auxiliary samples are not available. Very loosely speaking, our aim is to preserve the topology of the primary space but measure distances by similarity in the auxiliary space. Clearly, almost any kind of paired data is applicable, but only good auxiliary data improve clustering. If the auxiliary data are closely related to the goal of the clustering task, as, for example, a performance index would be, then the auxiliary data guide the clustering to emphasize the important dimensions of the primary data space and to disregard the rest.
This automatic relevance detection is the main practical motivation for this work. We previously constructed a local metric in the primary space that measures distances in that space by approximating the corresponding differences between conditional distributions $p(c \mid x)$ in the auxiliary space (Kaski, Sinkkonen, & Peltonen, in press). The metric can be used for clustering, and then maximally homogeneous clusters in terms of the conditional distributions appear. In this work, we introduce an alternative method that, contrary to the approach generating an explicit metric, does not need an estimate of the conditional distributions $p(c \mid x)$ as an intermediate step. We additionally show that minimizing the within-cluster distortion is equivalent to maximizing the mutual information between the basis functions used for defining the clusters (interpreted as a multinomial random variable) and the auxiliary data. Maximization of mutual information has been previously used for constructing neural representations (Becker, 1996; Becker & Hinton, 1992). Other related works and paradigms include learning from (discrete) dyadic data (Hofmann, Puzicha, & Jordan, 1998) and
distributional clustering (Pereira, Tishby, & Lee, 1993) with the information bottleneck (Tishby, Pereira, & Bialek, 1999) principle.

2 Clustering Based on the Kullback-Leibler Divergence
We seek to cluster items x of the data space by using the information within a set of pairs $(x_k, c_k)$ of data. The set consists of paired samples of two random variables. The vector-valued random variable X takes values $x \in \mathcal{X} \subset \mathbb{R}^n$, and the $c_k$ (or sometimes just c) are values of the multinomial random variable C. We wish to keep the clusters local with respect to x but measure similarities between the samples x by the differences of the corresponding conditional distributions $p(c \mid x)$. These distributions are unknown and will be implicitly estimated from the data. Vector quantization (VQ) or, equivalently, K-means clustering, is one approach to categorization. In VQ, the goal is to minimize the average distortion E between the data and the prototypes or code book vectors $m_j$, defined by

$$E = \sum_j \int y_j(x) \, D(x, m_j) \, p(x) \, dx. \tag{2.1}$$
Here, $D(x, m_j)$ denotes the measure of distortion between x and $m_j$, and $y_j(x)$ is the cluster membership function that fulfills $0 \le y_j(x) \le 1$ and $\sum_j y_j(x) = 1$. In the classic "hard" vector quantization, the membership function is binary valued: $y_j(x) = 1$ if $D(x, m_j) \le D(x, m_i)$ for all i, and $y_j(x) = 0$ otherwise. Such functions define a partitioning of the space into discrete cells, Voronoi regions, and the goal of learning is to find the partitioning that minimizes the average distortion. If the membership functions $y_j(x)$ may attain any values between zero and one, the approach may be called soft vector quantization (Nowlan, 1990, has studied a maximum likelihood solution). We measure distortions D as differences between the distributions $p(c \mid x)$ and model distributions. The measure of the differences will be the Kullback-Leibler divergence, defined for two discrete-valued distributions with event probabilities $\{p_i\}$ and $\{\psi_i\}$ as $D_{KL}(p, \psi) \equiv \sum_i p_i \log(p_i / \psi_i)$. In our case, the first distribution is the multinomial distribution in the auxiliary space that corresponds to the data x, that is, $p_i \equiv p(c_i \mid x)$. The second distribution is the prototype; let us denote the jth prototype by $\psi_j$. When the Kullback-Leibler distortion measure is plugged into equation 2.1, the error function of VQ, the average distortion becomes

$$E_{KL} = \sum_j \int y_j(x) \, D_{KL}\big(p(c \mid x), \psi_j\big) \, p(x) \, dx. \tag{2.2}$$
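A finite-sample version of the average distortion in equation 2.2 replaces the integral over $p(x)$ by an average over sample points. A hedged NumPy sketch; the toy distributions, memberships, and function names below are ours:

```python
import numpy as np

def kl(p, psi):
    """Kullback-Leibler divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / psi)))

def avg_distortion(P, Y, Psi):
    """Sample version of the average distortion in equation 2.2.
    P:   (n, Nc) rows are p(c | x_k) at the n sample points,
    Y:   (n, K)  membership values y_j(x_k),
    Psi: (K, Nc) prototype distributions."""
    n, K = Y.shape
    return sum(Y[k, j] * kl(P[k], Psi[j])
               for k in range(n) for j in range(K)) / n

P = np.array([[0.9, 0.1], [0.2, 0.8]])        # conditional distributions
Psi = np.array([[0.9, 0.1], [0.2, 0.8]])      # prototypes
Y_good = np.array([[1.0, 0.0], [0.0, 1.0]])   # each point on its prototype
Y_bad = np.array([[0.0, 1.0], [1.0, 0.0]])    # swapped assignment
# avg_distortion(P, Y_good, Psi) is 0; the swapped assignment costs more
```

With hard memberships this is exactly the K-means objective with the Kullback-Leibler divergence in place of squared Euclidean distance.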
Instead of the distortions between the vectorial samples and vectorial prototypes as in equation 2.1, we now compute point-wise distortions between
the distributions $p(c \mid x)$ and the prototypes $\psi_j$. The prototypes are distributions in the auxiliary space. The average distortion will be minimized by parameterizing the functions $y_j(x)$ and optimizing the distortion with respect to the cluster membership parameters and the prototypes. When the parameters of $y_j(x)$ are denoted by $\theta_j$, the average distortion can be written as

$$E_{KL} = -\sum_{i,j} \int [y_j(x; \theta_j) \log \psi_{ji}] \, p(c_i, x) \, dx + \text{const.}, \tag{2.3}$$
where the constant is independent of the parameters. The membership functions $y_j(x; \theta_j)$ can be interpreted as conditional densities $p(v_j \mid x) \equiv y_j(x)$ of a multinomially distributed random variable V that indicates the cluster identity. The value of the random variable V will be denoted by $v \in \{v_j\}$, and the value of the random variable C corresponding to the multinomially distributed auxiliary distribution will be denoted by $c \in \{c_i\}$. Given x, the choice of the cluster v does not depend on the c. In other words, C and V are conditionally independent: $p(c, v \mid x) = p(c \mid x)\,p(v \mid x)$. It follows that $p(c, v) = \int p(c \mid x)\,p(v \mid x)\,p(x)\,dx$. It can be shown (see appendix A) that if the membership distributions are of the normalized exponential form

$$y_j(x; \theta_j) = \frac{\exp f(x; \theta_j)}{\sum_l \exp f(x; \theta_l)}, \tag{2.4}$$
then the gradient of EKL with respect to the parameters µl becomes @EKL @µ l
D
X Z @f ( x I µ l ) i, j
@µ l
log
yji p ( ci , vj , vl , x ) dx . Ã li
(2.5)
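The normalized exponential form of equation 2.4 is a softmax over the basis-function outputs $f(x; \theta_l)$. A short sketch; the gaussian choice of f anticipates equation 2.10 below, and the toy numbers are ours:

```python
import numpy as np

def memberships(x, thetas, sigma=1.0):
    """Normalized exponential memberships of equation 2.4, with the gaussian
    basis functions f(x; theta) = -||x - theta||^2 / (2 sigma^2)."""
    f = -np.sum((x - thetas) ** 2, axis=1) / (2 * sigma ** 2)
    e = np.exp(f - f.max())     # subtract the max for numerical stability
    return e / e.sum()

thetas = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y = memberships(np.array([0.1, 0.0]), thetas)
# y sums to one, and the basis function closest to x gets the largest weight
```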
The prototypes $\psi_j$ are probabilities of multinomial distributions, and therefore they must fulfill $0 \le \psi_{ji} \le 1$ and $\sum_i \psi_{ji} = 1$. We will incorporate these conditions into our model by reparameterizing the prototypes as follows:

$$\log \psi_{ji} \equiv \gamma_{ji} - \log \sum_m e^{\gamma_{jm}}. \tag{2.6}$$
The gradient of the average distortion, equation 2.3, with respect to the new parameters of the prototypes is

$$\frac{\partial E_{KL}}{\partial \gamma_{lm}} = \sum_i \int (\psi_{lm} - \delta_{mi}) \, p(c_i, v_l, x) \, dx, \tag{2.7}$$
where the Kronecker symbol $\delta_{mi} = 1$ when $m = i$, and $\delta_{mi} = 0$ otherwise.
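The reparameterization of equation 2.6 makes the constraints on $\psi$ automatic, and the integrand of equation 2.7 is simply the softmax gradient. A small NumPy check (toy values are ours) that $\psi_{lm} - \delta_{mi}$ is the derivative of $-\log \psi_{li}$ with respect to $\gamma_{lm}$:

```python
import numpy as np

def psi_row(gamma_l):
    """Equation 2.6 for one prototype: psi_li = exp(gamma_li)/sum_m exp(gamma_lm)."""
    e = np.exp(gamma_l - gamma_l.max())
    return e / e.sum()

gamma_l = np.array([0.2, -0.4, 1.1])
i = 1                                        # observed auxiliary value c_i
analytic = psi_row(gamma_l) - np.eye(3)[i]   # psi_lm - delta_mi

# compare against a central-difference derivative of -log psi_li
eps = 1e-6
numeric = np.array([
    (-np.log(psi_row(gamma_l + eps * e)[i])
     + np.log(psi_row(gamma_l - eps * e)[i])) / (2 * eps)
    for e in np.eye(3)
])
# analytic and numeric agree to numerical precision
```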
The average distortion can be minimized with stochastic approximation, by sampling from $y_j(x)\,y_l(x)\,p(c_i, x) = p(v_j, v_l, c_i, x)$. This leads to an online algorithm in which the following steps are repeated for $t = 0, 1, \ldots$, with $\alpha(t)$ gradually decreasing toward zero:

1. At step t of the stochastic approximation, draw a data sample $(x(t), c(t))$. Assume that the value of $c(t)$ is $c_i$. This defines the value of i in the following steps.

2. Draw two basis functions, j and l, from the multinomial distribution with probabilities $\{y_k(x(t))\}_k$.

3. Adapt the parameters $\theta_l$ and $\gamma_{lm}$, $m = 1, \ldots, N_c$, by

$$\theta_l(t+1) = \theta_l(t) - \alpha(t) \left[ \frac{\partial f(x; \theta_l)}{\partial \theta_l} \log \frac{\psi_{ji}}{\psi_{li}} \right]_{\theta_l = \theta_l(t)} \tag{2.8}$$

$$\gamma_{lm}(t+1) = \gamma_{lm}(t) - \alpha(t)(\psi_{lm} - \delta_{mi}), \tag{2.9}$$
where $N_c$ is the number of possible values of the random variable C. Due to the symmetry between j and l, it is possible to adapt the parameters twice for one t by swapping j and l in equations 2.8 and 2.9 for the second adaptation. Note that $\theta_l(t+1) = \theta_l(t)$ if $j = l$. In stochastic approximation, the $\alpha(t)$ should fulfill the conditions $\sum_t \alpha(t) = \infty$ and $\sum_t \alpha^2(t) < \infty$. In practice, we have used piecewise-linear decreasing schedules.

We will consider two special cases. In the demonstrations in Figure 1, the basis functions are normalized gaussians in the Euclidean space $\mathcal{X} = \mathbb{R}^n$. In the second case in section 3, gene expression data mapped onto a hypersphere, $\mathcal{X} = S^n$, are clustered by using normalized von Mises-Fisher distributions (Mardia, 1975) as the basis functions. For gaussians parameterized by their locations $\theta_l$ and having a diagonal covariance matrix in which the variance $\sigma^2$ is equal in each dimension, $f(x; \theta_l) = -\|x - \theta_l\|^2 / 2\sigma^2$, and

$$\frac{\partial f(x; \theta_l)}{\partial \theta_l} = \frac{1}{\sigma^2}(x - \theta_l). \tag{2.10}$$
The von Mises-Fisher (vMF) distribution is an analog of the gaussian distribution on a hypersphere (Mardia, 1975) in that it is the maximum entropy distribution when the first two moments are fixed. The density of an n-dimensional vMF distribution is

$$\mathrm{vMF}(x; \theta) = \frac{1}{Z_n(\kappa)} \exp\left(\kappa \frac{x^T \theta}{\|\theta\|}\right). \tag{2.11}$$
Here, the parameter vector $\theta$ represents the mean direction vector. The normalizing coefficient $Z_n(\kappa) \equiv (2\pi)^{\frac{1}{2}n} I_{\frac{1}{2}n-1}(\kappa) / \kappa^{\frac{1}{2}n-1}$ is not relevant here,
Figure 1: The location parameters $\theta_j$ (small circles) of the gaussian basis functions of two models, optimized for two-class ($n(c) = 2$), two-dimensional, and three-dimensional data sets. The shades of gray at the background depict densities from which the data were sampled. (A) Two-dimensional data, with roughly gaussian $p(x)$. The inset shows the conditional density $p(c_0 \mid x)$, which is monotonically decreasing as a function of the y-dimension. The model has learned to represent only the dimension on which $p(c \mid x)$ changes. (B, C) Two projections of three-dimensional data, with a symmetric gaussian $p(x)$ (ideally, the form of this distribution should not affect the solution). The insets show the conditional density $p(c_0 \mid x)$, which decreases monotonically as a function of a two-dimensional radius and stays constant with respect to the orthogonal third dimension z. The one-dimensional cross section describing $p(c_0, x)$ and $p(c_1, x)$ as a function of the two-dimensional radius is shown in the inset of C. The model has learned to represent only variation in the direction of the radius and along the dimensions x and y, and discards the dimension z as irrelevant.
but it will be used in the mixture model of section 3. The function $I_r(\kappa)$ is the modified Bessel function of the first kind and order r. In the clustering algorithm described, we use vMF basis functions,

$$f(x; \theta_l) = \kappa\, x^T \theta_l / \|\theta_l\|, \tag{2.12}$$

with a constant dispersion $\kappa$. Then,

$$\frac{\partial f(x; \theta_l)}{\partial \theta_l} = \kappa \big(x - x^T\theta_l\, \theta_l / \|\theta_l\|^2\big) / \|\theta_l\|.$$

The norm $\|\theta_l\|$ does not affect f, and we may normalize $\theta_l$, whereby the gradient becomes

$$\frac{\partial f(x; \theta_l)}{\partial \theta_l} = \kappa (x - x^T\theta_l\, \theta_l). \tag{2.13}$$
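For a unit-norm $\theta_l$, equations 2.12 and 2.13 take one line each. A sketch that also checks equation 2.13 against a numerical derivative; the value of $\kappa$ and the random vectors are our own test values:

```python
import numpy as np

kappa = 3.0

def f(x, theta):
    """vMF basis function of equation 2.12 (valid for any norm of theta)."""
    return kappa * (x @ theta) / np.linalg.norm(theta)

def grad_f(x, theta):
    """Equation 2.13, valid for unit-norm theta."""
    return kappa * (x - (x @ theta) * theta)

rng = np.random.default_rng(3)
x = rng.normal(size=3)
theta = rng.normal(size=3)
theta /= np.linalg.norm(theta)    # normalize theta as the text prescribes

g = grad_f(x, theta)
# the gradient is tangential to the sphere: g . theta = 0
eps = 1e-6
num = np.array([(f(x, theta + eps * e) - f(x, theta - eps * e)) / (2 * eps)
                for e in np.eye(3)])
# num agrees with g, confirming that normalizing theta leaves f unchanged
```

The tangency of the gradient is what makes the renormalization after each step in the convergence result below harmless to first order.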
It can be shown (Kaski, 2000) that the stochastic approximation algorithm defined by equations 2.8, 2.9, and 2.13 converges with probability one when the $\theta_j$ are normalized after each step.

2.1 Connections to Mutual Information and Density Estimation. It can easily be shown (using the information inequality) that at the minimum of the distortion $E_{KL}$, the prototype $\psi_l$ takes the form

$$\psi_{li} = p(c_i \mid v_l). \tag{2.14}$$
Hence, the distortion 2.3 can be expressed as

$$E_{KL} = -\sum_{i,j} p(c_i, v_j) \log \frac{p(c_i, v_j)}{p(c_i)\,p(v_j)} + \text{const.} = -I(C; V) + \text{const.}, \tag{2.15}$$
$I(C; V)$ being the mutual information between the random variables C and V. Thus, minimizing the average distortion $E_{KL}$ is equivalent to maximizing the mutual information between the auxiliary variable C and the cluster memberships V.

Minimization of the distortion has a connection to density estimation as well; the details are described in appendix B. It can be shown that minimization of $E_{KL}$ minimizes an upper limit of the mean Kullback-Leibler divergence between the real distribution $p(c \mid x)$ and a certain estimate $\hat{p}(c \mid x)$. The estimate is

$$\hat{p}(c_i \mid x) = \frac{1}{Z(x)} \exp \sum_j y_j(x; \theta_j) \log \psi_{ji}, \tag{2.16}$$

where $Z(x)$ is a normalizing coefficient selected such that $\sum_i \hat{p}(c_i \mid x) = 1$.
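The estimate of equation 2.16 is a membership-weighted geometric mean of the prototypes, renormalized. A short sketch; the toy prototypes and membership vectors are ours:

```python
import numpy as np

def p_hat(y, Psi):
    """Equation 2.16: estimate p_hat(c_i | x) from the membership vector
    y (K,) and the prototype distributions Psi (K, Nc)."""
    log_p = y @ np.log(Psi)     # sum_j y_j(x; theta_j) log psi_ji
    log_p -= log_p.max()        # for numerical stability
    p = np.exp(log_p)
    return p / p.sum()          # division by Z(x)

Psi = np.array([[0.9, 0.1], [0.3, 0.7]])
p_crisp = p_hat(np.array([1.0, 0.0]), Psi)   # crisp membership in cluster 0
p_soft = p_hat(np.array([0.5, 0.5]), Psi)    # renormalized geometric mean
# with crisp membership the estimate reduces to the prototype itself
```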
For gaussian basis functions, the upper limit becomes tight when $\sigma$ approaches zero and the cluster membership functions become binary. Then the cluster membership functions approach indicator functions of Voronoi regions. Note that the solution cannot be computed in practice for $\sigma = 0$; smooth vector quantization with $\sigma > 0$ results in a compromise in which the solution is tractable, but only an upper limit of the error will be minimized. Furthermore, the mean divergence can be expressed in terms of the divergence of joint distributions as follows:

$$E_X\{D_{KL}(p(c \mid x), \hat{p}(c \mid x))\} = D_{KL}(p(c, x), \hat{p}(c \mid x)\, p(x)). \tag{2.17}$$

Here $E_X\{\cdot\}$ denotes the expectation over values of X, and the latter $D_{KL}$ is the divergence of the joint distribution. The expression 2.17 is a cost function of an estimate of the conditional probability $p(c \mid x)$. Intuitively speaking, by minimizing equation 2.17, resources are not wasted in estimating the marginal $p(x)$, but all resources are concentrated on estimating $p(c \mid x)$. It can further be shown (see appendix B) that maximum likelihood estimation of the model $\hat{p}(c \mid x)$ using a finite data set is asymptotically equivalent to minimizing equation 2.17.

2.2 Related Works
2.2.1 Competitive Learning. The early work on competitive learning or adaptive feature detectors (Didday, 1976; Grossberg, 1976; Nass & Cooper, 1975; Pérez, Glass, & Shlaer, 1975) has a close connection to vector quantization (Gersho, 1979; Gray, 1984; Makhoul, Roucos, & Gish, 1985) and K-means clustering (Forgy, 1965; MacQueen, 1967). The neurons in a competitive-learning network are parameterized by vectors describing the synaptic weights, denoted by $m_j$ for neuron j. In the simplest models, the activity of a neuron due to external inputs is a nonlinear function f of the inputs x multiplied by the synaptic weights, $f(x^T m_j)$. The activities of the neurons compete; the activity of each neuron reduces the activity of the others by negative feedback. If the competition is of the winner-take-all type (Kaski & Kohonen, 1994), only the neuron with the largest $f(x^T m_j)$ remains active. Each neuron therefore functions as a feature detector that detects whether the input comes from a particular domain of the input space. During Hebbian-type learning, the neurons gradually specialize in representing different types of domains. (For recent, more detailed accounts, see Kohonen, 1984, 1993; Kohonen & Hari, 1999.) Although the winner is usually defined in terms of inner products, it is possible to generalize the model to an arbitrary metric. If the usual Euclidean metric is used, the learning corresponds to minimization of a mean-squared vector quantization distortion or, equivalently, minimization of the distance
to the closest cluster center in K-means clustering. The domain of the input space that a neuron detects can hence be interpreted as a Voronoi region in vector quantization. The relationship of competitive learning to our work is that the "cluster membership functions" $y_j(x; \theta_j)$ in section 2 may be interpreted as the outputs of a set of neurons, and in the limit of crisp membership functions (for gaussians, $\sigma \to 0$), only one neuron (the one having the largest external input) is active and can be interpreted as the winner. After learning, our algorithm therefore corresponds to a traditional competitive network. The learning procedure makes the difference by making the network detect features that are as homogeneous as possible with regard to the auxiliary data. The learning algorithm has a potentially interesting relation to Hebbian or competitive learning as well. Assume that at most two of the neurons may become active at a time, with probabilities $y_j(x; \theta_j)$. Then the learning algorithm, equation 2.8, for vMF kernels reads
$$\theta_l(t+1) = \theta_l(t) + \alpha(t)(x - x^T\theta_l\,\theta_l)(\log \psi_{li} - \log \psi_{ji})$$

(neuron j is adapted at the same time, swapping j and l). If the activity of the neurons is binary valued, that is, the neurons j and l have activity value one and the others value zero, then the adaptation rule for any neuron k can be expressed by
$$\theta_k(t+1) = \theta_k(t) + \alpha(t)\,g_k(t)(x - x^T\theta_k\,\theta_k)(\log \psi_{ki} - \log \psi_{ji}). \tag{2.18}$$
Here $g_k(t)$ denotes the activity of the neuron k. The term $g_k(t)\,x$ is Hebbian, whereas $g_k(t)\,x^T\theta_k\,\theta_k$ is a kind of forgetting term (cf. Kohonen, 1984; Oja, 1982). The difference from common competitive learning then lies within the last parentheses in equation 2.18. The parameter vector of the neuron of the active pair (j, l) that better represents the class i has a larger value of $\log \psi$ and is moved toward the current sample, whereas the other neuron is moved away from the sample. (Note also the similarity to the learning vector quantization algorithms; see, e.g., Kohonen, 1995.) Note that for normalized gaussian membership functions $y_j(x; \theta_j)$ and $\sigma \neq 0$, our model is a kind of variant of gaussian mixture models or soft vector quantization. At the limit of crisp feature detectors (for gaussians, $\sigma \to 0$), the output of the network reduces to a 1-of-C-coded discrete value. Similarly, the outputs of the soft version can be interpreted as probability density functions of a multinomial random variable. Such an interpretation has already been made by Becker in some of her work, discussed in section 2.2.2.

2.2.2 Multinomial Variables. Becker et al. (Becker & Hinton, 1992; Becker, 1996) have introduced a learning goal for neural networks called Imax. Their networks consist of two separate modules having different inputs,
and the learning algorithms aim at maximizing the mutual information between the outputs of the modules. For example, if the inputs are two-dimensional arrays of random dots with stereoscopic displacements simulating the views of two eyes, the networks are able to infer depth from the data. The variant called discrete Imax (Becker, 1996) is closely related to the clustering algorithm of this article. In Imax, the outputs of the neurons in each module are interpreted as the probabilities of a multinomial random variable, and the goal of learning is to maximize the mutual information between the variables of the two modules. Our model differs from Becker's in two ways. First, Becker uses (normalized) $\exp(x^T \theta_j)$ as basis functions, whereas our parameterization makes the basis functions invariant to the norms of $\theta_j$ (cf. equation 2.11). Without such invariance, the units with the largest norms may dominate the representation, a phenomenon that Becker noted as well. The other difference is that Becker optimizes the model using gradient descent based on the whole batch of input vectors, whereas we have a simple online algorithm, 2.18, adapting on the basis of one data sample at a time. The gradient of the discrete Imax with respect to the parameters $\theta_l$ is, after simplification and in our notation,

$$\frac{\partial I}{\partial \theta_j} = \sum_{i,j} \int x \log \frac{p(c_i \mid v_j)}{p(c_i \mid v_l)} \, p(v_j, v_l, c_i, x) \, dx. \tag{2.19}$$
It would be possible to apply stochastic approximation here by sampling from p(v_j, v_l, c_i, x), which leads to an adaptation rule different from ours. Becker (1996) has also used gaussian basis functions, but with some approximations, ending up with a different formula for the gradient.

2.2.3 Continuous Variables. The mutual information between continuously valued outputs of two neurons can be maximized as well (Becker & Hinton, 1992; Becker, 1996). Some assumptions about the continuously valued signals and the noise have to be made, however. In Becker and Hinton (1992), the outputs were assumed to consist of gaussian signals corrupted by independent, additive gaussian noise. In this article, the multinomial Imax has been reinterpreted as (soft) vector quantization in the Kullback-Leibler "metric," in which the distance is measured in an auxiliary space. In neural terms, the model builds a representation of the input space: each neuron is a detector specialized to represent a certain domain. In contrast, the continuous version tries to represent the input space by generating a parametric transformation to a continuous, lower-dimensional output space. If the parameterization and the assumptions about the noise are correct, the continuous representations are potentially more accurate. The advantage of the quantized representation is that no such assumptions need to be made; the model is almost purely
Clustering in an Auxiliary Space
227
data driven (semiparametric) and, of course, very useful for clustering-type applications.1 It is particularly difficult to maximize the mutual information if each module has several continuously valued outputs. In some recent works (Fisher & Principe, 1998; Torkkola & Campbell, 2000), the Shannon entropy has been replaced by the quadratic Rényi entropy, yielding simpler formulas for the mutual information.

2.2.4 Information Bottleneck and Distributional Clustering. In distributional clustering works (Pereira et al., 1993) with the information bottleneck principle (Tishby et al., 1999), mutual information between two discrete variables has been maximized. Tishby et al. get their motivation from the rate distortion theory of Shannon and Kolmogorov (see Cover & Thomas, 1991, for a review). In rate distortion theory, the aim is to find an optimal code book for a set of discrete symbols when a "cost" in the form of a distortion function describing the effects of a transmission line is given. In our notation, the authors consider the problem of building an optimal representation V for a discrete random variable X. In rate distortion theory, a real-valued distortion function d(x, v) is assumed known, and I(X; V) is minimized with respect to the representation (or conventionally, the code book) p(v | x) subject to the constraint E_{X,V}{d(x, v)} < k. At the minimum, the conditional distributions defining the code book are

$$p(v_l \mid x) = \frac{p(v_l)\,\exp[-\beta d(x, v_l)]}{\sum_j p(v_j)\,\exp[-\beta d(x, v_j)]}, \tag{2.20}$$
where β depends on k. The authors realized that if the average distortion E_{X,V}{d(x, v)} is replaced by the mutual information −I(C; V), then rate distortion theory gives a solution that captures as much information about the "relevance variable" C as possible. Here, the multinomial random variable C has the same role as our auxiliary data. The functional to be minimized becomes I(X; V) − βI(C; V), and its variational optimization with respect to the conditional densities p(v | x) leads to the solution 2.20 with

$$d(x, v_j) = D_{KL}\bigl(p(c \mid x),\, p(c \mid v_j)\bigr). \tag{2.21}$$
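For a finite X, the self-consistent pair 2.20 and 2.21 can be iterated directly. A minimal sketch under the assumption of a small discrete problem (all function and variable names are illustrative; the continuous case needs the parameterization discussed later in this section):

```python
import math
import random

def kl(p, q):
    """Kullback-Leibler divergence in nats, with a small floor for safety."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)

def bottleneck(p_c_given_x, p_x, n_clusters, beta, iters=100, seed=0):
    """Iterate p(v|x) proportional to p(v) exp(-beta KL(p(c|x), p(c|v)))
    (cf. equations 2.20-2.21) together with the consistency conditions
    for p(v) and p(c|v)."""
    rng = random.Random(seed)
    n_x, n_c = len(p_c_given_x), len(p_c_given_x[0])
    pvx = []
    for _ in range(n_x):  # random soft initialization of p(v|x)
        w = [rng.random() for _ in range(n_clusters)]
        s = sum(w)
        pvx.append([wi / s for wi in w])
    for _ in range(iters):
        pv = [sum(pvx[x][v] * p_x[x] for x in range(n_x)) for v in range(n_clusters)]
        pcv = [[sum(pvx[x][v] * p_x[x] * p_c_given_x[x][c] for x in range(n_x))
                / max(pv[v], 1e-12) for c in range(n_c)] for v in range(n_clusters)]
        for x in range(n_x):
            w = [pv[v] * math.exp(-beta * kl(p_c_given_x[x], pcv[v]))
                 for v in range(n_clusters)]
            s = sum(w)
            pvx[x] = [wi / s for wi in w]
    return pvx
```

Note that after one iteration, two inputs with identical conditionals p(c | x) receive identical memberships, since the update depends on x only through p(c | x).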
Together, these two equations give a characterization of the optimal representation V once we accept the criterion I(X; V) − βI(C; V) for the goodness of the representation. The characterization is self-referential through p(c | v) and therefore does not in itself present an algorithm for finding the p(v | x)

1 We have made some assumptions by parameterizing the f(x; θ) in equation 2.4. However, the model becomes semiparametric as a scale parameter, similar to the σ for gaussians and 1/κ for the vMF kernels, approaches zero.
and p(c | v), but Tishby et al. (1999) introduced an algorithm for finding a solution in the case of a multinomial X. Like the method presented in this article, the bottleneck aims at revealing nonlinear relationships between two data sets by maximizing a mutual information-based cost function. The relation between the two approaches is that although we started from the clustering viewpoint, our error criterion E_KL turned out to be equivalent to the (negative) mutual information I(C; V). The bottleneck has an additional term in its error function for keeping the complexity of the representation low, while the complexity of our clusters is restricted by their number and their parameterization. The most fundamental difference between our clustering approach and the published bottleneck works, however, arises from the continuity of our random variable X. The theoretical form of the bottleneck principle, equation 2.21, is not limited to discrete or finite spaces. To our knowledge, however, no continuous applications of the principle have so far been published. For a continuous X, the distortion d(x, v) in equation 2.21 cannot be readily evaluated without some additional assumptions, such as restrictions on the form of the cluster memberships p(v | x). Our solution is to parameterize p(v | x), which allows us to optimize the partitioning of the data space X into (soft) clusters.2

3 Case Study: Clustering of Gene Expression Data
We tested our approach by clustering a large, high-dimensional data set: expressions of the genes of the budding yeast Saccharomyces cerevisiae in various experimental conditions. Such measurements, obtained from so-called DNA chips, are used in functional genomics to infer similarity of function of different genes. There are two popular approaches to analyzing expression data: traditional clustering methods (see, e.g., Eisen, Spellman, Brown, & Botstein, 1998) and supervised classification methods (support vector machines; Brown et al., 2000). In this case study, we intend to show that our method combines the strengths of both approaches. For the majority of yeast genes, there exists a functional classification based on biological knowledge. The goal of the supervised classifiers is to learn this classification in order to predict functions for new genes. The classifiers may additionally be useful in that the errors they make on the genes having known classes may suggest that the original functional classification has errors. The traditional unsupervised clustering methods group solely on the basis of the expression data and do not use the known functional classes.
2 Alternatively, instead of solving a continuous problem, X could be (suboptimally) partitioned into predefined clusters, after which the standard distributional clustering algorithms are applicable.
Hence, they are applicable to sets of genes without a known classification, and they may additionally generate new discoveries. There may be hidden similarities between the classes in the hierarchical functional classification, and there may even exist new subclasses that are revealed as more experimental data are collected. The clustering methods can therefore be used as hypothesis-generating machines. The disadvantage of the clustering algorithms is that the results are determined by the metric used for measuring similarity of the expression data. The metric is always somewhat arbitrary unless it is based on a considerable amount of knowledge about the functioning of the genes. Our goal is to use the known functional classification to define implicitly which aspects of the expression data are important. The clusters are local in the expression data space, but the prototypes are placed to minimize the average distortion 2.2 in the space of the functional classes. The difference from supervised classification methods is that while classification methods cannot surpass the original classes, the (supervised) clusters are not tied to the classification and may reveal substructures within and relations between the known functional classes. In this case study, we compare our method empirically with alternative methods and demonstrate its convergence properties and the potential usefulness of the results. More detailed biological interpretation of the results will be presented in subsequent articles. We compared our model with two standard state-of-the-art mixture density models. The first is a totally unsupervised mixture of vMF distributions. The model is analogous to the usual mixture of gaussians; the gaussian mixture components are simply replaced by the vMF components. The model is

$$p(x) = \sum_j p(x \mid v_j)\, p_j, \tag{3.1}$$
where p(x | v_j) = vMF(x; θ_j), and vMF is defined in equation 2.11. The p_j are the mixing parameters. In the second model, mixture discriminant analysis 2 (MDA2; Hastie, Tibshirani, & Buja, 1995), the joint distribution between the functional classes c and the expression data x is modeled by a set of additive components denoted by u_j:

$$p(c_i, x) = \sum_j p(c_i \mid u_j)\, p(x \mid u_j)\, p_j, \tag{3.2}$$
where p(c_i | u_j) and p_j are parameters to be estimated, and p(x | u_j) = vMF(x; θ_j). Both models are fitted to the data by maximizing their log likelihood with the expectation-maximization algorithm (Dempster, Laird, & Rubin, 1977).
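Because all components in 3.1 and 3.2 share the same concentration κ, the vMF normalizing constant cancels from the posterior p(v_j | x), so E-step responsibilities need only the exponents. A sketch (names illustrative), which also exhibits the norm invariance of equation 2.11:

```python
import math

def vmf_responsibilities(x, mus, kappa, mix):
    """Posterior p(v_j | x) for a mixture of von Mises-Fisher components that
    share one concentration kappa; the common normalizer cancels. The component
    direction mus[j] enters only through mus[j] / ||mus[j]|| (cf. eq. 2.11)."""
    logits = []
    for mu, pj in zip(mus, mix):
        norm = math.sqrt(sum(m * m for m in mu))
        dot = sum(xi * mi for xi, mi in zip(x, mu))
        logits.append(math.log(pj) + kappa * dot / norm)
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [wi / s for wi in w]
```

Rescaling any mus[j] leaves the responsibilities unchanged, which is exactly the invariance to the norms of θ_j discussed in section 2.2.2.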
3.1 The Data. Temporal expression patterns of 2476 genes of the yeast were measured with DNA chips in nine experimental settings (for more details, see Eisen et al., 1998; the data are available online at http://rana.stanford.edu/clustering/). Each sample measures the expression level of a gene compared to the expression in a reference state. Altogether, there were 79 time points for each gene, represented below by the feature vector x. The data were preprocessed in the same way as in Brown et al. (2000): by taking logarithms of the individual values and normalizing the length of x to unity. The data were then divided into a training set containing two-thirds of the samples and a test set containing the remaining third. All the reported results except those reported in Table 1 are computed for the test set. The functional classification was obtained from the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD).3 The classification system is hierarchical, and we chose to use the 16 highest-level classes to supervise the clustering. Sample classes include metabolism, transcription, and protein synthesis. Some genes belonged to several classes. Seven genes were removed because of a missing classification at the highest level of the hierarchy.

3.2 The Experiments. We first compared the performance of the three models (the mixture of gaussians, MDA2, and our own) after the algorithms had converged. All models had 8 clusters and were run until there was no doubt about convergence. The mixture of gaussians and MDA2 were run for 150 epochs through the whole data and our model for 4.5 million stochastic iterations (α(t) decreased first with a piecewise-linear approximation to an exponential curve, and then linearly to zero at the end). All models were run three times with different randomized initializations, and the best of the three results was chosen.
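The learning-rate schedule described above can be sketched as follows; the knot positions and the initial rate a0 are illustrative, since the article does not specify them:

```python
def alpha(t, t_total, a0=0.1, ramp_frac=0.1):
    """Piecewise-linear approximation to an exponential decay over the first
    part of training, then a linear decay to zero at the end (illustrative
    breakpoints; the article does not give the exact knots)."""
    t_ramp = int((1.0 - ramp_frac) * t_total)
    if t < t_ramp:
        # knots of a piecewise-linear approximation to a0 * 2**(-4u)
        knots = [(0.0, a0), (0.25, a0 / 2), (0.5, a0 / 4),
                 (0.75, a0 / 8), (1.0, a0 / 16)]
        u = t / t_ramp
        for (u0, v0), (u1, v1) in zip(knots, knots[1:]):
            if u <= u1:
                return v0 + (v1 - v0) * (u - u0) / (u1 - u0)
    # final linear decay to zero
    return (a0 / 16) * (t_total - t) / (t_total - t_ramp)
```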
We measured the quality of the resulting clusterings by the average distortion error or, equivalently, the empirical mutual information. When estimating the empirical mutual information, the table of the joint distributions p(c_i, v_j) is first estimated. In our model, the ith row of the table is updated by p(v_j | x) = ψ_j(x; θ_j) (equation 2.4 with f defined by equation 2.12) for each sample (c_i, x). In the gaussian mixture model, the update is p(v_j | x) = p(x | v_j) p_j / p(x), and in MDA2 it is p(u_j | x) = p(x | u_j) p_j / p(x). After the table p(c_i, v_j) is computed, the performance criterion is obtained from equation 2.15 without the constant.4
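The accumulation of the joint table p(c_i, v_j) described above can be sketched as follows (names illustrative):

```python
def joint_table(samples, membership, n_classes, n_clusters):
    """Estimate the joint p(c_i, v_j): each sample (c, x) adds its soft cluster
    memberships p(v_j | x) to the row of its class c, and the table is
    normalized by the number of samples."""
    table = [[0.0] * n_clusters for _ in range(n_classes)]
    for c, x in samples:
        for j, m in enumerate(membership(x)):
            table[c][j] += m
    n = len(samples)
    return [[v / n for v in row] for row in table]
```

The `membership` argument stands in for whichever of the three update rules above is being evaluated.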
3 http://www.mips.biochem.mpg.de/proj/yeast.
4 The empirical mutual information is an upward-biased estimate of the real mutual information, and the bias grows with decreasing data. Because the size of our sample is rather large and constant across the compared distributions and the number of values of the discrete variables is small, the bias does not markedly affect the results.
Figure 2: Empirical mutual information between the generated gene expression categories and the functional classes of the genes, as a function of the parameter κ, which governs the width of the basis functions. Solid line: our model; dashed line: mixture of vMFs; dotted line: MDA2.
The results are shown in Figure 2 for different values of the parameter κ that governs the width of the vMF kernels, equation 2.11. Our model clearly outperforms the other models for a wide range of kernel widths and produces the best overall performance; the clusters of our model convey more information about the functional classification of the genes than the alternative models do. There is a somewhat surprising side result: the gaussian mixture model is about as good as MDA2, although it does not use the class information at all. The reason probably lies in some special property of the data, since for other data sets MDA2 has outperformed the plain mixture model. Next we demonstrate the number of iterations required for convergence. The empirical mutual information is plotted in Figure 3 as a function of the number of iterations. In our model, the schedule for decreasing the coefficient α(t) was the same in each run, stretched to cover the number of iterations and decaying to zero at the end. The number of complete data epochs for MDA2 was made comparable to the number of stochastic iterations by multiplying it by the number of data samples and the number of kernels, divided by two (in our algorithm, two kernels are updated at each iteration step). The performance of MDA2 attains its maximum quickly, but our model surpasses MDA2 well before 500,000 iterations (see Figure 3).
Figure 3: Empirical mutual information as a function of the number of iterations. Solid line: our model; dashed line: MDA2. κ = 148.
Finally, we demonstrate the quality of the clustering by showing the distribution of the genes into the clusters for a few functional subclasses known to be regulated in experimental conditions like those of our data (Eisen et al., 1998). Note that these subclasses were not included in the values of the auxiliary variable C. Instead, they were picked from the second and third level of the functional gene classification. To characterize all the genes, the learning and the test sets were now combined. In Table 1, each gene is assigned to the cluster having the largest value of the membership function for that gene. The table reveals that many subclasses are concentrated in one of the clusters found by our algorithm. The first four subclasses (a–d) belong to the same first-level class and are placed in the same cluster, number 2. For comparison, the distribution of the same subset of genes into clusters formed by the mixture of vMFs and MDA2 is shown in Table 2. The concentration of the classes in different clusters can be summarized by the empirical mutual information within the table; the mutual information is 1.2 bits for our approach and 0.92 for the other two. In Table 1, produced by our method, three of the subclasses (c, e, and f) have been clearly divided into two clusters, suggesting a possibly biologically interesting division. Its relevance will be determined later by further biological inspection; in this article, our goal is to demonstrate that the semisupervised clustering approach can be used to explore the data set and provide potential further hypotheses about its structure.
Table 1: Distribution of Genes (Learning and Test Set Combined) of Sample Functional Subclasses into the Eight Clusters Obtained with Our Method.

                 Cluster Number
Class     1     2     3     4     5     6     7     8
a         0     1     6     0     1     0     0     0
b         0     1    16     0     0     0     0     0
c         1     5    39     1     1     4    14     3
d         0     3     8     1     0     0     0     2
e       122     1     0     2     0     2    44     2
f         3     3     1    20     0    46     2    12
g         0     1     0    21     0     0     0     7

Note: These subclasses were not used in supervising the clustering. a: the pentose-phosphate pathway; b: the tricarboxylic acid pathway; c: respiration; d: fermentation; e: ribosomal proteins; f: cytoplasmic degradation; g: organization of chromosome structure.
Table 2: Distribution of Genes (Learning and Test Set Combined) of Sample Functional Subclasses into the Eight Clusters Obtained by the Mixture of vMFs Model and MDA2.

                 Cluster Number
Class     1     2     3     4     5     6     7     8
a         0     3     1     0     0     2     1     1
b         1     1     0     0     0    14     0     1
c         3    16     2     5    23    14     4     1
d         0     9     0     2     0     0     2     1
e         0     6     1     4    32     1   125     4
f        42    12     6     8     0     4     4    11
g         3     1    10     5     0     0     2     8

Note: Both methods yield the same table for these subclasses. For an explanation of the classes, see Table 1.
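The mutual information figures quoted in the text (1.2 bits for Table 1, 0.92 bits for Table 2) can be recomputed directly from the count tables; a sketch:

```python
import math

def mutual_information_bits(counts):
    """Empirical mutual information (in bits) of a contingency table of
    counts, rows = classes, columns = clusters."""
    n = sum(sum(row) for row in counts)
    row_p = [sum(row) / n for row in counts]
    col_p = [sum(row[j] for row in counts) / n for j in range(len(counts[0]))]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c > 0:
                p = c / n
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

table1 = [  # rows a-g of Table 1
    [0, 1, 6, 0, 1, 0, 0, 0],
    [0, 1, 16, 0, 0, 0, 0, 0],
    [1, 5, 39, 1, 1, 4, 14, 3],
    [0, 3, 8, 1, 0, 0, 0, 2],
    [122, 1, 0, 2, 0, 2, 44, 2],
    [3, 3, 1, 20, 0, 46, 2, 12],
    [0, 1, 0, 21, 0, 0, 0, 7],
]
table2 = [  # rows a-g of Table 2
    [0, 3, 1, 0, 0, 2, 1, 1],
    [1, 1, 0, 0, 0, 14, 0, 1],
    [3, 16, 2, 5, 23, 14, 4, 1],
    [0, 9, 0, 2, 0, 0, 2, 1],
    [0, 6, 1, 4, 32, 1, 125, 4],
    [42, 12, 6, 8, 0, 4, 4, 11],
    [3, 1, 10, 5, 0, 0, 2, 8],
]
# mutual_information_bits(table1) is approximately 1.2 bits,
# mutual_information_bits(table2) is approximately 0.92 bits,
# in agreement with the values quoted above.
```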
4 Conclusion
We have described a soft clustering method for continuous data that minimizes the within-cluster distortion between distributions of associated, discrete auxiliary data. The approach was inspired by our earlier work, in which an explicit density estimator was used to derive an information-geometric metric for similar kinds of clustering tasks (Kaski et al., in press). The method presented here is conceptually simpler and does not require explicit density estimation, which is known to be difficult in high-dimensional spaces. The task is analogous to that of distributional clustering (Pereira et al., 1993) of multinomial data with the information bottleneck method (Tishby
et al., 1999), or learning from dyadic data (Hofmann et al., 1998). The main difference from our method is that these works operate on a discrete and finite data space, while our data are continuous. Our setup and cost function have connections to the information bottleneck method, but the approaches are not equivalent. We showed that minimizing our Kullback-Leibler divergence-based distortion criterion is equivalent to maximizing the mutual information between (neural) representations of the inputs and a discrete variable studied by Becker (Becker & Hinton, 1992; Becker, 1996). The distortion was additionally shown to be bounded by a conditional likelihood, which it approaches in the limit where the clusters sharpen toward Voronoi regions. We derived a simple on-line algorithm for optimizing the distortion measure. The convergence of the algorithm is proven for vMF basis functions in Kaski (2000). The algorithm was shown to have a close connection to competitive learning. We applied the clustering method to a yeast gene expression data set that was augmented with an independent functional classification of the genes. The algorithm performs better than other algorithms available for continuous data: the mixture of gaussians and MDA2, a model for the joint density of the expression data and the classes. Our method turned out to be relatively insensitive to the (a priori set) width parameter of the gaussian parameterization, outperforming the competing methods for a wide range of parameter values. It was shown that the obtained clusters mediate information about the function of the genes, and although the results have not yet been biologically analyzed, they potentially suggest novel cluster structures for the yeast genes.
Topics of future work include the investigation of more flexible parameterizations for the clusters, the relationship of the method to the metric defined in our earlier work, and variations of the algorithm toward visualizable clusterings and continuous auxiliary distributions.

Appendix A: Gradient of the Error Criterion with Respect to the Parameters of the Basis Functions
If we denote

$$K_{il}(x) \equiv \sum_j \frac{\partial \psi_j(x; \theta_j)}{\partial \theta_l}\, \log \psi_{ji}, \tag{A.1}$$

then

$$\frac{\partial E_{KL}}{\partial \theta_l} = -\sum_i \int K_{il}(x)\, p(c_i, x)\, dx. \tag{A.2}$$
Clustering in an Auxiliary Space
235
Since the basis functions ψ_j(x; θ_j) are of the normalized exponential form (see equation 2.4), the gradient of ψ_j is5

$$\frac{\partial \psi_j(x; \theta_j)}{\partial \theta_l} = \delta_{lj}\, \psi_j(x; \theta_j)\, \frac{\partial f_j(x; \theta_j)}{\partial \theta_l} - \psi_j(x; \theta_j)\, \psi_l(x; \theta_l)\, \frac{\partial f_l(x; \theta_l)}{\partial \theta_l}. \tag{A.3}$$
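Written in terms of the exponents f_j, equation A.3 is the familiar derivative of a normalized exponential, ∂ψ_j/∂f_l = ψ_j(δ_jl − ψ_l), which can be checked numerically; a sketch (names illustrative):

```python
import math

def softmax(f):
    """Normalized exponential psi_j = exp(f_j) / sum_k exp(f_k)."""
    m = max(f)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in f]
    s = sum(e)
    return [v / s for v in e]

def softmax_grad(f, j, l):
    """Analytic derivative d psi_j / d f_l = psi_j * (delta_jl - psi_l)."""
    psi = softmax(f)
    return psi[j] * ((1.0 if j == l else 0.0) - psi[l])

def finite_diff(f, j, l, h=1e-6):
    """Central finite-difference estimate of the same derivative."""
    fp = list(f); fp[l] += h
    fm = list(f); fm[l] -= h
    return (softmax(fp)[j] - softmax(fm)[j]) / (2 * h)
```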
Substituting the result into equation A.1 gives

$$K_{il}(x) = \psi_l(x; \theta_l) \Bigl[\log \psi_{li} - \sum_j \psi_j(x; \theta_j) \log \psi_{ji}\Bigr] \frac{\partial f_l(x; \theta_l)}{\partial \theta_l} = \psi_l(x; \theta_l) \Bigl[\sum_j \psi_j(x; \theta_j) \log \frac{\psi_{li}}{\psi_{ji}}\Bigr] \frac{\partial f_l(x; \theta_l)}{\partial \theta_l}. \tag{A.4}$$
Since V and C are conditionally independent with respect to X, we may write

$$\psi_j(x; \theta_j)\, \psi_l(x; \theta_l)\, p(c_i, x) = p(c_i, v_j, v_l, x). \tag{A.5}$$
Substituting this and equation A.4 into A.2, we arrive at equation 2.5.

Appendix B: Connection to Conditional Density Estimation
Let us denote ψ_j(x; θ_j) = ψ_j(x) for brevity, and note that the conditional entropy of C given X, or H(C | X) = −∫ Σ_i p(c_i, x) log p(c_i | x) dx, is independent of the parameters θ_j and ψ_ji. The distortion of equation 2.3 can then be expressed as

$$E_{KL} = -\sum_{i,j} \int \psi_j(x) \log \psi_{ji}\; p(c_i, x)\, dx - H(C \mid X) + \text{const.} \tag{B.1}$$

$$= \int \sum_i \Bigl[\log p(c_i \mid x) - \sum_j \psi_j(x) \log \psi_{ji}\Bigr] p(c_i, x)\, dx + \text{const.} \tag{B.2}$$

$$= \int \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{\exp \sum_j \psi_j(x) \log \psi_{ji}}\; p(x)\, dx + \text{const.} \tag{B.3}$$

$$= \int \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{q_i(x)}\; p(x)\, dx + \text{const.}, \tag{B.4}$$
5 Note that the normalized versions of the densities of the so-called exponential family are included, if the ψ_j(x; θ_j) are interpreted as densities of the random variables p(v_j | x).
where

$$q_i(x) \equiv \exp \sum_j \psi_j(x) \log p(c_i \mid v_j). \tag{B.5}$$

The {q_i(x)}_i is not a proper density. However, Σ_i q_i(x) ≤ 1 based on Jensen's inequality. Hence, for all x,

$$\sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{q_i(x)}\; p(x) \;\ge\; \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{\hat p(c_i \mid x)}\; p(x), \tag{B.6}$$

where

$$\hat p(c_i \mid x) \equiv \frac{q_i(x)}{\sum_i q_i(x)}. \tag{B.7}$$
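The bound Σ_i q_i(x) ≤ 1 holds because q_i is a ψ-weighted geometric mean over clusters of the p(c_i | v_j), which is bounded by the corresponding arithmetic mean, and the arithmetic means sum to one over i. A numeric sketch of the check (names illustrative):

```python
import math
import random

def q_values(psi, p_c_given_v):
    """q_i(x) = exp(sum_j psi_j log p(c_i | v_j)) of equation B.5: a
    psi-weighted geometric mean over clusters of the class conditionals."""
    n_c = len(p_c_given_v[0])
    return [math.exp(sum(pj * math.log(p_c_given_v[j][i])
                         for j, pj in enumerate(psi)))
            for i in range(n_c)]

def random_simplex(n, rng):
    """A random probability vector of length n (strictly positive)."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]
```

Drawing random memberships psi and random conditionals p(c | v_j) never produces a sum above one.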
Therefore, minimizing the clustering criterion E_KL minimizes an upper limit of

$$\int \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{\hat p(c_i \mid x)}\; p(x)\, dx = E_X\{D_{KL}(p(c \mid x), \hat p(c \mid x))\}, \tag{B.8}$$

the expected KL divergence between the conditional density p(c | x) and its estimate p̂(c | x). (Here E_X{·} denotes the expectation over x.) One can also write

$$E_X\{D_{KL}(p(c \mid x), \hat p(c \mid x))\} = \int \sum_i p(c_i, x) \log \frac{p(c_i, x)}{\hat p(c_i \mid x)\, p(x)}\, dx = D_{KL}(p(c, x), \hat p(c \mid x)\, p(x)). \tag{B.9}$$
Maximizing the likelihood of the model p̂(c_k | x_k) for data {(c_k, x_k)}_k sampled from p(c, x) is asymptotically equivalent to minimizing the Kullback-Leibler divergence, equation B.9. This is because for the N independent and identically distributed samples {(c_k, x_k)}_k, the scaled log likelihood,

$$\frac{1}{N} \sum_k \log \hat p(c_k \mid x_k),$$

converges to

$$\int \sum_i p(c_i, x) \log \hat p(c_i \mid x)\, dx = -\int \sum_i p(c_i \mid x) \log \frac{p(c_i \mid x)}{\hat p(c_i \mid x)}\; p(x)\, dx + \text{const.} = -E_X\{D_{KL}(p(c \mid x), \hat p(c \mid x))\} + \text{const.} \tag{B.10}$$
with probability 1 when N → ∞, and because maximizing equation B.10 minimizes equation B.9. In the special case when the ψ_j(x) are binary valued, q_i(x) = p̂(c_i | x) is a proper density and the equality in equation B.6 holds for almost every x. Therefore, the minimization of (an approximation of) the average distortion E_KL on a finite data set is equivalent to maximizing the likelihood of p̂(c | x) on the same sample.

Acknowledgments
This work was supported by the Academy of Finland, in part by grant 50061. We thank Janne Nikkilä, Petri Törönen, and Jaakko Peltonen for their help with the simulations and processing of the data.

References

Becker, S. (1996). Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, 7, 7–31.
Becker, S., & Hinton, G. E. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161–163.
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, Jr., M., & Haussler, D. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, USA, 97, 262–267.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Didday, R. L. (1976). A model of visuomotor mechanisms in the frog optic tectum. Mathematical Biosciences, 30, 169–180.
Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, USA, 95, 14863–14868.
Fisher III, J. W., & Principe, J. (1998). A methodology for information theoretic feature extraction. In Proc. IJCNN'98, International Joint Conference on Neural Networks (pp. 1712–1716). Piscataway, NJ: IEEE Service Center.
Forgy, E. W. (1965). Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21, 768–769.
Gersho, A. (1979). Asymptotically optimal block quantization. IEEE Transactions on Information Theory, 25, 373–380.
Gray, R. M. (1984, April). Vector quantization. IEEE ASSP Magazine, 4–29.
Grossberg, S. (1976). On the development of feature detectors in the visual cortex with applications to learning and reaction-diffusion systems. Biological Cybernetics, 21, 145–159.
Hastie, T., Tibshirani, R., & Buja, A. (1995). Flexible discriminant and mixture models. In J. Kay & D. Titterington (Eds.), Neural networks and statistics. New York: Oxford University Press.
Hofmann, T. (2000). Learning the similarity of documents: An information-geometric approach to document retrieval and categorization. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 914–920). Cambridge, MA: MIT Press.
Hofmann, T., Puzicha, J., & Jordan, M. I. (1998). Learning from dyadic data. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 466–472). San Mateo, CA: Morgan Kaufmann.
Jaakkola, T. S., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 487–493). San Mateo, CA: Morgan Kaufmann.
Kaski, S. (2000). Convergence of a stochastic semisupervised clustering algorithm (Tech. Rep. No. A62). Espoo, Finland: Helsinki University of Technology.
Kaski, S., & Kohonen, T. (1994). Winner-take-all networks for physiological models of competitive learning. Neural Networks, 7, 973–984.
Kaski, S., Sinkkonen, J., & Peltonen, J. (in press). Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks.
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Kohonen, T. (1993). Physiological interpretation of the self-organizing map algorithm. Neural Networks, 6, 895–905.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer.
Kohonen, T., & Hari, R. (1999). Where the abstract feature maps of the brain might come from. TINS, 22, 135–139.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1: Statistics (pp. 281–297). Berkeley: University of California Press.
Makhoul, J., Roucos, S., & Gish, H. (1985). Vector quantization in speech coding. Proceedings of the IEEE, 73, 1551–1588.
Mardia, K. V. (1975). Statistics of directional data. Journal of the Royal Statistical Society B, 37, 349–393.
Nass, M. M., & Cooper, L. N. (1975). A theory for the development of feature detecting cells in visual cortex. Biological Cybernetics, 19, 1–18.
Nowlan, S. J. (1990). Maximum likelihood competitive learning. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 574–582). San Mateo, CA: Morgan Kaufmann.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.
Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (pp. 183–190).
Pérez, R., Glass, L., & Shlaer, R. J. (1975). Development of specificity in cat visual cortex. Journal of Mathematical Biology, 1, 275–288.
Tipping, M. E. (1999). Deriving cluster analytic distance functions from gaussian mixture models. In Proc. ICANN99, Ninth International Conference on Artificial Neural Networks (pp. 815–820). London: IEE.
Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In 37th Annual Allerton Conference on Communication, Control, and Computing. Urbana, IL.
Torkkola, K., & Campbell, W. (2000). Mutual information in learning feature transformations. In Proc. ICML'2000, the 17th International Conference on Machine Learning (pp. 1015–1022). San Mateo, CA: Morgan Kaufmann.

Received June 28, 2000; accepted April 7, 2001.
ARTICLE
Communicated by Peter Bartlett
On the Complexity of Computing and Learning with Multiplicative Neural Networks

Michael Schmitt
[email protected]
Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
In a great variety of neuron models, neural inputs are combined using the summing operation. We introduce the concept of multiplicative neural networks that contain units that multiply their inputs instead of summing them and thus allow inputs to interact nonlinearly. The class of multiplicative neural networks comprises such widely known and well-studied network types as higher-order networks and product unit networks. We investigate the complexity of computing and learning for multiplicative neural networks. In particular, we derive upper and lower bounds on the Vapnik-Chervonenkis (VC) dimension and the pseudo-dimension for various types of networks with multiplicative units. As the most general case, we consider feedforward networks consisting of product and sigmoidal units, showing that their pseudo-dimension is bounded from above by a polynomial with the same order of magnitude as the currently best-known bound for purely sigmoidal networks. Moreover, we show that this bound holds even when the unit type, product or sigmoidal, may be learned. Crucial for these results are calculations of solution set components bounds for new network classes. As to lower bounds, we construct product unit networks of fixed depth with superlinear VC dimension. For sigmoidal networks of higher order, we establish polynomial bounds that, in contrast to previous results, do not involve any restriction of the network order. We further consider various classes of higher-order units, also known as sigma-pi units, that are characterized by connectivity constraints. In terms of these, we derive some asymptotically tight bounds. Multiplication plays an important role in both neural modeling of biological behavior and computing and learning with artificial neural networks. We briefly survey research in biology and in applications where multiplication is considered an essential computational element.
The results we present here provide new tools for assessing the impact of multiplication on the computational power and the learning capabilities of neural networks. Neural Computation 14, 241–301 (2001)
© 2001 Massachusetts Institute of Technology
242
Michael Schmitt
1 Introduction

Neurons compute by receiving signals from a large number of other neurons and processing them in a complex way to yield output signals sent to other neurons again. A major issue in the formal description of single neuron computation is how the input signals interact and jointly affect the processing that takes place further. In a great many neuron models, this combination of inputs is specified using a linear summation. The McCulloch-Pitts model and the sigmoidal neuron are examples of these summing neurons, which are very popular in applications of artificial neural networks. Neural network researchers widely agree that there is only a minor correspondence between these neuron models and the behavior of real biological neurons. In particular, the interaction of synaptic inputs is known to be essentially nonlinear (see, e.g., Koch, 1999). In search of biologically closer models of neural interactions, neurobiologists have found that multiplicative-like operations play an important role in single neuron computations (see also Koch & Poggio, 1992; Mel, 1994). For instance, multiplication models nonlinearities of dendritic processing and shows how complex behavior can emerge in simple networks. In recent years, evidence has also accumulated that specific neurons in the nervous systems of several animals compute in a multiplicative way (Andersen, Essick, & Siegel, 1985; Suga, 1990; Hatsopoulos, Gabbiani, & Laurent, 1995; Gabbiani, Krapp, & Laurent, 1999; Anzai, Ohzawa, & Freeman, 1999a, 1999b). That multiplication increases the computational power and storage capacities of neural networks is well known from extensions of artificial neural networks where this operation appears in the form of higher-order units (see, e.g., Giles & Maxwell, 1987). A more general type of multiplicative neuron model is the product unit introduced by Durbin and Rumelhart (1989), where inputs are multiplied after they have been raised to some power specified by an adjustable weight.
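A product unit in the sense just described computes the product of its inputs raised to adjustable powers; a minimal sketch (positive inputs assumed, so that the powers stay real-valued):

```python
import math

def product_unit(x, w):
    """Output of a product unit: each input raised to an adjustable power,
    then multiplied. For positive inputs this equals exp(sum_i w_i log x_i),
    i.e., an ordinary summing unit operating in the log domain."""
    out = 1.0
    for xi, wi in zip(x, w):
        out *= xi ** wi
    return out

# With nonnegative integer weights the unit reduces to a monomial, the
# building block of higher-order units: product_unit([2.0, 3.0], [2, 1])
# yields 2**2 * 3**1 = 12.0.
```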
We subsume networks containing units that multiply their inputs instead of summing them under the general concept of multiplicative neural networks and investigate the impact that multiplication has on their computational and learning capabilities. A theoretical tool that quantifies the complexity of computing and learning with function classes in general, and neural networks in particular, is the Vapnik-Chervonenkis (VC) dimension. In this article, we provide a theoretical study of the complexity of computing and learning with multiplicative neural networks in terms of this dimension. The VC dimension and related notions, such as the pseudo-dimension and the fat-shattering dimension, are well known to yield estimates for the number of examples required by learning algorithms for neural networks and other hypothesis classes, such that training results in a low generalization error. Using these dimensions, bounds on this sample complexity of learning can be obtained not only for the model of probably approximately correct (pac) learning due to Valiant (1984) (see also Blumer, Ehrenfeucht,
Multiplicative Neural Networks
Haussler, & Warmuth, 1989), but also for the more general model of agnostic learning, that is, when the training examples are generated by some arbitrary probability distribution (see, e.g., Haussler, 1992; Maass, 1995a, 1995b; Anthony & Bartlett, 1999). Furthermore, in terms of the VC dimension, bounds have also been established for the sample complexity of on-line learning (Maass & Turán, 1992) and Bayesian learning (Haussler, Kearns, & Schapire, 1994). The VC dimension, however, is not only useful in the analysis of learning but has also proven to be a successful tool for studying the complexity of computing, in particular over the real numbers. Koiran (1996) and Maass (1997) employed the VC dimension to establish lower bounds on the size of sigmoidal neural networks for the computation of functions. Further, using the VC dimension, limitations of the universal approximation capabilities of sigmoidal neural networks have been exhibited by the derivation of lower bounds on the size of networks that approximate continuous functions (Schmitt, 2000). Thus, in particular for neural networks, the VC dimension has acquired a wide spectrum of applications in analyzing the complexity of analog computing and learning. There are some known bounds on the VC dimension for neural networks that also partly include multiplicative units. These results are concerned with networks of higher order where the linearly weighted sum of a classical summing unit is replaced by polynomials of a certain degree. All bounds determined thus far, however, are given in terms of the maximum order of the units or require the order to be fixed. This is a severe restriction; it not only imposes a bound on the exponent of each individual input but also limits the degree of interaction among different inputs.
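To make the notion concrete: the VC dimension of a function class is the size of the largest point set it shatters, that is, on which it realizes every dichotomy. A brute-force illustration for one-input threshold units (the helper names and the small weight grid below are our own illustrative choices, not from the article):

```python
# Brute-force shattering check: a set of m points is shattered if the
# class realizes all 2^m dichotomies on it.

def shatters(hypotheses, points):
    dichotomies = {tuple(h(x) for x in points) for h in hypotheses}
    return len(dichotomies) == 2 ** len(points)

def threshold_unit(w, t):
    """One-input threshold unit sgn(w*x - t), with sgn(y) = 1 iff y >= 0."""
    return lambda x: 1 if w * x - t >= 0 else 0

# A small grid of parameters suffices for these tiny point sets.
H = [threshold_unit(w, t) for w in (-1.0, 1.0)
     for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]

one_shattered = shatters(H, [0.5])              # True
two_shattered = shatters(H, [0.0, 1.0])         # True
three_shattered = shatters(H, [0.0, 1.0, 2.0])  # False: no threshold unit
# can label the middle point differently from both outer points, so the
# VC dimension of one-input threshold units is 2.
```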
A network with units computing piecewise polynomial functions of order at most $d$ is known to have VC dimension $O(W^2 d)$, where $W$ is the number of network parameters (Goldberg & Jerrum, 1995; Ben-David & Lindenbaum, 1998). For sigmoidal networks of higher order, Karpinski and Macintyre (1997) established the bound $O(W^2 k^2 \log d)$, where $k$ is the number of network nodes. There are some further results for other network types of which we give a more complete account in a later section. They all consider the degree of higher-order units to be restricted. In this article, we derive bounds on the VC dimension and the pseudo-dimension of product unit networks. In these networks, the exponents are no longer fixed but are treated as variable weights. In addition, no restriction is imposed on their order. We show that a feedforward network consisting of product and sigmoidal units has pseudo-dimension at most $O(W^2 k^2)$, where $W$ is the number of parameters and $k$ the number of nodes. Hence, the same bound that is known for higher-order sigmoidal networks of restricted degree also holds when product units are used instead of monomials. Moreover, the bound is valid not only when the types of the nodes are fixed but also when they can vary between a sigmoidal unit and a product unit. This may be the case, for instance, when a learning algorithm decides which unit type to assign to which node. The results are based on the method of solution
set components bounds (see Anthony & Bartlett, 1999), and we derive new such bounds. We use this method also for showing that a network with $k$ higher-order sigmoidal units and $W$ parameters has pseudo-dimension at most $O(W^4 k^2)$, a bound that also includes the exponents as adjustable parameters. Thus, the VC dimension and the pseudo-dimension of these networks cannot grow indefinitely in terms of the order, and the exponents can be treated as weights analogous to the coefficients of the polynomials. These results indicate that, from the theoretical viewpoint, multiplication can be considered as an alternative to summation without significantly increasing the complexity of computing and learning with higher-order networks. We further derive bounds for specific classes of higher-order units. Considering a higher-order unit as a network with one hidden layer, we define these classes in terms of constraints on the network connectivity. On the one hand, we restrict the number of connections outgoing from the input nodes; on the other hand, we put a limit on the number of hidden nodes. We derive various VC dimension bounds for these classes, order dependent as well as order independent, and show for some of them that they are asymptotically tight. It might certainly be possible to embed a class of networks entirely into a single network. The results show, however, that smaller bounds arise when the specific properties of the class are taken into account. We also establish a lower bound for product unit networks stating that networks with two hidden layers of product and summing units have a superlinear VC dimension. Finally, we show that the pseudo-dimension of the single product unit and of the class of monomials is equal to the number of input variables. We focus on feedforward networks with a single output node. This means that at least two possible directions are not pursued further here: recurrent networks and networks with multiple output nodes.
VC dimension bounds for recurrent networks consisting of summing units, including threshold, sigmoidal, and higher-order units with fixed degree, have been established by Koiran and Sontag (1998). Shawe-Taylor and Anthony (1991) give bounds on the sample complexity for networks of threshold units with more than one output node. Since both of these works build on previous bounds for feedforward networks or single-output networks, respectively, the ideas presented here may be helpful for deriving similar results for recurrent or multiple-output networks containing multiplicative units. The article is organized as follows. In section 2 we introduce the necessary terminology and demarcate multiplicative from summing units. We then report on some research in neurobiology that resulted in the use of multiplication for the modeling of biological neural systems. Further, we give a brief review of some learning applications where multiplication, mainly in the form of higher order, has been employed to increase the capabilities of artificial neural networks. In section 3 we introduce the definitions of the VC dimension and the pseudo-dimension, and exhibit some close relationships between these two combinatorial characterizations of function classes. In this section we also survey previous results where bounds on
these dimensions have been obtained for neural networks. The new results follow in the two subsequent sections: section 4 contains the calculations of the upper bounds, and lower bounds are derived in section 5. Finally, in section 6 we give a summary of the results in a table and conclude with some remarks and open questions. Proofs of some technically more involved results, which are needed in section 4, are in the appendix.

2 Neural Networks with Multiplicative Units

Terminology in the neural network literature varies and is occasionally used inconsistently. We introduce the terms and concepts that we shall adhere to throughout this article and present some of the biological and application-specific motivations that led researchers to the use of multiplicative units in neural networks.

2.1 Neural Network Terminology. The connectivity of a neural network is given in terms of an architecture, which is a graph with directed edges, or connections, between the nodes. Nodes with no incoming edges are called input nodes; nodes with no outgoing edges are output nodes. All remaining ones are hidden nodes. The computation nodes of an architecture are its output and hidden nodes. Input nodes serve as input variables of the network. The fan-in of a node is the number of connections entering the node; correspondingly, its fan-out is the number of connections leaving it. The architecture of a feedforward network has no cycles. We focus on feedforward networks with one output node, such as shown in Figure 1, that are suitable for computing functions with a scalar, that is, one-dimensional, output range. Some specific architectures are said to be layered. In this case, all nodes with equal distance from the input nodes constitute one layer, and edges exist only between subsequent layers. An architecture becomes a neural network when edges and nodes are labeled with weights and thresholds, respectively, as the network parameters.
To specify further which function is to be computed by the network, variables must be assigned to the input nodes and units selected for the computation nodes. The types of units studied in this article will be defined in the following section. Finally, when values, that is, in general real numbers, are specified for the network parameters, the network computes a unique function defined by functional composition of its units.

2.2 Neuron Models That Sum or Multiply. The networks we are considering consist of multiplicative as well as standard summing units. First, we briefly recall the definitions of the latter.

2.2.1 Summing Units. The three most widely used summing units are the threshold unit, the sigmoidal unit, and the linear unit. Each of them is parameterized by a set of weights $w_1, \ldots, w_n \in \mathbb{R}$, where $n$ is the number
Figure 1: Neural network terminology. An architecture is shown with four input nodes, one hidden layer consisting of three hidden nodes, and one output node. Hidden and output nodes are computation nodes. All input nodes have fan-out 3; the hidden nodes have fan-in 4. We also consider architectures with fan-out restrictions. In this case, subsequent layers need not be fully connected. When parameters, that is, weights and thresholds, are assigned and node functions are specified by choosing units, the architecture becomes a neural network.
of input variables, and a threshold $t \in \mathbb{R}$. They compute their output in the form $f(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n - t)$, where $x_1, \ldots, x_n$ are the input variables with domain $\mathbb{R}$ and $f$ is a nonlinear function, referred to as the activation function of the unit. The particular choice of the activation function characterizes the kind of unit. The threshold unit, also known as the McCulloch-Pitts neuron (after McCulloch & Pitts, 1943) or perceptron,¹ uses for $f$ the sign or Heaviside function: $\mathrm{sgn}(y) = 1$ if $y \ge 0$, and $\mathrm{sgn}(y) = 0$ otherwise. The logistic function $\sigma(y) = 1/(1 + e^{-y})$ is the activation function employed by the sigmoidal unit.² Finally, we have a linear unit if the activation function is chosen to be the identity; that is, the linear unit plainly outputs the weighted sum $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n - t$. In other words, the linear unit computes an affine function. The activation function also defines the output ranges of the units: $\{0, 1\}$, $(0, 1)$, and $\mathbb{R}$ for the threshold, sigmoidal, and linear unit, respectively. Linear units as computation nodes in neural networks are in general redundant if they feed their outputs only to summing units. For instance, a network consisting solely of linear units is equivalent to a single linear unit. They are mainly used as output nodes of networks that compute or approximate real-valued functions (see the survey by Pinkus, 1999). However, linear units as hidden nodes can help to save network connections.

¹ Perceptron is a widely used synonym for the threshold unit, although in the original definition by Rosenblatt (1958), the perceptron is introduced as a network of threshold units with one hidden layer (see also Minsky & Papert, 1988).

² There is a large class of sigmoidal functions of which the logistic function is only one instance. The sigmoidal unit as it is defined here is also referred to as the standard sigmoidal unit.

2.2.2 Multiplicative Units. The simplest type of a multiplicative unit is a monomial, that is, a product $x_1^{d_1} x_2^{d_2} \cdots x_n^{d_n}$, where $x_1, x_2, \ldots, x_n$ are input variables. Each $d_i$ is a nonnegative integer referred to as the degree or exponent of the variable $x_i$. The value $d_1 + \cdots + d_n$ is called the order of the monomial. Since monomials restricted to the Boolean domain $\{0, 1\}$, where all nonzero exponents are 1 without loss of generality, compute the logical AND function, they are often considered as computing the continuous analog of a logical conjunction. A Boolean conjunction, however, does not necessarily require the use of multiplication because the logical AND can also be computed by a threshold unit. The same holds for the neural behavior known as shunting inhibition, which is sometimes referred to as a continuous AND-NOT operation. In this article, we do not consider conjunction as a multiplicative operation and use monomials as computational models for genuine multiplicative behavior with real-valued input and output. If $M_1, \ldots, M_k$ are monomials, a higher-order unit is a polynomial $w_1 M_1 + w_2 M_2 + \cdots + w_k M_k - t$ with real-valued weights $w_1, \ldots, w_k$ and threshold $t$. It is also well known under the name sigma-pi unit (Rumelhart, Hinton, & McClelland, 1986; Williams, 1986). As in the case of a summing unit, weights and threshold are parameters, but to uniquely identify a higher-order unit, one also has to specify the structural parameter of the unit, that is, the set of monomials $\{M_1, \ldots, M_k\}$. We call this set the structure of a higher-order unit.
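As a concrete illustration, the unit types just defined can be sketched in a few lines of code (a minimal sketch; the function names are our own choices, not the article's):

```python
import math

def threshold_unit(w, t, x):
    """sgn(w.x - t): outputs 1 if the weighted sum reaches the threshold."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - t
    return 1 if s >= 0 else 0

def sigmoidal_unit(w, t, x):
    """Logistic activation of the weighted sum; output range (0, 1)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - t
    return 1.0 / (1.0 + math.exp(-s))

def linear_unit(w, t, x):
    """Identity activation: the affine function w.x - t."""
    return sum(wi * xi for wi, xi in zip(w, x)) - t

def monomial(d, x):
    """x1^d1 * ... * xn^dn with nonnegative integer exponents d."""
    out = 1.0
    for di, xi in zip(d, x):
        out *= xi ** di
    return out

def higher_order_unit(w, t, structure, x):
    """Sigma-pi unit: a linear unit applied to a set of monomials."""
    return sum(wi * monomial(d, x) for wi, d in zip(w, structure)) - t

# Example: 1.0*x1*x2 + 2.0*x1^2 - 0.5 at (x1, x2) = (2, 3) gives 13.5.
y = higher_order_unit([1.0, 2.0], 0.5, [(1, 1), (2, 0)], (2.0, 3.0))
```

Note how the structure (the list of exponent tuples) is passed separately from the weights, mirroring the distinction between structural and weight parameters made above.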
A higher-order unit can be viewed as a network with one hidden layer of monomials and a linear unit as output node. For this reason, a higher-order unit is often referred to in the literature as a higher-order network. In such a network, only the connections leading from hidden nodes to the output node have weights, whereas the connections between input and hidden layer are weightless. Clearly, assigning multiplicative weights to input variables does not increase the power of a higher-order unit since, due to the multiplicative nature of the hidden nodes, all such weights can be moved forward to the output node. For higher-order units, it is also common, depending on the type of application, to use a threshold or a sigmoidal unit instead of a linear unit as output node. This yields a higher-order unit that computes $f(w_1 M_1 + w_2 M_2 + \cdots + w_k M_k - t)$
with nonlinear activation function $f$. In case $f$ is the sign function, we have a higher-order threshold unit, or polynomial threshold unit. If the activation function is sigmoidal, we refer to it as a higher-order sigmoidal unit. Finally, we introduce the most general type of multiplicative unit, the product unit. It has the form

$$x_1^{w_1} x_2^{w_2} \cdots x_p^{w_p}$$
with variables $x_1, \ldots, x_p$ and weights $w_1, \ldots, w_p$. The number $p$ of variables is called the order of the product unit. In contrast to the monomial, the product unit does not have fixed integers as exponents but variable ones that may even take on arbitrary real values. Thus, a product unit is computationally at least as powerful as a monomial. Moreover, ratios of variables, and thus division, can be expressed using negative weights. This is one reason the product unit was introduced by Durbin and Rumelhart (1989). Another advantage that these and other authors explore is that the exponents of product units can be adjusted automatically, for instance, by gradient-based and other learning methods (Durbin & Rumelhart, 1989; Leerink, Giles, Horne, & Jabri, 1995a, 1995b). We mentioned above the well-known fact that networks of linear units are equivalent to single units. The same holds for networks consisting solely of product units. Unless there is a restriction on the order, such a network can be replaced by a single unit. Therefore, product units are mainly used in networks where they occur together with other types of units, such as threshold or sigmoidal units. For instance, the standard neural network containing product units is an architecture with one hidden layer, such as shown in Figure 1, where the hidden nodes are product units and the output node is a sigmoidal unit. A legitimate question is whether multiplicative units are actually needed in neural networks and whether their task can be done by some reasonably sized network of summing units. Indeed, for multiplication and exponentiation of integers in binary representation, networks of threshold units with a small number of layers have been constructed that perform these and other arithmetic operations (for surveys, see Hofmeister, 1994; Siu, Roychowdhury, & Kailath, 1995).
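A sketch of a product unit and of the standard one-hidden-layer architecture just described (illustrative names; we assume positive inputs so that arbitrary real exponents are well defined):

```python
import math

# A product unit x1^w1 * ... * xp^wp with real-valued exponents, and a
# network with a hidden layer of product units feeding a sigmoidal
# output unit. Inputs are assumed positive.

def product_unit(w, x):
    out = 1.0
    for wi, xi in zip(w, x):
        out *= xi ** wi
    return out

def product_unit_network(exponents, output_weights, threshold, x):
    hidden = [product_unit(w, x) for w in exponents]       # product units
    s = sum(v * h for v, h in zip(output_weights, hidden)) - threshold
    return 1.0 / (1.0 + math.exp(-s))                      # sigmoidal output

# Negative exponents express division: x1 / x2 = x1^1 * x2^(-1).
ratio = product_unit([1.0, -1.0], (6.0, 3.0))   # 2.0
```

Note that such a unit multiplies real-valued inputs directly, in contrast to the threshold unit constructions mentioned above, which operate on binary-encoded integers.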
However, the drawback with these threshold unit networks is that they grow in size, albeit polynomially, with the number of bits required for the representation of the numbers to be multiplied. Such a property is certainly disastrous when one would like to process real numbers with possibly infinite precision. Therefore, networks of summing units do not seem to be an adequate substitute for genuine multiplicative units.

2.2.3 Units versus Neurons. We have introduced all neuron models that will be considered in this article as units. As is common in the neural network literature, we could have called them neurons or gates synonymously, so that we could speak of a sigmoidal neuron or a product gate. In what follows, we
shall always use the term unit for the computational element of a neural network, thereby viewing it as an abstraction for the functionality it represents. In an artificial neural network, a unit may be implemented by connecting several model neurons together, such as when monomials and a linear unit form a higher-order unit. A unit may also serve as a model for information processing in some part of a single biological neuron, such as when monomials or product units are used to model nonlinear interactions of synaptic inputs in dendritic trees.

2.3 Multiplication in Biological Neural Networks. There are several reasons that neurobiologists study multiplication as a computational mechanism underlying the behavior of neural systems. First, it can be used to model the nonlinearities involved in dendritic processing of synaptic inputs. Second, it is shown to arise in the output activity of individual model neurons or in neural populations. Third, it can be employed to explain how simple model networks can achieve complex behavior with biologically plausible methods. And finally, multiplicative neurons are found in real neural networks. In the following, we mention some of the research that has been done in each of these directions. Further references regarding the fundamental role and the evidence of multiplication-like operations on all levels of neural information processing can be found in Koch and Poggio (1992) and Koch (1999).

2.3.1 Dendritic Multiplication and Division. In a large quantity of neuron models, interaction of synaptic inputs is modeled as a linear operation. Thus, synaptic inputs are combined by adding them.
This is evident for the summing units introduced above, but it holds also for the more complex and biologically closer model of the leaky integrate-and-fire neuron (see, e.g., Softky & Koch, 1995; Koch, 1999, for surveys of single-neuron models).³ Linearity is believed to be sufficient for capturing the passive, or cable, properties of the dendritic membrane where synaptic inputs are currents that add. From numerous studies using recording experiments or computer simulations, sufficient evidence has arisen that synaptic inputs can interact nonlinearly when the synapses are co-localized on patches of dendritic membrane with specific properties. Thus, the spatial grouping of synapses on the dendritic tree is reflected in the computations performed at local branches. The summing operation, due to its associativity merely representing the dendrite as an amorphous device, does not capture this. Consequently, these local computations must be nonlinear, thereby enriching the computational power of the neuron by nonlinear computations that take place in the dendritic tree prior to the central nonlinearity of the neuron, the threshold. It has been convincingly argued that these dendritic nonlinearities should be modeled by multiplication. For instance, performing extensive computer experiments, Mel (1992b, 1993) has found that excitatory voltage-dependent membrane mechanisms, such as NMDA receptor channels, could form a basis for multiplicative interactions among neighboring synapses. The exhibited so-called cluster sensitivity can give rise to complex pattern recognition tasks performed by single neurons. This is also shown in a learning experiment where a biologically plausible, Hebbian-like learning rule is introduced to manipulate the spatial ordering of synaptic connections onto the dendritic tree (Mel, 1992a). Multiplicative-like operations in dendritic trees are also shown to take place in the form of division, an operation that is not performed by monomials and higher-order units but is available in the product units of Durbin and Rumelhart (1989) by using negative exponents (see section 2.2.2). Neurobiological results exhibiting division arise mainly in investigations concerned with shunting inhibitory synapses. For instance, in a mathematical analysis of a simple model neuron, Blomfield (1974) derives sufficient conditions for inhibitory synapses to perform division. The division operation can be seen as a continuous analog of the logical AND-NOT, also referred to as a veto mechanism, a form in which it is studied by Koch, Poggio, and Torre (1983). Using computer simulations, they show that inhibitory synapses are able to perform such an analog veto operation. The capability of computing division is also found to be essential in a model constructed by Carandini and Heeger (1994) for the responses of cells in the visual cortex. The divisive operation of the cell membrane of the model neuron explains how nonlinear operations required for selectivity to various visual input stimuli, such as position, orientation, and motion direction, can be achieved by single neurons. These authors also show that their theoretical results compare well with physiological data from the monkey primary visual cortex. Perhaps one of the earliest modeling studies involving nonlinear dendritic processing, albeit not explicitly multiplicative, is due to Feldman and Ballard (1982), who demonstrate how complex cognitive tasks can be implemented in a biologically plausible way using their model neurons. For a detailed account of further ideas and results about dendritic computation, we refer to the review by Mel (1994).

³ The linearity referred to here concerns the way several inputs interact, and not the neural response to a single synaptic input. Leaky integrate-and-fire models treat single synaptic inputs as nonlinearities by describing the course of the postsynaptic response as a nonlinear function in time. Here, we do not take detailed temporal effects into account but consider synaptic inputs as represented by weighted analog variables.

2.3.2 Model Neurons and Networks That Multiply. Beyond using multiplication for modeling dendritic computation, investigations have been aimed at revealing that entire neurons can function as multipliers. Even in the early days of cybernetics research, it was known that under certain conditions, a coincidence-detecting neuron performs a multiplication by transforming the frequencies of input spikes arriving at two synaptic sites into an output
spike frequency that is proportional to the product of the input frequencies (Küpfmüller & Jenik, 1961). Similar studies based on slightly different neuron models have been carried out by Srinivasan and Bernard (1976), Bugmann (1991), and Rebotier and Droulez (1994), with comparable results. Bugmann (1991) argues that proximal synapses lead to a multiplicative behavior, while with distal synapses the neuron operates in a summation mode. A further model of Bugmann (1992) includes time-dependent synaptic weights for the compensation of a problem caused by irregularities in input spike trains, which can deteriorate multiplicative behavior. Tal and Schwartz (1997) show that even the summing operation can be used to compute multiplication. They find sufficient conditions for a leaky integrate-and-fire neuron to compute a nonlinearity close to the logarithm. In this way, the logarithm of a product can be obtained by summing the outputs of leaky integrate-and-fire neurons. By means of the ln-exp transform $xy = e^{\ln x + \ln y}$, the logarithm is also a key ingredient for the biophysical implementation of multiplication proposed by Koch and Poggio (1992) (see also Koch, 1999). As such, the transform is also suggested by Durbin and Rumelhart (1989) for their product units. Multiplication is shown to arise not only in single elements but also in an ensemble of neurons, where it emerges as a property of the network. Salinas and Abbott (1996) show by computer simulations that population effects in a recurrent network can lead to multiplicative neural responses even when the individual neurons are not capable of computing a product.

2.3.3 Complex Neural Behavior Through Multiplication. Assuming that multiplicative operations can be carried out in the nervous system, several researchers have established that model neural networks relying on multiplication are capable of solving complex tasks in a biologically plausible way.
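The ln-exp transform mentioned above can be made explicit in a one-line sketch (a minimal illustration; positive inputs assumed, function name is ours):

```python
import math

# Multiplication through summation: since x*y = exp(ln x + ln y), a unit
# that only sums its inputs multiplies them once the inputs are passed
# through a logarithm and the output through an exponential (x, y > 0).

def multiply_via_summation(x, y):
    return math.exp(math.log(x) + math.log(y))
```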
Poggio (1990) proposes a general framework for the approximate representation of high-dimensional mappings using radial basis functions. In particular, he shows that multiplication offers a computationally powerful and biologically realistic possibility of synthesizing high-dimensional gaussian radial basis functions from low dimensions (see also Poggio and Girosi, 1990a, 1990b). Building on this theory, Pouget and Sejnowski (1997) demonstrate that basis functions could be used by neurons in the parietal cortex to encode sensory inputs in a format suitable for generating motor commands. The hidden units of their model network, which is trained using gradient descent, compute the product of a gaussian function of retinal location with a sigmoidal function of eye position. Learning in multiplicative basis function networks is also investigated by Mel and Koch (1990) using a Hebbian method. Olshausen, Anderson, & Van Essen (1993) suggest a mechanism for how the visual system could focus attention and achieve pattern recognition that
is invariant to position and scale by dynamic routing of information. The neural circuit contains control units that make multiplicative contacts in order to modulate the strengths of synaptic connections dynamically. The multiplicative network of Salinas and Abbott (1996) is conceived for solving a similar task. They show that it can transform visual information from retinal coordinates to coordinates that represent object locations with respect to the body. That nonlinear dendritic computation can be used in visual cortical cells for translation-invariant tuning is shown by the computer simulations of Mel, Ruderman, and Archie (1998). A neuron model proposed by Maass (1998) includes input variables for the representation of firing correlations among presynaptic neurons, in addition to the common input variables that represent their firing rates. In the computation of the output, the correlation variables are multiplicatively combined with monomials of rate variables. The model accounts for various phenomena believed to be relevant for complex information processing in biological networks, such as synchronization and binding. Theoretical results show that the model neurons and networks are computationally more powerful than models that compute only in terms of firing rates.

2.3.4 Biological Multiplicative Neurons. Neurons that perform multiplicative-like computations have been identified in several biological nervous systems. Recordings performed by Andersen et al. (1985) from single neurons in the visual cortex of monkeys show that the selectivity of their receptive fields changes with the angle of gaze. Moreover, the interaction of the visual stimulus and the eye position in these neurons is found to be multiplicative. Thus, they could contribute to the encoding of spatial locations independent of eye position.
Suga, Olsen, and Butman (1990) describe arrays of neural filters in the auditory system of the bat that provide a means for processing complex sound signals by operating as multipliers. As such, they are involved in a cross-correlation analysis of distance information conveyed by echo delays (see also Suga, 1990). Investigating the visual system of the locust, Hatsopoulos et al. (1995) show that a single motion-sensitive neuron, known as the lobula giant motion detector, performs a multiplication of two independent input signals. From experimental data, they derive an algorithm that could be used in the visual system to anticipate the time of collision with approaching objects. The results reveal multiplication to be an elementary building block underlying motion detection in insects. In the work of Gabbiani et al. (1999), these investigations are continued, resulting in a confirmation and generalization of the model. In an experimental study of the visual system of the cat, Anzai et al. (1999a, 1999b) find that neural mechanisms underlying binocular interaction are based on multiplication. They show that the well-studied simple and complex cells in the visual cortex perform multiplicative operations
analogous to those that are used in an algorithm for the computation of interocular cross-correlation. These results provide a possible explanation of how the visual system could solve the stereo correspondence problem.

2.4 Learning in Artificial Multiplicative Networks. Early on in the history of artificial neural networks and machine learning, higher-order neurons were used as natural extensions of the widespread linearly weighted neuron models. One of the earliest machine learning algorithms using higher-order terms was the checker-playing program developed by Samuel (1959). Its decisions are based on computing a weighted sum including second-order Boolean conjunctions. In the 1960s, higher-order units were very common in pattern recognition, where they appear in the form of polynomial discriminant functions (Cover, 1965; Nilsson, 1965; Duda & Hart, 1973). There is an obvious reason for this popularity: higher-order units are computationally more powerful than single neurons since the higher-order terms act as hidden units. Furthermore, since these hidden units have no weights, there is no need to backpropagate errors when using gradient descent for training. Therefore, considering a higher-order unit as a linear unit in an augmented input space, all available learning methods for linear units are applicable. Later, the delta rule, developed within the framework of parallel distributed processing, was easily generalized to networks having hidden units of higher order, resulting in a learning method for backpropagating errors in networks of higher-order sigmoidal units (Rumelhart, Hinton, & McClelland, 1986; Rumelhart, Hinton, & Williams, 1986). Soon, however, higher-order networks were caught up by what is known as the curse of dimensionality. Due to the fact that the number of monomials in a higher-order unit can be exponential in the input dimension, the complete representation of a higher-order unit succumbs to a combinatorial explosion.
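This combinatorial explosion can be quantified: over $n$ variables there are $\binom{n+d}{d}$ monomials of order at most $d$, by the standard stars-and-bars count. A quick sketch (the helper name is our own):

```python
from math import comb

# Number of monomials of order at most d over n variables: count the
# nonnegative integer exponent vectors with total degree at most d.

def num_monomials(n, d):
    return comb(n + d, d)

# Even moderate degrees yield huge structures, e.g. for 100 inputs the
# number of monomials of order at most 3 already exceeds 170,000.
```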
Even if the order of the monomials is restricted by some fixed value, the number of higher-order terms, and thus the number of free parameters to manipulate, is exponential in this bound—too large for many practical applications. Therefore, networks of higher-order neurons have gained importance in cases where the number of parameters can be kept low. One area of application for these sparse higher-order networks is the recognition and classification of patterns that underlie various geometric transformations such as scale, translation, and rotation. Maxwell, Giles, Lee, and Chen (1986) describe a method for incorporating invariance properties into a network of higher-order units that leads to a significant reduction of the number of parameters (see also Giles & Maxwell, 1987; Bishop, 1995, for an explanation). Thus, faster training can also be achieved by encoding into the network a priori knowledge that need not be learned. Several studies using invariant recognition and classification experiments show that higher-order networks can be superior to standard neural networks and other methods with respect to training time and generalization capabilities (see, e.g., Giles & Maxwell, 1987; Perantonis & Lisboa, 1992;
Michael Schmitt
Schmidt & Davis, 1993; Spirkovska & Reid, 1994). Invariance properties are also established for the so-called pi-sigma networks introduced by Ghosh and Shin (1992). A pi-sigma network consists of one hidden layer of summing units and has a monomial as output unit. Compared to a sigma-pi unit, it has the advantage of reducing the number of adjustable parameters further. The dependence of the number of weights on the order is linear for a pi-sigma network, whereas it can be exponential for a higher-order unit. While invariance and pi-sigma networks are mostly used with restricted order, researchers have also striven to use the full computational power of higher-order networks, that is, without imposing any constraints on the degree of multiplicative interactions. For this to be accomplished, however, it is not sufficient simply to adopt learning methods from standard neural networks. Instead, a number of new learning algorithms have been developed that allow units of any order but enforce sparseness by incrementally building up higher orders and adding units only when necessary. Instances of such algorithms can be found in the work of Redding, Kowalczyk, and Downs (1993), Ring (1993), Kowalczyk and Ferrá (1994), Fahner and Eckmiller (1994), Heywood and Noakes (1995), and Roy and Mukhopadhyay (1997). A constructive algorithm for a class of pi-sigma networks, called ridge polynomial networks, is proposed by Shin and Ghosh (1995). A method that replaces units in a quadratic classifier with a limited number of terms is devised in the early work of Samuel (1959). All of these incremental algorithms, however, have the disadvantage that they create higher order in a stepwise, discrete manner. Further, once a unit has its order, it cannot change it unless the unit is deleted and a new one added. 
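The pi-sigma architecture just described can be sketched in a few lines (a hypothetical minimal implementation, not taken from any of the cited works):

```python
def pi_sigma(x, weights, thresholds):
    """Pi-sigma network: a hidden layer of summing units feeding a single
    product ("pi") output unit.  For k hidden units and n inputs there are
    only k*(n+1) adjustable parameters, i.e., linear in the order k."""
    product = 1.0
    for w_row, t in zip(weights, thresholds):
        s = sum(w * xi for w, xi in zip(w_row, x)) + t  # one summing unit
        product *= s                                    # multiplied at the output
    return product

# A degree-2 polynomial in two inputs realized with 2*(2+1) = 6 parameters:
print(pi_sigma([2.0, 3.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # 2 * 3 = 6.0
```

The output is a product of k affine forms, hence a polynomial of degree k in the inputs, which is why the parameter count grows only linearly with the order.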
This problem is overcome by the product units of Durbin and Rumelhart (1989), where high order appears in the form of real-valued, adjustable weights that are exponents of the input variables. Thus, a product unit is able to learn any higher-order term. Moreover, its computational power is significantly larger than that of a monomial because the exponents can be nonintegral and even negative. In further studies by Janson and Frenzel (1993), Leerink et al. (1995a, 1995b), and Ismail and Engelbrecht (2000), product unit networks are found to be computationally more powerful than sigmoidal networks in many learning applications. In particular, it is shown that they can solve many well-studied problems using fewer neurons than networks with summing units. Besides the artificial neural network learning methods backpropagation and cascade correlation, more global optimization algorithms such as genetic algorithms, simulated annealing, and random search techniques are considered. A considerable volume of research has been concerned with the encoding and learning of formal languages in higher-order recurrent networks. Second-order networks in particular have proven to be useful for learning finite-state automata and recognizing regular languages (Giles, Sun, Chen, Lee, & Chen, 1990; Pollack, 1991; Giles et al., 1992; Watrous & Kuhn, 1992; Omlin & Giles, 1996a, 1996b). Higher-order units have also been employed
as computational elements in Boltzmann machines and Hopfield networks in order to enlarge the capabilities and overcome the limitations of the first-order versions of these network models (Lee et al., 1986; Sejnowski, 1986; Psaltis, Park, & Hong, 1988; Venkatesh & Baldi, 1991a, 1991b; Burshtein, 1998). Results on storage capacities and learning curves for single higher-order units are established by Yoon and Oh (1998) using an approach from statistical mechanics. 3 Vapnik-Chervonenkis and Pseudo-Dimension As the main subjects of study, we now introduce the tools for assessing the computational and learning capabilities of multiplicative neural networks: the VC dimension and the pseudo-dimension. We give the definition of the VC dimension in section 3.1 and exhibit as a helpful property that if the output node of a network is a summing unit, then it does not matter for the VC dimension which one it is. We also postulate a condition for the parameter domain of product units that avoids the problem of having to deal with complex numbers. In contrast to the VC dimension, the pseudo-dimension takes into account that neural networks in general deliver output values that are not necessarily binary. Thus, the pseudo-dimension appears suitable for networks that employ as output node a real-valued unit such as the linear or the sigmoidal unit, the monomial or the product unit. Nevertheless, after introducing the pseudo-dimension in section 3.2, we shall point out that the pseudo-dimension of a neural network, being an upper bound on its VC dimension, can also almost serve as a lower bound, since it can be considered as the VC dimension of a slightly augmented network. In section 3.3 we give a brief review of VC dimension bounds for neural networks that have been previously established in the literature. The bounds concern networks with summing and multiplicative units and are the starting point for the results that will be derived in the subsequent sections. 
3.1 Vapnik-Chervonenkis Dimension. Before we can give a formal definition of the VC dimension, we require some additional terms. A partition of a set S ⊆ R^n into two disjoint subsets (S_0, S_1) is called a dichotomy of S. A function f: R^n → {0, 1} is said to induce the dichotomy (S_0, S_1) of S if f satisfies f(S_0) ⊆ {0} and f(S_1) ⊆ {1}. More generally, if F is a class of functions mapping R^n to {0, 1}, then F induces the dichotomy (S_0, S_1) if there is some f ∈ F that induces (S_0, S_1). Further, the class F shatters S if F induces all possible dichotomies of S. Definition. The Vapnik-Chervonenkis (VC) dimension of a class F of functions that map R^n to {0, 1}, denoted VCdim(F), is the cardinality of the
largest set shattered by F. If F shatters arbitrarily large sets, then the VC dimension of F is infinite. The definition applies to function classes, but it is straightforwardly transferred to neural networks. Consider a network having connections and nodes labeled with programmable parameters, that is, weights and thresholds, respectively, and having a certain number, say n, of input nodes and one output node. If the output node is a threshold unit, there is a set of functions mapping R^n to {0, 1} naturally associated with this network: the set of functions obtained by assigning all possible values to the network parameters. Thus, the VC dimension of a neural network is well defined if the output node is a threshold unit. If the output node is not a threshold unit, the network functions are made {0, 1}-valued to comply with the definition of the VC dimension. A general convention is to compare the output of the network with some fixed threshold h, for instance, h = 1/2 in the case of a sigmoidal unit. This is also common in applications when networks with continuous output values are used for classification tasks. More specifically, the binary output value of the network is obtained by applying the function y ↦ sgn(y − h) to the output y of the real-valued network. Thus, h can be considered as an additional parameter of the network that may be chosen independently for every dichotomy. However, one can show that for a sigmoidal or a linear output node, the VC dimension does not rely on the specific values of this threshold. Moreover, it is also independent of the particular summing unit that is chosen. The following result makes this statement more precise. We omit its proof since it is easily obtained using scalings of the output weights and adjustments of the thresholds. Proposition 1. The VC dimension of a neural network remains the same regardless of which summing unit (threshold, sigmoidal, or linear) is chosen for the output node. 
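To make the notion of shattering concrete, the following sketch checks it by brute force for single threshold units on R^2 (the point set and the candidate weight grid are our own illustrative choices): three points in general position are shattered, matching the VC dimension n + 1 = 3 of a threshold unit on two inputs, while no set of four points can be.

```python
from itertools import product

def shatters(points, candidates):
    """True if the candidate threshold units (w1, w2, h) induce every
    dichotomy of `points` via x |-> [w1*x1 + w2*x2 >= h]."""
    induced = {tuple(int(w1 * x + w2 * y >= h) for (x, y) in points)
               for (w1, w2, h) in candidates}
    return len(induced) == 2 ** len(points)

grid = list(product((-1, 0, 1), (-1, 0, 1), (-0.5, 0.5)))
print(shatters([(0, 0), (1, 0), (0, 1)], grid))          # True: all 8 dichotomies induced
print(shatters([(0, 0), (1, 0), (0, 1), (1, 1)], grid))  # False: the XOR labeling fails
```

The small weight grid suffices here only because the points were chosen conveniently; in general, deciding shattering requires checking linear separability of each labeling.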
According to this statement, we may henceforth assume without loss of generality that if the output node of a network is a summing unit, then it is a linear unit. Linear output nodes are commonly used, for instance, in neural networks constructed with the aim of approximating continuous functions (Pinkus, 1999). Beyond analyzing single networks, we also investigate the VC dimension of particular classes of networks. In this case, we refer to the VC dimension of a class of networks as the VC dimension of the class of functions obtained by taking the union of the function classes associated with each particular network. A problem arises with networks containing product units that receive negative inputs and have weights that are not integers. A negative number raised to some nonintegral power yields a complex number and has
no meaning in the reals. Since neural networks with complex outputs are rarely used in applications, Durbin and Rumelhart (1989) suggest a method for proceeding in this case. The idea is to discard the imaginary part and use only the real component for further processing. For Boolean inputs, this implies that the product unit becomes a summing unit that uses the cosine activation function. Although there are no problems reported from applications so far, this manipulation would have disastrous consequences for the VC dimension if it were extended to real-valued inputs. It is known that a summing unit with the sine or cosine activation function can shatter finite subsets of R of arbitrarily large cardinality (Sontag, 1992; Anthony & Bartlett, 1999). Therefore, no finite VC dimension bounds can in general be derived for networks containing such units. To avoid these problems arising from negative inputs in combination with fractional weights, we shall require that the following condition on the parameter domain of product units is always satisfied in the networks we consider. Condition. If an input x_i of a product unit is negative, the corresponding weight w_i is an integer. This presupposition guarantees that there are no complex outputs resulting from product units, and it still permits viewing the product unit as a generalization of the monomial and, hence, networks containing product units as generalizations of networks with higher-order units. One of the main results in this article will be that networks with product units, where inputs and parameters satisfy the above condition, have a VC and pseudo-dimension that is a low-degree polynomial in the number of network parameters and the network size. Because product units comprise monomials, this will then also lead to a similar bound for higher-order networks. 3.2 Pseudo-Dimension. 
Besides the VC dimension, several other combinatorial measures have been considered in the literature for the characterization of the variety of, in particular, a real-valued function class. The most relevant is the pseudo-dimension, which is also well known for providing bounds on the number of training examples in models of learning (Anthony & Bartlett, 1999). Definition. Let F be a class of functions that map R^n to R. The pseudo-dimension of F, denoted Pdim(F), is the VC dimension of the class {g: R^{n+1} → {0, 1} | there is some f ∈ F such that for all x ∈ R^n and y ∈ R: g(x, y) = sgn(f(x) − y)}. The pseudo-dimension of a neural network is then defined as the pseudo-dimension of the class of functions computed by this network. Clearly, the VC dimension of a network is not larger than its pseudo-dimension. The
pseudo-dimension and the VC dimension of a network with a summing unit as output node are even more closely related. In particular, if there is an upper bound on the VC dimension of such a network in terms of the number of weights and computation nodes, then one can also obtain an upper bound on the pseudo-dimension. We shall show this now in a statement that slightly improves theorems 14.1 and 14.2 of Anthony and Bartlett (1999) if the output node is a summing unit. The latter results require two more connections and one more computation node, whereas the following manages with the same number of computation nodes and only one additional connection. Proposition 2. Let N be a neural network having as output node a summing unit (threshold, sigmoidal, or linear). If the output node is a threshold unit, then Pdim(N) = VCdim(N). If the output node is a sigmoidal or a linear unit, then there is a network N′ with the same computation nodes as N, one more input node, and one more connection such that Pdim(N) = VCdim(N′). Proof. Suppose N has n input nodes and let the function class G be defined by

G = {g: R^{n+1} → {0, 1} | there is some f computed by N such that for all x ∈ R^n and y ∈ R: g(x, y) = sgn(f(x) − y)},

according to the definition of the pseudo-dimension of N. If the output node of N is a threshold unit, then it can easily be seen that for s_1, ..., s_m ∈ R^n and u_1, ..., u_m ∈ R, G shatters {(s_1, u_1), ..., (s_m, u_m)} if and only if G shatters {(s_1, 1/2), ..., (s_m, 1/2)}. Further, the latter set is shattered by G if and only if the set {s_1, ..., s_m} is shattered by N. Thus, Pdim(N) = VCdim(N). If the output node is linear or sigmoidal, we construct N′ by adding to N a new input node for the variable y and a connection with weight −1 from this input node to the output node. Due to proposition 1, we may employ without loss of generality a linear unit for the output node of N′. Clearly, if N has a linear output node, then the set {(s_1, u_1), ..., (s_m, u_m)} is shattered by G if and only if the set {(s_1, u_1), ..., (s_m, u_m)} is shattered by N′. Further, if N has a sigmoidal output node, then the set {(s_1, u_1), ..., (s_m, u_m)} is shattered by G if and only if the set {(s_1, σ^{−1}(u_1)), ..., (s_m, σ^{−1}(u_m))} is shattered by N′. 3.3 Known Bounds on the VC Dimension of Neural Networks. We give a short survey of some previous results on the VC dimension of neural networks. Most of the bounds are for networks consisting solely of summing units. Since a summing unit can be considered as a multiplicative unit with order restricted to one, special cases of these results are relevant for the networks considered in this article. Further, some of the results hold for
polynomial activation functions of restricted order and are given in terms of a bound on this order. It is therefore interesting to see how these bounds are related to the order-independent bounds derived here. We quote the results for the most part in their asymptotic form. The reader may find constants in the cited references or in Anthony and Bartlett (1999). The bounds can be divided into three categories: for single units, networks, and classes of networks. The VC dimension of a summing unit is known to be n + 1, where n is the number of variables, and is hence equal to the number of adjustable parameters. This holds for the threshold, sigmoidal, and linear unit. The proof of this fact goes back to a classical result by Schläfli (1901). The pseudo-dimension of these units has also been determined exactly, being n + 1 for each of them as well (see, e.g., Anthony & Bartlett, 1999). There is a large body of results on the VC dimension of neural networks that takes into account various unit types and network architectures. An upper bound for networks of threshold units is calculated by Baum and Haussler (1989). They show that a network with k computation nodes, which are all threshold units, and a total number of W weights has VC dimension O(W log k). This bound can also be derived from an earlier result of Cover (1968). Its tightness in the case of threshold units is established by two separate works. Sakurai (1993) shows that the architecture with one hidden layer has VC dimension Ω(W log k) on real-valued inputs. Maass (1994) provides an architecture with two hidden layers respecting the bound Ω(W log W) on Boolean inputs. There are also some bounds known for networks using more powerful unit types. 
A result of Goldberg and Jerrum (1995), which was independently obtained by Ben-David and Lindenbaum (1998), shows that neural networks employing as computation nodes higher-order units with an order bounded by some fixed value have VC dimension O(W^2), where W is the number of weights. This bound holds even if the activation functions of the units are piecewise polynomial functions with an order bounded by a constant.⁴ Koiran and Sontag (1997) construct networks of threshold and linear units with VC dimension Ω(W^2), thus showing that the quadratic upper bound of Goldberg and Jerrum (1995) is asymptotically optimal. Networks with depth restrictions are considered by Bartlett, Maiorov, and Meir (1998). They establish the bound O(W log W) for networks with piecewise polynomial activation functions, a fixed number of layers, and
⁴ For polynomial neural networks, Goldberg and Jerrum (1995) derive no explicit bound in terms of the order. The bound O(W^2) for neural networks is obtained via an upper bound on the time needed by an algorithm to compute the output value of the network. This latter bound is linear in the running time of the algorithm (more precisely, O(Wt), if the number of computation steps is bounded by t). Since the algorithm may multiply two real numbers in constant time, the time required for the evaluation of a polynomial grows linearly in the order of the polynomial. These considerations lead to an upper bound that is linear in the order (see also Anthony & Bartlett, 1999, theorem 8.7).
polynomials of bounded order. Expressed in terms of the number L of layers and the bound d on the order, their result is O(WL log W + WL^2 log d). A similar upper bound is obtained by Sakurai (1999). Bartlett et al. (1998) also show that Ω(WL) is a lower bound for piecewise polynomial networks with W weights and L layers. An upper bound for sigmoidal networks is due to Karpinski and Macintyre (1997), who show that higher-order sigmoidal networks with W weights and k computation nodes of order at most d have VC dimension O(W^2 k^2 + Wk log d). Koiran and Sontag (1997) establish Ω(W^2) as a lower bound for sigmoidal networks. Finally, some authors consider classes of networks. Hancock, Golea, and Marchand (1994) show that the so-called class of nonoverlapping neural networks with n input nodes and consisting of threshold units has VC dimension O(n log n). In a nonoverlapping network, all nodes, computation as well as input nodes, have fan-out at most one. That this bound is asymptotically tight for this class of networks is shown by Schmitt (1999), who establishes the lower bound Ω(n log n). Classes of higher-order units with order restrictions are studied by Anthony (1995). He shows that the VC dimension of the class of higher-order units with n inputs and order not larger than d is equal to the binomial coefficient C(n+d, d). Hence, this VC dimension respects the upper and lower bound Θ(n^d). Karpinski and Werther (1993) consider classes of higher-order units with one input node, or univariate polynomials, with a restriction on the fan-in of the output node, or equivalently, the number of monomials. They show that the class of higher-order units having one input node and at most k monomials has VC dimension Θ(k). It is noteworthy that this result allows higher-order units of arbitrary order and does not depend on a bound for this order. 
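The relative sizes of the surveyed bounds can be made tangible by evaluating their asymptotic forms for one example network; the figures below are orders of magnitude only, since the hidden constants are omitted and the network sizes are assumed values of our own choosing.

```python
from math import comb, log2

# Illustrative network sizes (assumed, not from the text):
W, k, n, d, L = 100, 10, 20, 3, 2

print("threshold networks, O(W log k):", W * log2(k))
print("bounded-order units, O(W^2):", W ** 2)
print("depth-L piecewise polynomial, O(WL log W + WL^2 log d):",
      W * L * log2(W) + W * L ** 2 * log2(d))
print("class of higher-order units, order <= d, C(n+d, d):", comb(n + d, d))
```

The comparison illustrates the gap discussed in the text: the order-restricted class bound C(n+d, d) is exact, while the network bounds trade generality for looser growth in W.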
4 Upper Bounds In the following, we derive upper bounds on the VC dimension and the pseudo-dimension of various feedforward networks with multiplicative units in terms of the number of parameters and the network size. We begin in section 4.1 by considering networks with one hidden layer of product units and a summing unit as output node. Then bounds for arbitrary networks, where each node may be a product or a sigmoidal unit, are established in section 4.2. These results are employed in section 4.3 to determine bounds for higher-order sigmoidal networks. In section 4.4 we consider single higher-order units, and finally, in section 4.5 we focus on product units and monomials. 4.1 Product Unit Networks with One Hidden Layer. We consider first the most widely used type of architecture, which has one hidden layer of computation nodes, such as shown in Figure 1. Throughout this section, we assume that the output node is a summing unit. The use of this class of
networks is theoretically justified by the so-called universal approximation property. Results from approximation theory show that networks with one hidden layer of product or sigmoidal units and a linear unit as output node are dense in the set of continuous functions and hence can approximate any such function arbitrarily well (see, e.g., Leshno, Lin, Pinkus, & Schocken, 1993; Pinkus, 1999). We recall from the condition on the parameter domain of product units stated in section 3.1 that for product units to yield output values in the reals, the input domain is restricted such that for a nonintegral weight, the corresponding input value is from the set R_0^+ = {x ∈ R: x ≥ 0}. There is, however, still a problem when a product unit receives the value 0, and this input is weighted by a negative number. Then the output value of the unit is undefined. Forbidding 0 as an input value in general is certainly too restrictive and would go against many learning applications. Therefore, we use a default value in case a unit raises the input value 0 to some negative power and say that the output value of such a unit is 0.⁵ With these agreements, the following can be shown. (Here and in subsequent formulas, we use "log" to denote the logarithm of base 2.) Theorem 1. Suppose that N is a neural network with one hidden layer consisting of k product units, and let W be the total number of parameters (weights and threshold). Then N has pseudo-dimension at most (Wk)^2 + 8Wk log(13Wk). The main ideas of the proof are first to derive from N and an arbitrary set S of input vectors a set of exponential polynomials in the parameter variables of N. The next step is to consider the connected components of the parameter domain that arise from the zero sets of these polynomials. A connected component (also referred to as a cell) of a set D ⊆ R^d is a maximal nonempty subset C of D such that any two points of C are connected by a continuous curve that lies entirely in C. 
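For a feel of the magnitudes involved, the bound of theorem 1 can be evaluated directly; the network sizes below are assumed for illustration, and the helper name is ours.

```python
from math import log2

def theorem1_bound(W, k):
    """Pseudo-dimension bound of theorem 1 for one hidden layer of k product
    units and W parameters in total (log is base 2, as in the text)."""
    return (W * k) ** 2 + 8 * W * k * log2(13 * W * k)

# Example: 5 inputs, k = 3 product units, so W = 5*3 hidden weights
# + 3 output weights + 1 threshold = 19 parameters.
print(theorem1_bound(19, 3))
```

Even for this tiny network, the bound is dominated by the quadratic term (Wk)^2, consistent with the low-degree polynomial growth the theorem promises.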
The polynomials are constructed in such a way that the number of dichotomies that N induces on S is not larger than the number of connected components generated by these polynomials. Thus, a bound on the number of connected components also limits the number of dichotomies. The basic tools for this calculation are provided by Karpinski and Macintyre (1997), who combined the work of Warren (1968) and Khovanskiĭ (1991) to obtain for the first time polynomial bounds on the ⁵ Another way would be to introduce three-valued logic such that the output of a network that divides by 0 is some value "undefined," that is, neither 0 nor 1. This then leads to a different notion of dimension that takes multiple-valued outputs into account. Ben-David, Cesa-Bianchi, Haussler, and Long (1995) study such dimensions and their relevance for learning. In particular, they show that a large variety of dimensions for multiple-valued functions are closely related in that they differ from each other at most by a constant factor. Therefore, should the three-valued approach be preferred, the bounds derived here can also be used to obtain estimates for these generalized dimensions.
VC dimension of sigmoidal neural networks. These tools are explicated and further developed in Anthony and Bartlett (1999). We introduce two definitions from this book for sets of functions: the property to have regular zero-set intersections and the solution set components bound (definitions 7.4 and 7.5 of Anthony & Bartlett, 1999). A set {f_1, ..., f_k} of differentiable real-valued functions on R^d is said to have regular zero-set intersections if for every nonempty set {i_1, ..., i_l} ⊆ {1, ..., k}, the Jacobian (i.e., the matrix of the partial derivatives) of (f_{i_1}, ..., f_{i_l}): R^d → R^l has rank l at every point of the set {a ∈ R^d : f_{i_1}(a) = ··· = f_{i_l}(a) = 0}. A class G of real-valued functions defined on R^d has solution set components bound B if for every k ∈ {1, ..., d} and every {f_1, ..., f_k} ⊆ G that has regular zero-set intersections, the number of connected components of the set {a ∈ R^d : f_1(a) = ··· = f_k(a) = 0} is at most B. The proofs for the three subsequent lemmas can be found in the appendix. Lemma 1. Let G be the class of polynomials of degree at most p in the variables y_1, ..., y_d and in the exponentials e^{g_1}, ..., e^{g_q}, where g_1, ..., g_q are fixed affine functions in y_1, ..., y_d. Then G has solution set components bound

B = 2^{q(q−1)/2} [p(p+1)d + p]^d [(p+1)d(d+1) + 1]^q.

Lemma 2. Let q be a natural number and suppose G is the class of real-valued functions in the variables y_1, ..., y_d satisfying the following condition: for every f ∈ G there exist affine functions g_1, ..., g_r, where r ≤ q, in the variables y_1, ..., y_d such that f is an affine combination of y_1, ..., y_d and e^{g_1}, ..., e^{g_r}. Then G has solution set components bound

B = 2^{dq(dq−1)/2} [2(dq + d) + 1]^{dq+d} [2(dq + d)(dq + d + 1) + 1]^{dq}.
A class F of real-valued functions is said to be closed under addition of constants if for every c ∈ R and f ∈ F the function z ↦ f(z) + c is a member of F. The following result gives a stronger formulation of a bound stated in theorem 7.6 of Anthony and Bartlett (1999). Lemma 3. Let F be a class of real-valued functions (y_1, ..., y_d, x_1, ..., x_n) ↦ f(y_1, ..., y_d, x_1, ..., x_n) that is closed under addition of constants and where each function in F is C^d in the variables y_1, ..., y_d. If the class G = {(y_1, ..., y_d) ↦ f(y_1, ..., y_d, s): f ∈ F, s ∈ R^n} has solution set components bound B, then
for any sets {f_1, ..., f_k} ⊆ F and {s_1, ..., s_m} ⊆ R^n, where m ≥ d/k, the set T ⊆ {0, 1}^{mk} defined as

T = {(sgn(f_1(a, s_1)), ..., sgn(f_1(a, s_m)), ..., sgn(f_k(a, s_1)), ..., sgn(f_k(a, s_m))): a ∈ R^d}

satisfies

|T| ≤ B · Σ_{i=0}^{d} C(mk, i) ≤ B (emk/d)^d,

where C(mk, i) denotes the binomial coefficient.
Now we have all that is required for the proof of theorem 1. Proof of Theorem 1. Assume that N is given as supposed, having k hidden product units and W parameters. Denote the weights of the hidden nodes by w_{i,j}, and let v_0, v_1, ..., v_k be the weights of the output node, where v_0 is the threshold. We first use an idea from Karpinski and Macintyre (1997) and divide for each i ∈ {1, ..., k} the functions computed by N into three categories corresponding to the sign of the parameter v_i, that is, depending on whether v_i < 0, v_i = 0, or v_i > 0. This results in a total of 3^k categories for all v_1, ..., v_k. We consider each category separately and introduce new variables v′_1, ..., v′_k for the weights of the output node by defining

v′_i = ln v_i if v_i > 0, 0 if v_i = 0, ln(−v_i) if v_i < 0,

for i = 1, ..., k. Thus, within this category, we can write the computation of N on some real input vector s as a function of the network parameters in the form

(v_0, v′, w) ↦ v_0 + b_1 e^{v′_1} s_{1,1}^{w_{1,1}} ··· s_{1,p_1}^{w_{1,p_1}} + ··· + b_k e^{v′_k} s_{k,1}^{w_{k,1}} ··· s_{k,p_k}^{w_{k,p_k}},

where the s_{i,j} are components of the input vector and b_i ∈ {0, 1} is defined by

b_i = 0 if v_i = 0 or s_{i,j} = 0 for some j ∈ {1, ..., p_i}, and b_i = 1 otherwise,

for i = 1, ..., k. Note that if s_{i,j} = 0 and w_{i,j} < 0 for some i, j, we define the output of the affected product unit to be 0. Thus, we may assume that s_{i,j} ≠ 0 for all i, j without loss of generality. Recall further that, according to the condition on the parameter domain of product units stated in section 3.1, for s_{i,j} < 0 we have required w_{i,j} to be an integer. In this case, however, the
sign of s_{i,j} can have an effect only when the weight is odd. And this effect on the product unit consists of only a possible change of the sign of its output value. Since the sign of s_{i,j} is not known, we consider the worst case and assume that each input vector generates all possible signs at the outputs of the product units. Therefore, we can restrict the input vectors to the positive reals if we take note of the fact that each input vector gives rise to at most 2^k functions of the form

(v_0, v′, w) ↦ v_0 ± b_1 e^{v′_1} s_{1,1}^{w_{1,1}} ··· s_{1,p_1}^{w_{1,p_1}} ± ··· ± b_k e^{v′_k} s_{k,1}^{w_{k,1}} ··· s_{k,p_k}^{w_{k,p_k}}.

For positive input values, these functions can be rewritten as

(v_0, v′, w) ↦ v_0 ± b_1 exp(v′_1 + w_{1,1} ln s_{1,1} + ··· + w_{1,p_1} ln s_{1,p_1}) ± ··· ± b_k exp(v′_k + w_{k,1} ln s_{k,1} + ··· + w_{k,p_k} ln s_{k,p_k}).

To obtain a bound on the pseudo-dimension, we want to estimate the number of dichotomies that are induced on some arbitrary set {(s_1, u_1), ..., (s_m, u_m)}, where s_1, ..., s_m are input vectors for N and u_1, ..., u_m are real numbers, by functions of the form (x, z) ↦ sgn(f(x) − z), where f is computed by N. We do this for each of the categories defined above separately. Thus, within one such category, the number of dichotomies induced is at most as large as the cardinality of the set T ⊆ {0, 1}^{m2^k} satisfying

T = {(sgn(f_1(a, s′_1, u_1)), ..., sgn(f_1(a, s′_m, u_m)), ..., sgn(f_{2^k}(a, s′_1, u_1)), ..., sgn(f_{2^k}(a, s′_m, u_m))): a ∈ R^W}

for real vectors s′_1, ..., s′_m and functions f_1, ..., f_{2^k}, where each of these functions has the form

(y, x, z) ↦ c_0 + y_0 + c_1 exp(y_1 + y_{1,1} x_{1,1} + ··· + y_{1,p_1} x_{1,p_1}) + ··· + c_k exp(y_k + y_{k,1} x_{k,1} + ··· + y_{k,p_k} x_{k,p_k}) − z

for real numbers c_0, c_1, ..., c_k with c_0 = 0 and c_i ∈ {−1, 0, 1} for i = 1, ..., k. The variables y_i and y_{i,j} play the role of the network parameters, the x_{i,j} are the input variables receiving values s′_{i,j} = ln s_{i,j}, and z is the input variable for u_1, ..., u_m. Let F denote the class of these functions arising for arbitrary real numbers c_0, c_1, ..., c_k. We have introduced c_0 in particular to make this class F closed under addition of constants. Now, for the vectors s′_1, ..., s′_m and the real numbers u_1, ..., u_m, we consider the function class G = {y ↦ f(y, s′_i, u_i): f ∈ F, i = 1, ..., m}. Clearly, every element of G is an affine combination of W variables and k exponentials of affine functions in these variables. According to lemma 2, G has solution set components bound

B = 2^{Wk(Wk−1)/2} [2(Wk + W) + 1]^{Wk+W} × [2(Wk + W)(Wk + W + 1) + 1]^{Wk}.  (4.1)
Multiplicative Neural Networks
Since F is closed under addition of constants, we have from lemma 3 that |T| ≤ B(em2^k/W)^W, which is by the construction of T an upper bound on the number of dichotomies that are induced on any set of m vectors {(s_1, u_1), ..., (s_m, u_m)}. Since this bound is derived for network parameters chosen within one category, we obtain an upper bound for all parameter values by multiplying by the number of categories, which is 3^k. This yields the bound

|T| ≤ B(em2^k/W)^W 3^k.
(4.2)
If there is a set of m vectors that is shattered, then all 2^m dichotomies of this set must be induced. Since |T| is an upper bound on this number, this implies

m ≤ log B + W log(em2^k/W) + k log 3.

From the well-known inequality ln a ≤ ab + ln(1/b) − 1, which holds for all a, b > 0 (see, e.g., Anthony & Bartlett, 1999, appendix A.1.1), we obtain for a = m and b = (ln 2)/(2W) that W log m ≤ m/2 + W log(2W/(e ln 2)). Using this in the above inequality, we get

m ≤ log B + m/2 + W log(2^{k+1}/ln 2) + k log 3,

which is equivalent to

m ≤ 2 log B + 2Wk + 2W log(2/ln 2) + 2k log 3.

Substitution of the value for B yields

m ≤ Wk(Wk − 1) + 2(Wk + W) log[2(Wk + W) + 1]
  + 2Wk log[2(Wk + W)(Wk + W + 1) + 1]
  + 2Wk + 2W log(2/ln 2) + 2k log 3.
Simplifying and rearranging, we obtain

m ≤ (Wk)^2 + (2Wk + 2W) log[2(Wk + W) + 1]
  + 2Wk log[2(Wk + W)(Wk + W + 1) + 1]
  + Wk + 2W log(2/ln 2) + 2k log 3.
The second and the third line together are less than or equal to

2Wk(log[13(Wk)^2] + 1/2 + log(2/ln 2) + log 3),
Michael Schmitt
which is equal to 2Wk log[78√2(Wk)^2/ln 2] and hence less than 4Wk log(13Wk). The last term in the first line is at most 4Wk log(5Wk). Using these relations, we arrive at m < (Wk)^2 + 4Wk log(5Wk) + 4Wk log(13Wk) and hence at m < (Wk)^2 + 8Wk log(13Wk), which completes the proof of the theorem.

The constants in the bound of theorem 1 can be made slightly smaller, and one can get rid of the dependency on k of the term in the logarithm. Karpinski and Macintyre (1997) show by means of a more direct application of the method of Warren (1968) that for obtaining a solution set components bound for the class of functions considered here, it is not necessary to introduce additional parameters as intermediate variables as we have done in lemma 2.^6 Instead, it is sufficient to employ the bound provided by lemma 1 for affine functions in W variables and Wk fixed exponentials. This yields the improved solution set components bound

B = 2^{Wk(Wk−1)/2} [2W + 1]^W [2W(W + 1) + 1]^{Wk},
(4.3)
which can be used in the proof of theorem 1 in place of equation (4.1) to derive the following improved version of this theorem.

Corollary 1. Let N be a neural network with one hidden layer of k product units and W parameters. Then N has pseudo-dimension at most (Wk)^2 + 6Wk log(8W).

Proof. Using solution set components bound (4.3) in the proof of theorem 1, we get

m ≤ (Wk)^2 + 2W log(2W + 1)
  + 2Wk log[2W(W + 1) + 1] + Wk + 2W log(2/ln 2) + 2k log 3.

The last line is less than 4Wk log(8W), and the last term of the first line is at most 2W log(3W). This implies m < (Wk)^2 + 6Wk log(8W).

^6 Theorem 7.6 in Anthony and Bartlett (1999), which also builds on the method of Warren (1968), provides a more general method suitable for any function class that satisfies the conditions of lemma 3. In particular, it applies to networks with more than one hidden layer. For such networks, the direct method of Karpinski and Macintyre (1997) bears no advantage. The latter method also deals with these networks by introducing intermediate variables. We have chosen to employ the result of Anthony and Bartlett (1999) in the proof of lemma 3 because it is more general and makes the proof of theorem 1 shorter and easier to read.
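To get a feel for the sizes involved, the two bounds can be evaluated numerically. The following is an illustrative sketch (the function names are ours); log denotes the base-2 logarithm, as in the text.

```python
import math

def bound_theorem1(W, k):
    # (Wk)^2 + 8Wk log(13Wk), the bound of theorem 1
    return (W * k) ** 2 + 8 * W * k * math.log2(13 * W * k)

def bound_corollary1(W, k):
    # (Wk)^2 + 6Wk log(8W), the improved bound of corollary 1
    return (W * k) ** 2 + 6 * W * k * math.log2(8 * W)

# The (Wk)^2 term dominates; the corollary removes k from the logarithm.
for W, k in [(10, 5), (100, 20)]:
    print(W, k, round(bound_theorem1(W, k)), round(bound_corollary1(W, k)))
```

For these sample values, the corollary's bound is smaller, though both are dominated by the quadratic term.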
One of the constants can be further improved for networks that operate in the nonnegative domain only. The improvement is marginal, but the result demonstrates how the input restriction affects the calculation of the bound.

Corollary 2. Let N be a neural network with one hidden layer of k product units and W parameters. If the inputs for N are restricted to nonnegative real numbers, then N has pseudo-dimension at most (Wk)^2 + 6Wk log(6W).

Proof. If all inputs are positive, then no sign changes have to be taken into account for the output values of the product units. Therefore, we can use in the proof of theorem 1 that each input vector gives rise to 1 instead of 2^k functions. This gives the new bound |T| ≤ B(em/W)^W 3^k for inequality (4.2). Using solution set components bound (4.3) yields

m ≤ (Wk)^2 + 2W log(2W + 1)
  + 2Wk log[2W(W + 1) + 1] − Wk + 2W log(2/ln 2) + 2k log 3.

The last line is less than 4Wk log(6W), implying m < (Wk)^2 + 6Wk log(6W).

The networks considered thus far in this section have in common a rigid architecture with a prescribed number of hidden nodes. In some learning applications, however, it is customary not to fix the architecture in advance but to let the networks grow. In this case, a variety of networks can result from the learning algorithm. It might be possible to accommodate all these networks in a single large network so that a bound for the VC dimension of the class of networks is obtained in terms of a bound for the large network. Often, however, better bounds can be derived if one takes into account the constraint that underlies the growth of the network. In the following, we assume that this growth is limited by a bound on the fan-out of the input nodes. Such networks with sparse connectivity have been suggested, for instance, by Lee et al. (1986) and Hancock et al. (1994).

Corollary 3. Let C be the class of networks with one hidden layer of product units and n input nodes where every input node has fan-out at most l.
Then C has pseudo-dimension at most 4(nl)^4 + 18(nl)^2 log(23nl).

Proof. The number of networks in C is not larger than (nl)^{nl} since there are at most this many ways to assign nl connections to nl hidden units. Each network has at most 2nl + 1 parameters, where nl are for connections leading to hidden nodes and nl + 1 are for the output node. Thus, with r = nl, the number of dichotomies induced by C, or more precisely, by functions of the form (x, z) ↦ sgn(f(x) − z) where f is computed by C, on a set of cardinality m is at most r^r times the number of dichotomies induced by a network with r hidden nodes and 2r + 1 parameters. Using solution set
components bound (4.3), we get from the proof of corollary 1 with k = r and W = 2r + 1,

m ≤ (r(2r + 1))^2 + 2(2r + 1) log(2(2r + 1) + 1)
  + 2r(2r + 1) log[2(2r + 1)(2r + 2) + 1]
  + r(2r + 1) + 2(2r + 1) log(2/ln 2) + 2r log 3 + r log r,
where the last term is due to the factor r^r. From this we obtain m < 4r^4 + 6r log(7r) + 12r^2 log(23r), and hence m < 4r^4 + 18r^2 log(23r). Resubstituting r = nl yields the claimed result.

4.2 Networks with Product and Sigmoidal Units. We now consider feedforward architectures with an arbitrary number of layers. The networks are nonhomogeneous in that each node may be a product or a sigmoidal unit independent of the other nodes. Pseudo-dimension bounds for pure product unit networks with the output node being a summing unit are known from the previous section. We noted in section 2.2.2 that a network of product units only is equivalent to a single product unit. A bound for the single product unit will be given in section 4.5. Also, bounds for networks consisting solely of sigmoidal units have been established by Karpinski and Macintyre (1997) and Anthony and Bartlett (1999). In the following, we calculate bounds for networks containing both unit types.

Theorem 2. Suppose N is a feedforward neural network with k computation nodes and W parameters, where each computation node is a sigmoidal or a product unit. Then the pseudo-dimension of N is at most 4(Wk)^2 + 20Wk log(36Wk).

Before proving this, we determine a solution set components bound for classes of functions arising from feedforward networks. The following result, considering arbitrarily many layers, corresponds to lemma 2, which was for networks with one hidden layer only. The proof is given in the appendix.

Lemma 4. Let G be the class of real-valued functions in d variables computed by a network with r computation nodes where q of these nodes compute one of the functions a ↦ c + 1/(1 + e^{−a}), a ↦ c ± e^a, or a ↦ ln a for some arbitrary constant c, and r − q nodes compute some polynomial of degree 2. Then G has solution set components bound

B = 2^{dq(dq−1)/2} [6(2dr + d) + 2]^{2dr+d} [3(2dr + d)(2dr + d + 1) + 1]^{dq}.
Besides the above solution set components bound, the following proof uses lemma 3 from the appendix.
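Since the bound B of lemma 4 is astronomically large for any nontrivial network, it is more convenient to work with its logarithm when evaluating it. A minimal sketch (the function name is ours; log2 is the base-2 logarithm):

```python
import math

def log2_sscb(d, q, r):
    # log2 of B = 2^{dq(dq-1)/2} [6(2dr+d)+2]^{2dr+d}
    #             [3(2dr+d)(2dr+d+1)+1]^{dq}  from lemma 4
    e = 2 * d * r + d
    return (d * q * (d * q - 1) / 2
            + e * math.log2(6 * e + 2)
            + d * q * math.log2(3 * e * (e + 1) + 1))

# Sample evaluation with d = W, q = 2k - 1, r = 3k - 1,
# the values used in the proof of theorem 2 below.
W, k = 10, 5
print(round(log2_sscb(W, 2 * k - 1, 3 * k - 1)))
```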
Proof of Theorem 2. We show first that we can confine the argumentation to positive inputs. For some input vector s, consider all functions computed by N that arise when s is fed into the network and the signs of the parameters are varied in all possible ways. Treating input values 0 to product units as in the proof of theorem 1 and taking into account the at most 2^W functions thus generated by each input vector, we may henceforth assume without loss of generality that all input values to product units are positive real numbers. (A number of 2^k functions is in general not sufficient, as it was in the proof of theorem 1, since changing the sign of the output of some input unit, for instance, can modify the sign of an input to a sigmoidal unit. This cannot be compensated for by changing the sign of the sigmoidal unit, but by changing the sign of a weight.) Thus, the number of dichotomies that are induced by functions of the form (x, z) ↦ sgn(f(x) − z) with f being computed by N on some arbitrary set {(s_1, u_1), ..., (s_m, u_m)} with input vectors s_1, ..., s_m and real numbers u_1, ..., u_m is at most as large as the cardinality of the set T ⊆ {0, 1}^{m2^W} defined by

T = {(sgn(f_1(a, s'_1, u_1)), ..., sgn(f_1(a, s'_m, u_m)), ...
..., sgn(f_{2^W}(a, s'_1, u_1)), ..., sgn(f_{2^W}(a, s'_m, u_m))) : a ∈ R^W},
where s'_1, ..., s'_m are positive real vectors and f_1, ..., f_{2^W} are the functions arising from N after making the sign variations described above. We allow that any arbitrary constant may be added to these functions and use F to denote this class of functions (w, x, z) ↦ f(w, x, z) in the network parameters w and the input variables x and z. Clearly, then, F is closed under addition of constants. Consider the class G = {w ↦ f(w, s'_i, u_i) : f ∈ F, i = 1, ..., m} of functions in W variables. Every function in G can be computed by a network where each computation node computes one of the functions a ↦ c + 1/(1 + e^{−a}), a ↦ c ± e^a, a ↦ ln a, or some polynomial of degree 2. Here, c is some arbitrary constant that is required for the output nodes due to the closedness of F under addition of constants and also accommodates the subtraction of some u_i from the output. The "±" comes from the sign variation of the output of the product unit. The networks result as follows: Each product unit receives only positive inputs and can thus be written as

c_i ± exp(w_{i,1} ln v_{i,1} + ··· + w_{i,p_i} ln v_{i,p_i}),

and each sigmoidal unit has the form

c_i + 1/(1 + exp(−w_{i,1} v_{i,1} − ··· − w_{i,p_i} v_{i,p_i} + t_i)),

where the v_{i,j} are output values of other nodes. Therefore, each product unit can be decomposed into one function a ↦ c ± e^a, one polynomial of degree 2,
and functions a ↦ ln a applied to the outputs of other nodes. Further, each sigmoidal unit can be decomposed into one function a ↦ c + 1/(1 + e^{−a}) and one polynomial of degree 2. This leads to a network with at most k − 1 nodes computing a ↦ ln a (the logarithm of the output node is not needed), at most k nodes computing a polynomial of degree 2, and at most k nodes each of which computes either a ↦ c ± e^a or a ↦ c + 1/(1 + e^{−a}). In total, every function in G can be computed by a network with 3k − 1 computation nodes of which 2k − 1 nodes compute one of the functions a ↦ c + 1/(1 + e^{−a}), a ↦ c ± e^a, or a ↦ ln a. Lemma 4 with d = W, r = 3k − 1, and q = 2k − 1 shows that G has solution set components bound

B = 2^{W(2k−1)(W(2k−1)−1)/2} [6(2W(3k − 1) + W) + 2]^{2W(3k−1)+W}
  · [3(2W(3k − 1) + W)(2W(3k − 1) + W + 1) + 1]^{W(2k−1)}.

Since F is closed under addition of constants, it follows from lemma 3 that B(em2^W/W)^W is an upper bound on the cardinality of T and, thus, on the number of dichotomies induced on any set of cardinality m. If such a set is shattered, this implies m ≤ log B + W log(em2^W/W). Using W log m ≤ m/2 + W log(2W/(e ln 2)) (see the proof of theorem 1), we obtain m ≤ 2 log B + 2W^2 + 2W log(2/ln 2), and after substituting the value for B,

m ≤ W(2k − 1)(W(2k − 1) − 1)
  + (4W(3k − 1) + 2W) log[6(2W(3k − 1) + W) + 2]
  + 2W(2k − 1) log[3(2W(3k − 1) + W)(2W(3k − 1) + W + 1) + 1]
  + 2W^2 + 2W log(2/ln 2).
Simplifying and rearranging leads to

m ≤ 4(Wk)^2 − 4W^2 k + 3W^2 + (12Wk − 2W) log[6(6Wk − W) + 2]
  + 2W(2k − 1) log[3(6Wk − W)(6Wk − W + 1) + 1]
  − W(2k − 1) + 2W log(2/ln 2).
The last two lines together are less than 4Wk(log[109(Wk)^2] + log(2/ln 2)), which equals 4Wk log[218(Wk)^2/ln 2] and is less than 8Wk log(18Wk). The last three terms of the first line together are less than 12Wk log(36Wk). Thus, we may conclude that m < 4(Wk)^2 + 20Wk log(36Wk).
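Numerically, the price paid for arbitrary depth and mixed unit types is roughly a constant factor over the single-hidden-layer product-unit bound of theorem 1. An illustrative sketch (function names are ours; log2 is the base-2 logarithm):

```python
import math

def pdim_mixed(W, k):
    # 4(Wk)^2 + 20Wk log(36Wk), the bound of theorem 2
    return 4 * (W * k) ** 2 + 20 * W * k * math.log2(36 * W * k)

def pdim_product_layer(W, k):
    # (Wk)^2 + 8Wk log(13Wk), the bound of theorem 1
    return (W * k) ** 2 + 8 * W * k * math.log2(13 * W * k)

# The ratio approaches 4 as Wk grows, reflecting the factor of 4
# in the dominant (Wk)^2 term.
for W, k in [(20, 10), (200, 50)]:
    print(W, k, round(pdim_mixed(W, k) / pdim_product_layer(W, k), 2))
```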
The networks considered in theorem 2 have fixed units in the sense that it is determined in advance for a computation node whether it is to be a product or a sigmoidal unit. One can imagine situations in which the learning algorithm chooses the unit type for each node. Then the function class is no longer represented by a single network, but by a class of networks. An inspection of the proof of theorem 2 in this regard shows that its argumentation does not depend on knowing which unit type is given. Hence, the same bound holds if the unit type of each computation node is variable.

Corollary 4. Let C be a class of feedforward neural networks where each network has k computation nodes and W parameters, and where each computation node is a product unit or a sigmoidal unit. Then C has pseudo-dimension at most 4(Wk)^2 + 20Wk log(36Wk).

4.3 Higher-Order Sigmoidal Networks. A neural network consisting of higher-order sigmoidal units can be considered as a network of product and sigmoidal units where the weights of the product units are restricted to the nonnegative integers. Thus, an upper bound on the VC dimension or pseudo-dimension for a higher-order network is obtained by means of the same network with the monomials replaced by product units and the exponents considered as variables. We distinguish between two ways of viewing higher-order sigmoidal networks: the exponents of the monomials can be parameters of the network, or they can be fixed. We consider the latter case first, that is, when the exponents are not variable. We emphasize that we do not explicitly count the number of monomials in a higher-order network. For both cases, we obtain bounds that do not impose any restriction on the order of the monomials.

Theorem 3. Suppose N is a network with k computation nodes and W parameters where each computation node is a higher-order sigmoidal unit with fixed exponents. Then N has pseudo-dimension at most 36(Wk)^4 + 136(Wk)^2 log(12Wk).

Proof.
The parameters of a sigmoidal higher-order network are the weights and thresholds of the sigmoidal units only. Further, a sigmoidal higher-order network has the following properties: First, every input node and every nonoutput node that is a sigmoidal unit feeds its output only to monomials. Second, every sigmoidal unit receives its input only from monomials through parameterized connections. Thus, after replacing the monomials by product units, we can guide the proof analogously to that of theorem 2 with the following ingredients: Since sign variations of input vectors affect product units only and there are at most W product units, it suffices to consider as upper bound on the number of dichotomies that N induces via functions of the form (x, z) ↦ sgn(f(x) − z) on a set of cardinality m the cardinality of a set T with T ⊆ {0, 1}^{m2^W} defined as in the proof of theorem 2. The networks that give rise to the function class G have at most:

k nodes computing the function a ↦ c + 1/(1 + e^{−a}) (due to the sigmoidal units)

W nodes computing the function a ↦ c ± e^a (due to the product units)

k nodes computing the function a ↦ ln a (only logarithms of sigmoidal units are needed)

W + k nodes computing a polynomial of degree 2 (due to the product and sigmoidal units)

This yields networks having at most 2W + 3k ≤ 5Wk nodes of which up to W + 2k ≤ 3Wk compute a nonpolynomial function. Since each product unit, receiving inputs from sigmoidal units only, has at most k exponents, the functions in G have in total at most W + Wk ≤ 2Wk variables. Thus, with these assumptions, we get from lemma 4 using d = 2Wk, r = 5Wk, and q = 3Wk the solution set components bound

B = 2^{6(Wk)^2(6(Wk)^2−1)/2} [6(20(Wk)^2 + 2Wk) + 2]^{20(Wk)^2+2Wk}
  · [3(20(Wk)^2 + 2Wk)(20(Wk)^2 + 2Wk + 1) + 1]^{6(Wk)^2}.
As in the proof of theorem 2, we infer from lemma 3 using T ⊆ {0, 1}^{m2^W} that B(em2^W/(2Wk))^{2Wk} is an upper bound on the number of dichotomies induced by N on any set of cardinality m. Thus, if such a set is shattered, we have m ≤ log B + 2Wk log(em2^W/(2Wk)), from which we get m ≤ 2 log B + 4W^2 k + 4Wk log(2/ln 2) using 2Wk log m ≤ m/2 + 2Wk log(4Wk/(e ln 2)) (see the proof of theorem 1). With the above solution set components bound, this implies

m ≤ 6(Wk)^2(6(Wk)^2 − 1)
  + (40(Wk)^2 + 4Wk) log[6(20(Wk)^2 + 2Wk) + 2]
  + 12(Wk)^2 log[3(20(Wk)^2 + 2Wk)(20(Wk)^2 + 2Wk + 1) + 1]
  + 4W^2 k + 4Wk log(2/ln 2).
From this, we obtain

m ≤ 36(Wk)^4 + (40(Wk)^2 + 4Wk) log[6(20(Wk)^2 + 2Wk) + 2]
  + 12(Wk)^2 log[3(20(Wk)^2 + 2Wk)(20(Wk)^2 + 2Wk + 1) + 1]
  − 6(Wk)^2 + 4W^2 k + 4Wk log(2/ln 2).

The last two lines together are less than

12(Wk)^2 (log[1519(Wk)^4] + log(2/ln 2)),
which is equal to 12(Wk)^2 log[3038(Wk)^4/ln 2] and less than 48(Wk)^2 log(9Wk). Since the second term of the first line is less than 88(Wk)^2 log(12Wk), we get m < 36(Wk)^4 + 136(Wk)^2 log(12Wk) as claimed.

The bound we derive next concerns higher-order sigmoidal networks with variable exponents. As is to be expected, the bound is smaller since more parameters are counted.

Theorem 4. Suppose N is a higher-order sigmoidal network with k computation nodes and W parameters that include the exponents of the monomials. Then the pseudo-dimension of N is at most 9(W^2 k)^2 + 34W^2 k log(68W^2 k).

Proof. In comparison to the case with fixed exponents in theorem 3, the difference here is that the exponents of the monomials do not increase the number of parameters since they are already counted in W. Thus, lemma 4, with d = W, r = 5Wk, and q = 3Wk, provides solution set components bound

B = 2^{3W^2 k(3W^2 k−1)/2} [6(10W^2 k + W) + 2]^{10W^2 k+W}
  · [3(10W^2 k + W)(10W^2 k + W + 1) + 1]^{3W^2 k},

and lemma 3 yields B(em2^W/W)^W as upper bound on the number of dichotomies induced on a set of cardinality m. Assuming that such a set is shattered, we obtain m ≤ 2 log B + 2W^2 + 2W log(2/ln 2) similarly as in the proof of theorem 2. After substituting the above value for B, we get

m ≤ 3W^2 k(3W^2 k − 1) + (20W^2 k + 2W) log[6(10W^2 k + W) + 2]
  + 6W^2 k log[3(10W^2 k + W)(10W^2 k + W + 1) + 1]
  + 2W^2 + 2W log(2/ln 2),
which implies

m ≤ 9(W^2 k)^2 + (20W^2 k + 2W) log[6(10W^2 k + W) + 2]
  + 6W^2 k log[3(10W^2 k + W)(10W^2 k + W + 1) + 1]
  − 3W^2 k + 2W^2 + 2W log(2/ln 2).

The last two lines together are less than

6W^2 k(log[397W^4 k^2] + log(2/ln 2)),
which is equal to 6W^2 k log[794W^4 k^2/ln 2] and less than 12W^2 k log(34W^2 k). The second term of the first line is less than 22W^2 k log(68W^2 k). Thus, we have m < 9(W^2 k)^2 + 34W^2 k log(68W^2 k).

4.4 Single Higher-Order Units. Since a single unit can be viewed as a small network, bounds on the VC dimension and pseudo-dimension for product units and higher-order units can be obtained from previous sections. For particular cases, however, we shall establish significant improvements in the following. We look at single higher-order units and classes of these units first; then we consider single product units and classes of monomials.

We recall from section 2.2.2 the definition of a higher-order unit, which has the form w_1 M_1 + w_2 M_2 + ··· + w_k M_k − t, where M_1, ..., M_k are monomials. We also remember that the set {M_1, ..., M_k} is called the structure of the higher-order unit. We can view such a unit as a network with one hidden layer and a linear unit as output node. If a threshold or sigmoidal unit is employed for the output node, we have the higher-order variants of the threshold and sigmoidal unit, respectively, which were also defined in section 2.2.2. According to proposition 1, when studying the VC dimension it makes no difference which summing unit is employed for the output node. Thus, we may focus on linear output units without loss of generality.

If the structure of a higher-order unit is fixed, its only parameters are the weights and the threshold of the output node. Thus, with fixed structure, the VC dimension cannot be larger than the VC dimension of a summing unit with the same number of parameters. This fact is employed in the following result.

Lemma 5. Let N be a higher-order unit with fixed structure that consists of k monomials. Then the number of dichotomies that N induces on a set of cardinality
m is at most

2 ∑_{i=0}^{k} (m−1 choose i) < 2(e(m − 1)/k)^k,

for m > k ≥ 1, and the VC dimension of N is at most k + 1.
Proof. Let {M_1, ..., M_k} be the structure of the higher-order unit. Assume further that S is a set of m input vectors and consider the set of vectors S' = {(M_1(s), ..., M_k(s)) : s ∈ S}. Obviously, every dichotomy induced by a summing unit on S' corresponds to at least one dichotomy induced by N on S. Hence, the number of dichotomies that N induces on S cannot be larger than the number of dichotomies that a summing unit with k input variables induces on S'. The latter quantity is known to be not larger than 2 ∑_{i=0}^{k} (m−1 choose i), an expression that is less than 2(e(m − 1)/k)^k (see Anthony & Bartlett, 1999, theorems 3.1 and 3.7, respectively). Thus, we have the claimed upper bound on the number of dichotomies. The VC dimension of a summing unit with k input variables is known to be k + 1 (Anthony & Bartlett, 1999, section 3.3).

We apply this result in the next two theorems, where we consider higher-order units with variable structure. The variability is given in terms of a class of units that underlie a certain connectivity constraint. In the first class, the fan-in of the output node, or the number of monomials, is limited. In the second class, we have a bound on the fan-out of the input nodes, or the number of monomials in which each variable may occur. Both classes can be considered as multivariate generalizations of a function class studied by Karpinski and Werther (1993). They define a polynomial to be t-sparse if it has at most t nonzero coefficients. Thus, a t-sparse univariate polynomial has at most t monomials, and each variable occurs in at most t of them. Karpinski and Werther (1993) show that the class of t-sparse univariate polynomials has VC dimension Θ(t).

Theorem 5. Let C be the class of higher-order units with n input variables and at most k monomials where each variable has an exponent of value at most d. Then C has VC dimension at most

min{(k(nk + k + 1))^2 + 6k(nk + k + 1) log(8(nk + k + 1)), 2nk log(9d)}.

Proof.
Obviously, a network with one hidden layer of product units and a linear unit as output node comprises the set of functions computed by the units in C. Thus, the first bound is obtained from corollary 1 considering the exponents of the variables as parameters and using the fact that a product unit can compute any monomial. Since in a higher-order unit with at most k monomials there are no more than nk occurrences of variables, this leads to a total number of nk + k + 1 parameters.

The second bound is established as follows: Each occurrence of a variable can have an exponent from {0, 1, ..., d} (where we consider a monomial to be 0 if all its variables have exponent 0). Therefore, (d + 1)^{nk} is an upper bound on the number of structures in C. From this bound and lemma 5, we infer that the number of dichotomies induced by C on a set S of m input vectors is at most

(d + 1)^{nk} · 2(e(m − 1)/k)^k.
If S is shattered, its cardinality satisfies m ≤ nk log(d + 1) + k log(e(m − 1)/k) + 1. Now we use that ln a ≤ ab + ln(1/b) − 1 for a, b > 0 (see, e.g., Anthony & Bartlett, 1999, appendix A.1.1). Assuming m > 1, we may substitute a = m − 1 and b = (ln 2)/(2k) to obtain k log(m − 1) ≤ (m − 1)/2 + k log(2k/(e ln 2)). From this we have m ≤ 2nk log(d + 1) + 2k log(2/ln 2) + 1. The right-hand side is at most 2nk(log(d + 1) + log(2/ln 2) + 1/2), which is equal to 2nk log(2√2(d + 1)/ln 2), and this is less than 2nk log(9d).

Theorem 6. Let C be the class of higher-order units with n input variables where each variable occurs in at most l monomials and has an exponent of value at most d. Then C has VC dimension at most

min{4(nl)^4 + 18(nl)^2 log(23nl), 2n^2 l log(9d), 2nl log(5dnl)}.

Proof. The first bound is due to corollary 3 by considering a higher-order unit as a network with one hidden layer of product units and a linear unit as output node. The second bound is obtained from theorem 5 using the fact that every unit in C has at most nl monomials.

We derive the third bound as follows: Since there are at most nl connections between the input nodes and the monomials, there are at most (nl)^{nl} possibilities of connecting the input nodes with the monomials. Further, there are at most d^{nl} ways to assign exponents from {1, ..., d} to the occurrences of the variables. Thus, there are at most (dnl)^{nl} different structures in C. This bound together with lemma 5 implies that the number of dichotomies induced by C on a set S of cardinality m is not larger than

(dnl)^{nl} · 2(e(m − 1)/(nl))^{nl}.
This expression is equal to 2(ed(m − 1))^{nl}. If S is shattered by C, it follows that m ≤ nl log(ed(m − 1)) + 1. Similarly as in the previous proof, assuming m > 1, we can make use of nl log(m − 1) ≤ (m − 1)/2 + nl log(2nl/(e ln 2)) to obtain m ≤ 2nl log(2dnl/ln 2) + 1. The right-hand side is not larger than 2nl(log(2dnl/ln 2) + 1/2), which equals 2nl log(2√2 dnl/ln 2). Since this is less than 2nl log(5dnl), the third bound follows.

4.5 Single Product Units and Monomials. Next we look at a single product unit. Since it can be viewed as a (trivial) network with one hidden unit, an upper bound on its VC dimension is immediately obtained from corollary 1 in the form of n^2 + 6n log(8n), where n is the number of variables. The following statement shows that the exact values for its VC dimension and pseudo-dimension are considerably smaller. This result also contains the first lower bound of this article.

Theorem 7. The VC dimension and the pseudo-dimension of a product unit with n input variables are both equal to n.

Proof. That n is a lower bound easily follows from the fact that a monomial with n variables shatters the set of unit vectors from {0, 1}^n, that is, the set of vectors with a 1 in exactly one position. We show now that n is an upper bound on the pseudo-dimension, so that the theorem follows. The idea is to derive the upper bound by means of the pseudo-dimension of a linear unit. Let (w, x) ↦ f(w, x) be the function computed by a product unit, that is,

f(w, x) = x_1^{w_1} x_2^{w_2} ··· x_n^{w_n},
where w_1, ..., w_n are parameters and x_1, ..., x_n are input variables. Consider some arbitrary set S = {(s_1, u_1), ..., (s_m, u_m)} where s_i ∈ R^n and u_i ∈ R for i = 1, ..., m. According to the definition, the pseudo-dimension of a
product unit is the cardinality of the largest such S that is shattered by functions of the form

(w, x, y) ↦ sgn(f(w, x) − y)
(4.4)
with parameters w_1, ..., w_n and input variables x_1, ..., x_n, y. We use the same idea as in the proof of theorem 1 to get rid of negative input values. According to the assumptions on the parameters (see section 3.1), if some input value is negative, its weight may take on integer values only. The sole effect of changing the sign of an input value, therefore, is possibly to change the sign of the output of the product unit. Hence, if we consider the set S' = {(s'_1, u_1), ..., (s'_m, u_m)}, where s'_i arises from s_i by taking absolute values in all components, then the number of dichotomies induced on S is less than or equal to the cardinality of the set T ⊆ {0, 1}^{2m} defined by

T = {(sgn(f(a, s'_1) − u_1), ..., sgn(f(a, s'_m) − u_m), sgn(−f(a, s'_1) − u_1), ..., sgn(−f(a, s'_m) − u_m)) : a ∈ R^n}.

Since we are interested in an upper bound for |T|, we may assume without loss of generality that no s'_i has some component equal to 0, because in that case, the value sgn(f(a, s'_i) − u_i) is the same for all a ∈ R^n. Thus, for inputs from S', the function f can be written as f(w, x) = exp(w_1 ln x_1 + ··· + w_n ln x_n). Since f(a, s'_i) > 0 for every a ∈ R^n and i = 1, ..., m, we may suppose that each of u_1, ..., u_m is different from 0. Now, depending on whether u_i > 0 or u_i < 0, exactly one of sgn(f(a, s'_i) − u_i), sgn(−f(a, s'_i) − u_i) changes when a varies, while the other one remains constant. Hence, by defining

b_i = 1 if u_i > 0, and b_i = −1 if u_i < 0,

we select the varying components and obtain with

T' = {(sgn(b_1 f(a, s'_1) − u_1), ..., sgn(b_m f(a, s'_m) − u_m)) : a ∈ R^n}

a set T' ⊆ {0, 1}^m that has the same cardinality as T. Consider the function (w, x) ↦ g(w, x) defined for positive input vectors x as

g(w, x) = w_1 ln x_1 + ··· + w_n ln x_n,
that is, g = ln ∘ f. If u_i > 0, then

sgn(b_i f(a, s'_i) − u_i) = sgn(f(a, s'_i) − b_i u_i)
  = sgn(ln(f(a, s'_i)) − ln(b_i u_i))
  = sgn(g(a, s'_i) − ln(b_i u_i)),

and if u_i < 0, then

sgn(b_i f(a, s'_i) − u_i) = sgn(−f(a, s'_i) + b_i u_i)
  = sgn(−ln(f(a, s'_i)) + ln(b_i u_i))
  = sgn(−g(a, s'_i) + ln(b_i u_i)).

This implies that sgn(b_i f(a, s'_i) − u_i) = sgn(b_i g(a, s'_i) − b_i ln(b_i u_i)) for every a ∈ R^n and i = 1, ..., m. From this, we have that |T'| is not larger than the number of dichotomies induced on the set

S'' = {(b_1 ln(s'_1), b_1 ln(b_1 u_1)), ..., (b_m ln(s'_m), b_m ln(b_m u_m))}

(where logarithms of vectors are taken component-wise) by functions of the form

(w, x, y) ↦ sgn(w_1 x_1 + ··· + w_n x_n − y).
(4.5)
To conclude the proof, assume that some set of cardinality n + 1 is shattered by functions of the form (4.4). Then from the reasoning above, we may infer that some set of cardinality n + 1 is shattered by functions of the form (4.5). This, however, contradicts the fact that the pseudo-dimension of a linear unit with n parameters is at most n (see, e.g., Anthony & Bartlett, 1999, theorem 11.6).

Since a single product unit can compute any monomial, the previous result implies that the VC dimension and the pseudo-dimension of the class of monomials do not grow with the degree of the monomials but are identical with the number of variables.

Corollary 5. The VC dimension and the pseudo-dimension of the class of monomials with n input variables are both equal to n.
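The gap between the generic network bound and the exact value of theorem 7 can be made concrete numerically. An illustrative sketch (function names are ours; log2 is the base-2 logarithm):

```python
import math

def generic_bound(n):
    # n^2 + 6n log(8n), from viewing the product unit as a
    # one-hidden-unit network and applying corollary 1
    return n ** 2 + 6 * n * math.log2(8 * n)

def exact_dimension(n):
    # theorem 7: VC dimension = pseudo-dimension = n
    return n

# The generic bound grows quadratically, while the exact value is linear.
for n in [5, 50, 500]:
    print(n, exact_dimension(n), round(generic_bound(n)))
```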
5 Lower Bounds

The results presented thus far exclusively dealt with upper bounds, except for the single product unit and the class of monomials, for which a lower bound has been given in Section 4.5. In the following, we establish some further lower bounds for the VC dimension, which are then, by definition, also lower bounds for the pseudo-dimension. The results in Section 5.1 concern classes of higher-order units and show that some upper bounds given in Section 4.4 cannot be improved in a certain sense. The main result of Section 5.2 is a superlinear lower bound for networks that consist of product and linear units and have constant depth.

5.1 Higher-Order Units. Two types of restrictions have been considered for classes of higher-order units in Section 4.4. First, a bound k was imposed on the number of monomials or, equivalently, on the fan-in of the output node. Second, the number of occurrences of each variable or, equivalently, the fan-out of the input nodes was limited by some bound l. We give a lower bound for the latter class first. The following result provides the essential means.

Theorem 8. Let m, r ≥ 1 be natural numbers. Suppose C is the class of higher-order units with m + 2^r variables, where each variable occurs in at most one monomial and, if so, with exponent 1. Then there is a set of cardinality m · r that is shattered by C.
Proof. We show that the class C shatters some set S ⊆ {−1, 1}^{m + 2^r}, which is constructed as the direct product of a set U ⊆ {−1, 1}^m and a set V ⊆ {−1, 1}^{2^r}. First, let U = {u_1, ..., u_m} be defined by

u_{i,j} = −1 if i = j, and u_{i,j} = 1 otherwise,

for i, j = 1, ..., m, where u_{i,j} denotes the jth component of u_i. Second, given an enumeration L_1, ..., L_{2^r} of all subsets of the set {1, ..., r}, we define V = {v_1, ..., v_r} by

v_{k,j} = −1 if k ∈ L_j, and v_{k,j} = 1 otherwise,

for k = 1, ..., r and j = 1, ..., 2^r. Then the set

S = {u_i : i = 1, ..., m} × {v_k : k = 1, ..., r}

obviously has cardinality m · r.
To verify that S is shattered by C, assume that some dichotomy (S_0, S_1) of S is given. We denote the m + 2^r input variables of the units in C by x_1, ..., x_m, y_1, ..., y_{2^r} such that x_1, ..., x_m receive inputs from U and y_1, ..., y_{2^r} receive inputs from V. We construct monomials M_1, ..., M_{2^r} in these variables as follows: Let the function h : {1, ..., m} → {1, ..., 2^r} satisfy

L_{h(i)} = {k : u_i v_k ∈ S_1},

where u_i v_k is the vector resulting from the concatenation of u_i and v_k. Clearly, h is well defined. Then we build the monomials by defining

M_j = y_j · ∏_{i : h(i) = j} x_i

for j = 1, ..., 2^r. Obviously, every variable occurs in at most one monomial and with exponent 1. Hence, the function f : R^{m + 2^r} → R with

f(x_1, ..., x_m, y_1, ..., y_{2^r}) = M_1 + M_2 + ... + M_{2^r}

can be computed by a member of C. We claim that the function sgn ∘ f induces the dichotomy (S_0, S_1). Let u_i v_k be some element of S. Since k occurs in exactly half of the sets L_1, ..., L_{2^r}, we have from the definition of v_k that

v_{k,1} + v_{k,2} + ... + v_{k,2^r} = 0.

If u_i v_k ∈ S_0, then k ∉ L_{h(i)} according to the definition of h. Then v_k satisfies v_{k,h(i)} = 1. Since u_{i,i} is the only component of u_i with value −1 and x_i occurs only in monomial M_{h(i)}, we have M_{h(i)}(u_i v_k) = −1 and thus f(u_i v_k) = −2. On the other hand, if u_i v_k ∈ S_1, then k ∈ L_{h(i)} and v_{k,h(i)} = −1. Now M_{h(i)}(u_i v_k) = 1 implies f(u_i v_k) = 2. Thus, (S_0, S_1) is induced by sgn ∘ f.
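The construction in the proof of Theorem 8 is concrete enough to be verified by exhaustive search for small parameters. The sketch below (an illustration, not from the paper) builds U, V, and the monomials M_j for m = r = 2 and checks that sgn ∘ f realizes every one of the 2^{mr} dichotomies of S, with f taking only the values ±2:

```python
import itertools

m, r = 2, 2
R = 2 ** r  # number of subsets of {1, ..., r}

# Enumerate all subsets L_1, ..., L_{2^r} of {1, ..., r}.
L = [frozenset(s) for size in range(r + 1)
     for s in itertools.combinations(range(1, r + 1), size)]

# U: u_i has -1 in component i and +1 elsewhere (i = 1, ..., m).
U = [tuple(-1 if i == j else 1 for j in range(1, m + 1))
     for i in range(1, m + 1)]
# V: v_k has component v_{k,j} = -1 iff k is in L_j.
V = [tuple(-1 if k in L[j] else 1 for j in range(R))
     for k in range(1, r + 1)]

S = [(i, k) for i in range(m) for k in range(r)]  # indices into U x V

for labels in itertools.product([0, 1], repeat=len(S)):
    S1 = {S[t] for t, lab in enumerate(labels) if lab == 1}
    # h(i) picks the subset with L_{h(i)} = {k : u_i v_k in S_1}.
    h = [L.index(frozenset(k + 1 for k in range(r) if (i, k) in S1))
         for i in range(m)]
    for (i, k) in S:
        # f(u_i v_k) = sum_j M_j, where M_j = v_{k,j} * prod_{i':h(i')=j} u_{i,i'}
        f = 0
        for j in range(R):
            prod = V[k][j]
            for i2 in range(m):
                if h[i2] == j:
                    prod *= U[i][i2]
            f += prod
        assert f == (2 if (i, k) in S1 else -2)
print("all", 2 ** len(S), "dichotomies realized")
```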
We can now derive a lower bound for the class of higher-order units where the number of occurrences of the variables is restricted. The result implies that the bound O(nl log(dnl)), where n is the number of variables, l the number of occurrences, and d the largest degree, given in Theorem 6 is asymptotically optimal with respect to n. Moreover, this optimality holds even if each variable is allowed to occur at most once and with exponent 1. By ⌊x⌋ we denote the largest integer less than or equal to x.

Corollary 6. Suppose C is the class of higher-order units with n variables, where each variable occurs in at most one monomial and, if so, with exponent 1. Then the VC dimension of C is at least ⌊n/2⌋ · ⌊log(n/2)⌋.
Proof. If m = ⌊n/2⌋ and r = ⌊log(n/2)⌋, then m + 2^r ≤ n. Hence, Theorem 8 shows that there exists a set S ⊆ R^n of cardinality m · r = ⌊n/2⌋ · ⌊log(n/2)⌋ that is shattered by C, even if each variable has exponent 1.

The next result paves the way to a lower bound for the class of higher-order units with a limited number of monomials.

Theorem 9. Let m, r ≥ 1 be natural numbers. Suppose C is the class of higher-order units with m + 2r variables such that each unit consists of at most 2^r monomials and each variable occurs with exponent 1 only. Then there is a set of cardinality m · 2^r that is shattered by C.

Proof. We construct S ⊆ {−1, 0, 1}^{m + 2r} as the direct product of two sets U ⊆ {−1, 1}^m and V ⊆ {0, 1}^{2r}. As in the previous proof, we define U = {u_1, ..., u_m} with u_{i,j}, the jth component of u_i, being

u_{i,j} = −1 if i = j, and u_{i,j} = 1 otherwise,

for i, j = 1, ..., m. For the definition of V = {v_1, ..., v_{2^r}}, let L_1, ..., L_{2^r} be an enumeration of all subsets of the set {1, ..., r}. Then the components of v_k are defined by

v_{k,j} = 1 if j ∈ L_k, and v_{k,j} = 0 otherwise,

and

v_{k,r+j} = 0 if j ∈ L_k, and v_{k,r+j} = 1 otherwise,
for k = 1, ..., 2^r and j = 1, ..., r. Clearly, the set

S = {u_i : i = 1, ..., m} × {v_k : k = 1, ..., 2^r}

has cardinality m · 2^r. It remains to show that C shatters S. Let (S_0, S_1) be some arbitrary dichotomy of S. Denote the input variables by x_1, ..., x_m, y_1, ..., y_{2r} such that x_1, ..., x_m and y_1, ..., y_{2r} receive inputs from U and V, respectively. First, we define monomials N_1, ..., N_{2^r} in the variables y_1, ..., y_{2r} by

N_k = ∏_{j ∈ L_k} y_j · ∏_{j ∉ L_k} y_{r+j}

for k = 1, ..., 2^r. Next, we use them to construct monomials M_1, ..., M_{2^r} defined by

M_k = N_k · ∏_{i : u_i v_k ∈ S_1} x_i
for k = 1, ..., 2^r, where u_i v_k denotes the concatenation of u_i and v_k. Then the function f : R^{m + 2r} → R with

f(x_1, ..., x_m, y_1, ..., y_{2r}) = −M_1 − ... − M_{2^r}

can clearly be computed by a higher-order unit in C. We show that (S_0, S_1) is induced by sgn ∘ f. Let u_i v_k be some element of S. The definitions of v_k and the monomials N_1, ..., N_{2^r} imply that

N_l(u_i v_k) = 1 if l = k, and N_l(u_i v_k) = 0 otherwise,

for l = 1, ..., 2^r. Hence, we have

f(u_i v_k) = −M_k(u_i v_k) = − ∏_{h : u_h v_k ∈ S_1} u_{i,h},

which together with the definition of u_i implies

f(u_i v_k) = −1 if u_i v_k ∈ S_0, and f(u_i v_k) = 1 if u_i v_k ∈ S_1.

Thus, sgn ∘ f induces the dichotomy (S_0, S_1) as claimed, and, consequently, S is shattered by C.

Finally, we obtain a lower bound for the class of higher-order units with a restricted number of monomials. In particular, the result shows that the bound O(nk log d), where k is the number of monomials, obtained in Theorem 5 cannot be improved with respect to nk.

Corollary 7. Suppose C is the class of higher-order units with n variables where each variable has exponent 1 and each unit consists of at most k monomials for some arbitrary k satisfying k ≤ 2^{n/4}. Then C has VC dimension at least ⌊n/2⌋ · ⌊k/2⌋.

Proof. Let r be the largest integer satisfying 2^r ≤ k. Clearly, then, 2^r > ⌊k/2⌋. If we choose m = ⌊n/2⌋, then m + 2r ≤ ⌊n/2⌋ + 2 log k ≤ n, and Theorem 9 implies that there is a set S ⊆ R^n of cardinality m · 2^r ≥ ⌊n/2⌋ · ⌊k/2⌋ that is shattered by the class of higher-order units with at most 2^r ≤ k monomials where each variable has exponent 1.

We conclude by observing that the previous result also yields a lower bound for the class of higher-order units with restricted fan-out of the input nodes since, clearly, each variable occurs in at most k monomials.
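The construction behind Theorem 9 can be verified the same way for small parameters. The following sketch (again illustrative, not from the paper) builds U and V for m = 2, r = 1, forms the monomials N_k and M_k, and checks for every dichotomy that f takes the value 1 on S_1 and −1 on S_0:

```python
import itertools

m, r = 2, 1
K = 2 ** r  # number of monomials

L = [frozenset(s) for size in range(r + 1)
     for s in itertools.combinations(range(1, r + 1), size)]

U = [tuple(-1 if i == j else 1 for j in range(m)) for i in range(m)]
# v_k: v_{k,j} = 1 iff j in L_k, and v_{k,r+j} = 0 iff j in L_k.
V = [tuple(1 if j in L[k] else 0 for j in range(1, r + 1))
     + tuple(0 if j in L[k] else 1 for j in range(1, r + 1))
     for k in range(K)]

def N(l, v):
    # N_l = prod_{j in L_l} y_j * prod_{j not in L_l} y_{r+j}
    out = 1
    for j in range(1, r + 1):
        out *= v[j - 1] if j in L[l] else v[r + j - 1]
    return out

S = [(i, k) for i in range(m) for k in range(K)]
for labels in itertools.product([0, 1], repeat=len(S)):
    S1 = {S[t] for t, lab in enumerate(labels) if lab == 1}
    for (i, k) in S:
        f = 0
        for l in range(K):
            Ml = N(l, V[k])  # selects exactly the monomial with l = k
            for i2 in range(m):
                if (i2, l) in S1:
                    Ml *= U[i][i2]
            f -= Ml
        assert f == (1 if (i, k) in S1 else -1)
print("shattered: all", 2 ** len(S), "dichotomies")
```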
5.2 Networks with Product Units. Multiplication is certainly the simplest type of arithmetical operation that can be performed by a product unit: all weights just need to be set to 1. Koiran and Sontag (1997) show that there exist networks consisting of linear and multiplication units that have VC dimension quadratic in the number of weights. Hence, this bound remains valid when product units are used instead of multiplication units, and Corollary 1 of Koiran and Sontag (1997) implies that for every W, there is a network with O(W) weights that consists only of linear and product units and has VC dimension W^2. This lower bound is based on the use of networks with unrestricted depth. An extension of the result of Koiran and Sontag (1997) is obtained by Bartlett et al. (1998), who give a lower bound for layered sigmoidal networks in terms of the number of weights and the number of layers. Using the constructions of Koiran and Sontag (1997) and Bartlett et al. (1998) in terms of linear and multiplication units, we deduce that for every L and sufficiently large W, there is a network with L layers and O(W) weights that consists only of linear and product units and has VC dimension at least ⌊L/2⌋ · ⌊W/2⌋. Thus, in terms of the number of weights, we have a quadratic lower bound for arbitrary networks and a linear lower bound for networks of constant depth.

It is known, however, that networks of summing units can have constant depth and superlinear VC dimension. For threshold units, such networks have been constructed by Sakurai (1993) and Maass (1994). We show now that product unit networks of constant depth can also have a superlinear VC dimension. In particular, we establish this for networks consisting of product and linear units and having two hidden layers. The numbering of the hidden layers in the following statement is done from the input nodes toward the output node.

Theorem 10. Let n, k be natural numbers satisfying k ≤ 2^{n+2}. There is a network N with the following properties: It has n input nodes, at most k hidden nodes arranged in two layers with product units in the first hidden layer and linear units in the second, and a product unit as output node; furthermore, N has 2n⌊k/4⌋ adjustable and 7⌊k/4⌋ fixed weights. The VC dimension of N is at least (n − ⌊log(k/4)⌋) · ⌊k/8⌋ · ⌊log(k/8)⌋.

With the aim of proving this, we first establish a lemma in which we introduce a new kind of summing unit and make use of a property of sets of vectors. A set of m vectors in R^n is said to be in general position if every subset of at most n vectors is linearly independent. Obviously, a set in general position can be constructed for any m and n. The new summing unit has weights and a threshold as parameters and computes its output by applying the activation function

t(y) = 1 + 1/cosh(y)

to the weighted sum. This function has its maximum at y = 0 with t(0) = 2 and satisfies lim t(y) = 1 for y → −∞ as well as for y → ∞. Further, t(y) ≥ 1 always holds.
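The stated properties of the activation t(y) = 1 + 1/cosh(y), together with its representation as a ratio of sums of exponentials that the proof of Theorem 10 later exploits, can be confirmed numerically (a sketch, not part of the paper):

```python
import math

def t(y):
    # activation: t(y) = 1 + 1/cosh(y)
    return 1.0 + 1.0 / math.cosh(y)

def t_exp(y):
    # the same function written as a ratio of sums of exponentials
    p, q = math.exp(y), math.exp(-y)
    return (p + q + 2.0) / (p + q)

assert t(0.0) == 2.0                        # maximum value 2 at y = 0
ys = [x / 10.0 for x in range(-300, 301)]
assert all(1.0 <= t(y) <= 2.0 for y in ys)  # t(y) >= 1 everywhere
assert max(ys, key=t) == 0.0                # maximum attained at y = 0
assert abs(t(30.0) - 1.0) < 1e-12           # t(y) -> 1 as |y| grows
assert all(abs(t(y) - t_exp(y)) < 1e-12 for y in ys)
```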
Lemma 6. Let h, m, r be arbitrary natural numbers. Suppose N is a network with m + r input nodes, one hidden layer of h + 2^r nodes that are summing units with activation function 1 + 1/cosh, and a monomial as output node. Then there is a set of cardinality h · m · r that is shattered by N.

Proof. The construction is based on methods due to Sakurai and Yamasaki (1992) and Sakurai (1993). We choose a set {s_1, ..., s_{h·m}} ⊆ R^m in general position and let e_1, ..., e_r be the unit vectors in R^r, that is, they have a 1 in exactly one component and 0 elsewhere. Clearly, then, the set

S = {s_i : i = 1, ..., h·m} × {e_j : j = 1, ..., r}

is a subset of R^{m+r} with cardinality h · m · r. We show that it can be shattered by the network N as claimed. Assume that (S_0, S_1) is a dichotomy of S. Let L_1, ..., L_{2^r} be an enumeration of all subsets of the set {1, ..., r}, and define the function g : {1, ..., h·m} → {1, ..., 2^r} to satisfy

L_{g(i)} = {j : s_i e_j ∈ S_1},

where s_i e_j denotes the concatenated vectors s_i and e_j. For l = 1, ..., 2^r, let R_l ⊆ {s_1, ..., s_{h·m}} be the set

R_l = {s_i : g(i) = l}.

For each R_l, we use ⌈|R_l|/m⌉ hidden nodes of which we define the weights as follows: We partition R_l into ⌈|R_l|/m⌉ subsets R_{l,p}, p = 1, ..., ⌈|R_l|/m⌉, each of which has cardinality m, except for possibly one set of cardinality less than m. For each subset R_{l,p} there exist real numbers w_{l,p,1}, ..., w_{l,p,m}, t_{l,p} such that every s_i ∈ {s_1, ..., s_{h·m}} satisfies

(w_{l,p,1}, ..., w_{l,p,m}) · s_i − t_{l,p} = 0 if and only if s_i ∈ R_{l,p}.   (5.1)

This follows from the fact that the set {s_1, ..., s_{h·m}} is in general position. (In other words, (w_{l,p,1}, ..., w_{l,p,m}, t_{l,p}) represents the hyperplane passing through all points in R_{l,p} and through none of the other points. If |R_{l,p}| = m, then this hyperplane is unique. In case |R_{l,p}| < m, we select one of the hyperplanes containing R_{l,p} and none of the other points. This can be done, for example, by extending the unique (|R_{l,p}| − 1)-dimensional hyperplane determined by R_{l,p} to an appropriate (m − 1)-dimensional hyperplane.) With subset R_{l,p}, we associate a hidden node with threshold t_{l,p} and with weights w_{l,p,1}, ..., w_{l,p,m} for the connections from the first m input nodes. Since, of all subsets R_{l,p}, at most h have cardinality m and at most 2^r have cardinality less than m, this construction can be done with at most h + 2^r hidden nodes.

Thus far, we have specified the weights for the connections outgoing from the first m input nodes. The connections from the remaining r input nodes are weighted as follows: Let ε > 0 be a real number such that for every s_i ∈ {s_1, ..., s_{h·m}} and every weight vector (w_{l,p,1}, ..., w_{l,p,m}, t_{l,p}),

if s_i ∉ R_{l,p}, then |(w_{l,p,1}, ..., w_{l,p,m}) · s_i − t_{l,p}| > ε.

According to the construction of the weight vectors in equation 5.1, such an ε clearly exists. We define the remaining weights w_{l,p,m+1}, ..., w_{l,p,m+r} by

w_{l,p,m+j} = 0 if j ∈ L_l, and w_{l,p,m+j} = ε otherwise.   (5.2)
This completes the definition of the hidden nodes. We show that they have the following property:

Claim. If s_i e_j ∈ S_1, then there is exactly one hidden node with output value 2; if s_i e_j ∈ S_0, then all hidden nodes yield an output value less than 2.

In order to establish this, we observe that according to equation 5.1, there is exactly one weight vector (w_{l,p,1}, ..., w_{l,p,m}, t_{l,p}), where l = g(i), that yields 0 on s_i. If s_i e_j ∈ S_1, then j ∈ L_{g(i)}, which together with equation 5.2 implies that the weighted sum (w_{l,p,m+1}, ..., w_{l,p,m+r}) · e_j is equal to 0. Hence, this node gets the total weighted sum 0 and, applying 1 + 1/cosh, outputs 2. The input vector e_j changes the weighted sums of the other nodes by an amount of at most ε. Thus, the total weighted sums for these nodes remain different from 0, and, hence, the output values are less than 2.

On the other hand, if s_i e_j ∈ S_0, then j ∉ L_{g(i)}, and the node that yields 0 on s_i receives an additional amount ε through weight w_{l,p,m+j}. This gives a total weighted sum different from 0 and an output value less than 2. All other nodes fail to receive 0 by an amount of more than ε and thus have total weighted sum different from 0 and, hence, an output value less than 2. Thus, the claim is proven.

Finally, to complete the proof, we make one more modification to the weight vectors and define the weights for the output node. Clearly, if we multiply all weights and thresholds defined thus far by any real number α > 0, the claim above remains true. Since lim(1 + 1/cosh(y)) = 1 for y → −∞ and y → ∞, we can find an α such that on every s_i e_j ∈ S, the output values of those hidden nodes that do not output 2, multiplied together, yield a value as close to 1 as necessary. Further, this value is at least 1, since 1 + 1/cosh(y) ≥ 1 for all y. Thus, if we employ a monomial with all exponents equal to 1 for the output node, it follows from the reasoning above that the
output value of the network is at least 2 if and only if s_i e_j ∈ S_1. This shows that S is shattered by N.

We now employ the previous result and give a proof of Theorem 10.

Proof of Theorem 10. The idea is to take a set S′ constructed as in Lemma 6 and, as shown there, shattered by a network N′ with a monomial as output node and one hidden layer of summing units that use the activation function 1 + 1/cosh. Then S′ is transformed into a set S, and N′ into a network N, such that for every dichotomy (S′_0, S′_1) induced by N′ on S′, the network N induces the corresponding dichotomy (S_0, S_1) of S.

Assume that n and k are given as supposed, and let S′ be the set defined in Lemma 6, choosing h = ⌊k/8⌋, m = n − ⌊log(k/4)⌋, and r = ⌊log(k/8)⌋. Note that the assumption k ≤ 2^{n+2} ensures that m ≥ 0. Then S′ has cardinality

m · h · r = (n − ⌊log(k/4)⌋) · ⌊k/8⌋ · ⌊log(k/8)⌋.

Furthermore, we have m + r = n − 1 and hence S′ ⊆ R^{n−1}, and Lemma 6 implies that S′ is shattered by a network N′ with n − 1 input nodes, a monomial as output node, and one hidden layer of h + 2^r ≤ ⌊k/4⌋ summing units with activation function 1 + 1/cosh. From S′ we construct S ⊆ R^n by defining

S = {(e^{s′_1}, ..., e^{s′_{n−1}}, e) : (s′_1, ..., s′_{n−1}) ∈ S′}.

In other words, S is obtained from S′ by appending a component containing 1 to each vector and applying the function y ↦ exp(y) to every component. On some input vector s′ ∈ S′, a hidden node of N′ with weight vector w and threshold t computes

1 + 1/cosh(w · s′ − t) = (exp(w · s′ − t) + exp(−w · s′ + t) + 2) / (exp(w · s′ − t) + exp(−w · s′ + t)).   (5.3)

If s = (s_1, ..., s_n) is the vector in S obtained from the (unique) vector s′ = (s′_1, ..., s′_{n−1}) in S′, then according to the construction of S,

(s′_1, ..., s′_{n−1}, 1) = (ln(s_1), ..., ln(s_{n−1}), ln(s_n)),

which implies that an exponential on the right-hand side of equation 5.3 with weights w and threshold t yields on input vector s′ the same output as a product unit with weights w, t on input vector s. (It is clear now that the reason for appending one component to the vectors in S′ was to accommodate the threshold t as a weight in a product unit.) Therefore, the computation of a summing unit with activation function 1 + 1/cosh on s′ ∈ S′ can be simulated
by feeding the vector s ∈ S into a network with two hidden layers, where the first layer consists of two product units, the second layer has two linear units, and the output node computes a division. Furthermore, this network of four hidden nodes has 2n connections with adjustable weights and seven connections with fixed weights (two for each linear unit, one for the threshold of the linear unit computing the numerator, and two for the division). Replacing all ⌊k/4⌋ hidden nodes of N′ in this way, we obtain the network N, which has at most k hidden nodes arranged in two layers, where the first hidden layer consists of product units and the second of linear units. The output node has to compute a product of divisions, which can be done by a single product unit. Further, N has 2n⌊k/4⌋ adjustable and 7⌊k/4⌋ fixed weights. Thus, N has the properties as claimed and shatters the set S, which has the same cardinality as S′.

From the previous result, we derive the following simplified statement of a superlinear lower bound.

Corollary 8. Let n, k be natural numbers where 16 ≤ k ≤ 2^{n/2+2}. There is a network of product and linear units with n input units, at most k hidden nodes in two layers, and at most nk weights that has VC dimension at least (nk/32) log(k/16).

Proof. The network constructed in the proof of Theorem 10 has 2n⌊k/4⌋ + 7⌊k/4⌋ ≤ nk/2 + 2k weights, which are, using 2 ≤ n/2 from the assumptions, not more than nk weights. The VC dimension of this network was shown to be at least (n − ⌊log(k/4)⌋) · ⌊k/8⌋ · ⌊log(k/8)⌋. Now, k ≤ 2^{n/2+2} implies n − ⌊log(k/4)⌋ ≥ n/2, from k ≥ 16 we get ⌊k/8⌋ ≥ k/8 − 1 ≥ k/16, and at last we use ⌊log(k/8)⌋ ≥ log(k/8) − 1 = log(k/16).

6 Summary and Conclusion

Multiplication is an arithmetical operation that, when used in neural networks, certainly helps to increase their computational power by allowing neural inputs to interact nonlinearly.
The question is how this gain is reflected in quantitative measures of complexity and, in particular, of analog computational power. In this article, we have dealt with two such measures: the Vapnik-Chervonenkis dimension and the pseudo-dimension. We have derived upper and lower bounds on these dimensions for neural networks in which multiplication occurs as a fundamental operation in the interaction of network elements. An overview of the results is given in Table 1, where we present the bounds mainly in asymptotic form, abstracting from most of the constant factors.
Table 1: Survey of the Results.

Architectures covered: single unit; one hidden layer; two hidden layers; general feedforward (single networks and classes of networks, as indicated by the references).

Unit types: product; monomial; product and summing; higher-order; product and sigmoidal; higher-order sigmoidal.

Restrictions (remarks): unit type variable; exponents as parameters; exponents fixed; summing units in second hidden layer; k hidden nodes; input nodes fan-out ≤ l; input nodes fan-out ≤ l, exponents ≤ d; input nodes fan-out ≤ l, exponents 1; input nodes fan-out 1, exponents 1; k monomials, exponents ≤ d; k monomials, exponents 1.

Bounds: 4(Wk)^2 + O(Wk log(Wk)); 9(W^2 k)^2 + O(W^2 k log(W^2 k)); 36(Wk)^4 + O((Wk)^2 log(Wk)); (Wk)^2 + O(Wk log W); 4(nl)^4 + O((nl)^2 log(nl)); O((nl)^4); O(n^2 l log d); O(nl log(dnl)); O(n^2 k^4); O(nk log d); n (VC dimension and pseudo-dimension of monomials); Ω(W log k); Ω(nk); Ω(n log n); Ω(nl).

References: Theorems 3, 4, 5, 6, 7; Corollaries 1, 3, 4, 5, 6, 7, 8.

Notes: If not otherwise stated, W, k, and n refer to the number of parameters, computation nodes, and input nodes, respectively. Upper bounds are valid for the pseudo-dimension, lower bounds for the VC dimension.
The bounds are given in terms of the numbers of network parameters and computation nodes and, for classes, in terms of the restrictions that characterize the architectures in the respective class. We highlight two features. First, the upper bounds are all polynomials of low order. In particular, the bound for general feedforward networks exhibits the same order of magnitude as the best-known upper bound for purely sigmoidal networks, and this holds even when it is not predetermined whether a node is to become a summing or a product unit. Second, the upper bounds for higher-order networks and some of the bounds for classes of higher-order units do not involve any constraint on the order. It is therefore impossible to find lower bounds that exhibit a growth in terms of the order only. This limitation is also indicated by the fact that some lower bounds for classes of higher-order units are already tight for order one. In this case, the degree of multiplicativity cannot help in proving better lower bounds. In general, the results show that multiplication in neural networks does not lead to an immeasurable growth of the VC dimension and pseudo-dimension.

In practical uses of artificial neural networks, such as in pattern recognition, higher-order networks and product unit networks are considered natural extensions of the classical linear summing networks. We have reported on some applications where learning algorithms have been designed for training multiplicative networks. The question of how well networks resulting from these algorithms generalize is theoretically studied in numerous models of learning. In a major part of them, the VC dimension and the pseudo-dimension play a central role. They can be used to estimate the number of training examples required by learning algorithms to generate hypotheses with low generalization error.

The results given here imply that estimates can now be given for higher-order sigmoidal networks that do not come with an a priori restriction of their order. Hence, one need not cope with a sample complexity that grows with the order. For learning applications, this suggests the use of higher-order networks without any limit on the order. Further, the estimates are valid for a class of neural network learning algorithms that has yet to be developed: they hold even if the algorithm is allowed to decide for each node whether it is to be a summing or a product unit.

Apart from applications of learning, multiplicative neural networks are used for modeling the behavior of biological nervous systems or parts thereof. In this context, questions arise as to what type of functions have been captured in a model that has been constructed in accordance with experimental observations. The VC dimension and the pseudo-dimension are combinatorial measures for the complexity and diversity of function classes. As such, they can be used to compare networks with respect to their expressiveness. Moreover, using upper bounds on these dimensions, lower bounds on the size of networks for the computation and approximation of functions can be calculated. By means of the results given here, such calculations can now be done for multiplicative neural networks. Thus, a new
tool is available for the assessment of these networks and for the verification of their proper use in neural modeling.

Our investigations naturally raise some new questions. Most prominent, since it is also an open problem for networks of sigmoidal units, is the issue of whether significantly better upper bounds can be shown for networks of fixed depth. The bounds for depth-restricted networks established so far coincide with the bounds for general feedforward networks. For the latter, however, quadratic lower bounds have been derived using a method that does not apply to constant-depth networks. Thus, the gap between upper and lower bound for depth-restricted networks is larger than in the general feedforward case.

The so-called fat-shattering dimension is a further combinatorial measure that is known to give bounds on the complexity of learning. Since it is bounded from above by the pseudo-dimension, the results in this article imply upper bounds on the fat-shattering dimension. Moreover, when the output node is a linear unit, the fat-shattering dimension is equal to the pseudo-dimension. It is an interesting question whether, for networks with nonlinear output nodes, bounds can be obtained for the fat-shattering dimension that are significantly smaller than the pseudo-dimension of the network.

The lower bounds we presented are all derived for the VC dimension and, hence, are by definition also valid for the pseudo-dimension. It is currently not known how to obtain lower bounds for the pseudo-dimension of neural networks directly. Finally, our calculations resulted in several constants appearing in the bounds. We did not strive to obtain optimal values but were content with the constants being small. Certainly, improvements might be possible using more detailed calculations or new approaches.

Appendix: Solution Set Components Bounds

In the following, we give the proofs of the lemmas required for the upper bounds in Section 4.

Proof of Lemma 1.
Consider for 1 ≤ k ≤ d an arbitrary set {f_1, ..., f_k} ⊆ G that has regular zero-set intersections, and let p_i be the degree of f_i. It follows from Khovanskiĭ (1991, p. 91, Corollary 3) that if l is the dimension of the set {a ∈ R^d : f_1(a) = ... = f_k(a) = 0}, then this set has at most

2^{q(q−1)/2} p_1 ··· p_k S^l [(l + 1)S − l]^q

connected components, where S = Σ_{i=1}^{k} p_i + l + 1. From p_i ≤ p, k ≤ d, and l ≤ d, we get S ≤ (p + 1)d + 1 and (l + 1)S − l ≤ (p + 1)d(d + 1) + 1, which implies the result.
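The simplification step in this proof, replacing S and (l + 1)S − l by bounds that depend only on p, d, and q, can be spot-checked numerically. In the sketch below (illustrative only, with function names of our choosing), all p_i are set to a common value p, and the simplified bound is verified to dominate the Khovanskiĭ-type bound over a range of small parameters:

```python
def khovanskii_bound(ps, l, q):
    # 2^(q(q-1)/2) * p1...pk * S^l * ((l+1)S - l)^q  with  S = sum(ps) + l + 1
    S = sum(ps) + l + 1
    prod = 1
    for p in ps:
        prod *= p
    return 2 ** (q * (q - 1) // 2) * prod * S ** l * ((l + 1) * S - l) ** q

def simplified_bound(p, d, q):
    # uses S <= (p+1)d + 1 and (l+1)S - l <= (p+1)d(d+1) + 1, with k, l <= d
    S = (p + 1) * d + 1
    return (2 ** (q * (q - 1) // 2) * p ** d * S ** d
            * ((p + 1) * d * (d + 1) + 1) ** q)

# check that the simplification dominates for small parameter settings
for d in range(1, 5):
    for q in range(0, 4):
        for p in range(1, 4):
            for k in range(1, d + 1):
                for l in range(0, d + 1):
                    assert khovanskii_bound([p] * k, l, q) <= simplified_bound(p, d, q)
```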
Proof of Lemma 2. Let k ≤ d and consider some arbitrary set {f_1, ..., f_k} ⊆ G that has regular zero-set intersections. According to the assumptions, for i = 1, ..., k each f_i can be written in the form

(y_1, ..., y_d) ↦ a_i + b_{i,1} y_1 + ... + b_{i,d} y_d + c_{i,1} e^{g_{i,1}} + ... + c_{i,r_i} e^{g_{i,r_i}},

where r_i ≤ q, the g_{i,j} are affine functions in y_1, ..., y_d, and a_i, b_{i,1}, ..., b_{i,d}, c_{i,1}, ..., c_{i,r_i} are real numbers. We introduce new functions f̃_i and h_{i,j} in y_1, ..., y_d and in new variables z_{i,j} by defining

f̃_i(y_1, ..., y_d, z_{i,1}, ..., z_{i,r_i}) = a_i + b_{i,1} y_1 + ... + b_{i,d} y_d + c_{i,1} e^{z_{i,1}} + ... + c_{i,r_i} e^{z_{i,r_i}},

h_{i,j}(y_1, ..., y_d, z_{i,j}) = g_{i,j}(y_1, ..., y_d) − z_{i,j},

for i = 1, ..., k and j = 1, ..., r_i. Let G̃ be the class of affine functions in y_1, ..., y_d and in z_{i,j} and e^{z_{i,j}}, for i = 1, ..., k and j = 1, ..., q. Clearly, the functions f̃_i and h_{i,j} are elements of G̃. Furthermore, since f_1, ..., f_k are chosen arbitrarily from G and at most q new variables are introduced for each f_i, the classes G and G̃ satisfy Definition 7.12 of Anthony and Bartlett (1999); that is, G̃ computes G with q intermediate variables. In particular, the partial derivative of h_{i,j} with respect to the variable z_{i,j} is −1 and hence nonzero. Thus, the derivative condition (iii) of Definition 7.12 is met. It follows from Theorem 7.13 of Anthony and Bartlett (1999) that any solution set components bound for G̃ is also a solution set components bound for G. Since G̃ consists of polynomials of degree 1 in dq + d variables and dq fixed exponentials, by virtue of Lemma 1, class G̃ has solution set components bound

B = 2^{dq(dq−1)/2} [2(dq + d) + 1]^{dq+d} [2(dq + d)(dq + d + 1) + 1]^{dq},

which is hence also a solution set components bound for G as claimed.

Proof of Lemma 3. Let {f_1, ..., f_k} ⊆ F and {s_1, ..., s_m} ⊆ R^n be given, and let T be defined as above. In the proof of Theorem 7.8 in Anthony and Bartlett (1999), it is shown that there then exist real numbers l_{1,1}, ..., l_{k,m} such that the following holds: Let C denote the number of connected components of the set
R^d − ⋃_{i=1}^{k} ⋃_{j=1}^{m} {a ∈ R^d : f_i(a, s_j) − l_{i,j} = 0}.
Then T satisfies |T| ≤ C, and the set of functions

{a ↦ f_i(a, s_j) − l_{i,j} : i = 1, ..., k; j = 1, ..., m}

has regular zero-set intersections. Clearly, this set is a subset of G, which has solution set components bound B. In the proof of Theorem 7.6 in Anthony and Bartlett (1999), it is shown that this implies

C ≤ B · Σ_{i=0}^{d} binom(mk, i) ≤ B · (emk/d)^d

for m ≥ d/k. Hence, the claimed result follows using |T| ≤ C.

Proof of Lemma 4. Let {f_1, ..., f_k} ⊆ G, where k ≤ d, be some arbitrary set of functions that has regular zero-set intersections. According to the assumptions, there is for each f_i a network that computes f_i with r and q computation nodes as described. We number the nodes such that the computation of each node depends only on nodes with a smaller number. Then for i = 1, ..., k and j = 1, ..., r, the computation performed by node j in the network for f_i can be represented by a function n_{i,j} in the variables y_1, ..., y_d that is recursively defined by

n_{i,j}(y) = c_{i,j} + 1/(1 + exp(−n_{i,l}(y))),  l < j,
or n_{i,j}(y) = c_{i,j} ± exp(n_{i,l}(y)),  l < j,
or n_{i,j}(y) = ln(n_{i,l}(y)),  l < j,
or n_{i,j}(y) = p_{i,j}(y, n_{i,1}(y), ..., n_{i,j−1}(y)),

depending on whether node j computes the function a ↦ c_{i,j} + 1/(1 + e^{−a}), a ↦ c_{i,j} ± e^a, a ↦ ln a, or the degree 2 polynomial p_{i,j}, respectively. We introduce new functions g_{i,j}, g̃_{i,j} in y_1, ..., y_d and in new variables z_{i,j}, z̃_{i,j} corresponding to the above four cases for n_{i,j} as follows: If node j in the network for f_i computes

1. the function a ↦ c_{i,j} + 1/(1 + e^{−a}), then

g̃_{i,j}(y, z̃_{i,j}) = n_{i,l}(y) + z̃_{i,j},
g_{i,j}(z_{i,j}, z̃_{i,j}) = (z_{i,j} − c_{i,j})(1 + exp(z̃_{i,j})) − 1,

2. the function a ↦ c_{i,j} ± e^a, then

g̃_{i,j}(y, z̃_{i,j}) = n_{i,l}(y) − z̃_{i,j},
g_{i,j}(z_{i,j}, z̃_{i,j}) = exp(z̃_{i,j}) ± c_{i,j} − z_{i,j},
294
Michael Schmitt
3. the function $a \mapsto \ln a$, then
$$\tilde{g}_{i,j}(y, \tilde{z}_{i,j}) = n_{i,l}(y) - \exp(\tilde{z}_{i,j}), \qquad g_{i,j}(z_{i,j}, \tilde{z}_{i,j}) = \tilde{z}_{i,j} - z_{i,j},$$

4. the degree 2 polynomial $p_{i,j}$, then
$$g_{i,j}(y, z_{i,1}, \dots, z_{i,j}) = p_{i,j}(y, z_{i,1}, \dots, z_{i,j-1}) - z_{i,j},$$

where $l$, $c_{i,j}$, and $p_{i,j}$ are as in the corresponding definition of $n_{i,j}$ above. Let $\tilde{\mathcal{G}}$ be the class of polynomials of degree at most 2 in the variables $y_1, \dots, y_d$ and in $z_{i,j}$, $\tilde{z}_{i,j}$, and $\exp(\tilde{z}_{i,j})$. There are at most $kr$ variables $z_{i,j}$ and, since the variables $\tilde{z}_{i,j}$ are introduced only for those nodes that compute a nonpolynomial function, at most $kq$ variables $\tilde{z}_{i,j}$ and $kq$ exponentials $\exp(\tilde{z}_{i,j})$. Clearly, $\tilde{\mathcal{G}}$ contains the functions $\tilde{g}_{i,j}$ and $g_{i,j}$, which implicitly define the variables $\tilde{z}_{i,j}$ and $z_{i,j}$, respectively. The partial derivative of $\tilde{g}_{i,j}$ with respect to $\tilde{z}_{i,j}$ in cases 1–3 is $1$, $-1$, and $-\exp(\tilde{z}_{i,j})$, respectively. For $g_{i,j}$, the partial derivative with respect to $z_{i,j}$ is $1 + \exp(\tilde{z}_{i,j})$ in case 1 and $-1$ in cases 2–4. All of these partial derivatives are everywhere nonzero. Hence, condition iii of definition 7.12 in Anthony and Bartlett (1999) is satisfied. Furthermore, theorem 7.13 of Anthony and Bartlett (1999) implies that $\tilde{\mathcal{G}}$ computes $\mathcal{G}$ with $r + q$ intermediate variables, and any solution set components bound for $\tilde{\mathcal{G}}$ is also a solution set components bound for $\mathcal{G}$. The polynomials in $\tilde{\mathcal{G}}$ are of degree at most 2 in no more than $2dr + d$ variables and $dq$ fixed exponentials. Thus, lemma 1 shows that $\tilde{\mathcal{G}}$ has solution set components bound
$$B = 2^{dq(dq-1)/2}\,[6(2dr+d)+2]^{2dr+d}\,[3(2dr+d)(2dr+d+1)+1]^{dq},$$
and we conclude that this is also a solution set components bound for $\mathcal{G}$.

Acknowledgments

I thank the anonymous referees for helpful comments. This work has been supported in part by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150. Some of the results have been presented at the NeuroCOLT Workshop "New Perspectives in the Theory of Neural Nets" in Graz, Austria, on May 3, 2000. I am grateful to Wolfgang Maass for the invitation to give a talk at this meeting.

References

Andersen, R. A., Essick, G. K., & Siegel, R. M. (1985). Encoding of spatial location by posterior parietal neurons. Science, 230, 456–458.
Anthony, M. (1995). Classification by polynomial surfaces. Discrete Applied Mathematics, 61, 91–103.
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999a). Neural mechanisms for processing binocular information I. Simple cells. Journal of Neurophysiology, 82, 891–908.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999b). Neural mechanisms for processing binocular information II. Complex cells. Journal of Neurophysiology, 82, 909–924.
Bartlett, P. L., Maiorov, V., & Meir, R. (1998). Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10, 2159–2173.
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Ben-David, S., Cesa-Bianchi, N., Haussler, D., & Long, P. M. (1995). Characterizations of learnability for classes of {0, ..., n}-valued functions. Journal of Computer and System Sciences, 50, 74–86.
Ben-David, S., & Lindenbaum, M. (1998). Localization vs. identification of semialgebraic sets. Machine Learning, 32, 207–224.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blomfield, S. (1974). Arithmetical operations performed by nerve cells. Brain Research, 69, 115–124.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36, 929–965.
Bugmann, G. (1991). Summation and multiplication: Two distinct operation domains of leaky integrate-and-fire neurons. Network: Computation in Neural Systems, 2, 489–509.
Bugmann, G. (1992). Multiplying with neurons: Compensation for irregular input spike trains by using time-dependent synaptic efficiencies. Biological Cybernetics, 68, 87–92.
Burshtein, D. (1998). Long-term attraction in higher order neural networks. IEEE Transactions on Neural Networks, 9, 42–50.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14, 326–334.
Cover, T. M. (1968). Capacity problems for linear machines. In L. N. Kanal (Ed.), Pattern recognition (pp. 283–289). Washington, DC: Thompson Book Co.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Durbin, R., & Rumelhart, D. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, 133–142.
Fahner, G., & Eckmiller, R. (1994). Structural adaptation of parsimonious higher-order neural classifiers. Neural Networks, 7, 279–289.
Feldman, J. A., & Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205–254.
Gabbiani, F., Krapp, H. G., & Laurent, G. (1999). Computation of object approach by a wide-field, motion-sensitive neuron. Journal of Neuroscience, 19, 1122–1141.
Ghosh, J., & Shin, Y. (1992). Efficient higher-order neural networks for classification and function approximation. International Journal of Neural Systems, 3, 323–350.
Giles, C. L., & Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26, 4972–4978.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393–405.
Giles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., & Chen, D. (1990). Higher order recurrent networks and grammatical inference. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 380–387). San Mateo, CA: Morgan Kaufmann.
Goldberg, P. W., & Jerrum, M. R. (1995). Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18, 131–148.
Hancock, T. R., Golea, M., & Marchand, M. (1994). Learning nonoverlapping perceptron networks from examples and membership queries. Machine Learning, 16, 161–183.
Hatsopoulos, N., Gabbiani, F., & Laurent, G. (1995). Elementary computation of object approach by a wide-field visual neuron. Science, 270, 1000–1003.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150.
Haussler, D., Kearns, M., & Schapire, R. E. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–113.
Heywood, M., & Noakes, P. (1995). A framework for improved training of sigma-pi networks. IEEE Transactions on Neural Networks, 6, 893–903.
Hofmeister, T. (1994). Depth-efficient threshold circuits for arithmetic functions. In V. Roychowdhury, K.-Y. Siu, & A. Orlitsky (Eds.), Theoretical advances in neural computation and learning (pp. 37–84). Norwell, MA: Kluwer.
Ismail, A., & Engelbrecht, A. P. (2000). Global optimization algorithms for training product unit neural networks. In International Joint Conference on Neural Networks IJCNN'2000 (Vol. I, pp. 132–137). Los Alamitos, CA: IEEE Computer Society.
Janson, D. J., & Frenzel, J. F. (1993). Training product unit neural networks with genetic algorithms. IEEE Expert, 8(5), 26–33.
Karpinski, M., & Macintyre, A. (1997). Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54, 169–176.
Karpinski, M., & Werther, T. (1993). VC dimension and uniform learnability of sparse polynomials and rational functions. SIAM Journal on Computing, 22, 1276–1285.
Khovanskiĭ, A. G. (1991). Fewnomials. Providence, RI: American Mathematical Society.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation (pp. 315–345). Boston: Academic Press.
Koch, C., Poggio, T., & Torre, V. (1983). Nonlinear interactions in a dendritic tree: Localization, timing, and role in information processing. Proceedings of the National Academy of Sciences USA, 80, 2799–2802.
Koiran, P. (1996). VC dimension in circuit complexity. In Proceedings of the 11th Annual IEEE Conference on Computational Complexity CCC'96 (pp. 81–85). Los Alamitos, CA: IEEE Computer Society Press.
Koiran, P., & Sontag, E. D. (1997). Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54, 190–198.
Koiran, P., & Sontag, E. D. (1998). Vapnik-Chervonenkis dimension of recurrent neural networks. Discrete Applied Mathematics, 86, 63–79.
Kowalczyk, A., & Ferrá, H. L. (1994). Developing higher-order networks with empirically selected units. IEEE Transactions on Neural Networks, 5, 698–711.
Küpfmüller, K., & Jenik, F. (1961). Über die Nachrichtenverarbeitung in der Nervenzelle. Kybernetik, 1, 1–6.
Lee, Y. C., Doolen, G., Chen, H. H., Sun, G. Z., Maxwell, T., Lee, H., & Giles, C. L. (1986). Machine learning using a higher order correlation network. Physica D, 22, 276–306.
Leerink, L. R., Giles, C. L., Horne, B. G., & Jabri, M. A. (1995a). Learning with product units. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 537–544). Cambridge, MA: MIT Press.
Leerink, L. R., Giles, C. L., Horne, B. G., & Jabri, M. A. (1995b). Product unit learning (Tech. Rep. UMIACS-TR-95-80). College Park: University of Maryland.
Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Maass, W. (1994). Neural nets with super-linear VC-dimension. Neural Computation, 6, 877–884.
Maass, W. (1995a). Agnostic PAC learning of functions on analog neural nets. Neural Computation, 7, 1054–1078.
Maass, W. (1995b). Vapnik-Chervonenkis dimension of neural nets. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1000–1003). Cambridge, MA: MIT Press.
Maass, W. (1997). Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons. In M. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 211–217). Cambridge, MA: MIT Press.
Maass, W. (1998). A simple model for neural computation with firing rates and firing correlations. Network: Computation in Neural Systems, 9, 381–397.
Maass, W., & Turán, G. (1992). Lower bound methods and separation results for on-line learning models. Machine Learning, 9, 107–145.
Maxwell, T., Giles, C. L., Lee, Y. C., & Chen, H. H. (1986). Nonlinear dynamics of artificial neural systems. In J. S. Denker (Ed.), Neural networks for computing. New York: American Institute of Physics.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
Mel, B. W. (1992a). The clusteron: Toward a simple abstraction for a complex neuron. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 35–42). San Mateo, CA: Morgan Kaufmann.
Mel, B. W. (1992b). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Computation, 4, 502–517.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. Journal of Neurophysiology, 70, 1086–1101.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Mel, B. W., & Koch, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 474–481). San Mateo, CA: Morgan Kaufmann.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. Journal of Neuroscience, 18, 4325–4334.
Minsky, M. L., & Papert, S. A. (1988). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press.
Nilsson, N. J. (1965). Learning machines. New York: McGraw-Hill.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13, 4700–4719.
Omlin, C. W., & Giles, C. L. (1996a). Constructing deterministic finite-state automata in recurrent neural networks. Journal of the Association for Computing Machinery, 43, 937–972.
Omlin, C. W., & Giles, C. L. (1996b). Stable encoding of large finite-state automata in recurrent neural networks with sigmoid discriminants. Neural Computation, 8, 675–696.
Perantonis, S. J., & Lisboa, P. J. G. (1992). Translation, rotation, and scale invariant pattern recognition by high-order neural networks and moment classifiers. IEEE Transactions on Neural Networks, 3, 241–251.
Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8, 143–195.
Poggio, T. (1990). A theory of how the brain might work. In Cold Spring Harbor Symposia on Quantitative Biology (Vol. 55, pp. 899–910). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Poggio, T., & Girosi, F. (1990a). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1497.
Poggio, T., & Girosi, F. (1990b). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9, 222–237.
Psaltis, D., Park, C. H., & Hong, J. (1988). Higher order associative memories and their optical implementations. Neural Networks, 1, 149–163.
Rebotier, T. P., & Droulez, J. (1994). Sigma vs pi properties of spiking neurons. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, & A. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 3–10). Hillsdale, NJ: Erlbaum.
Redding, N. J., Kowalczyk, A., & Downs, T. (1993). Constructive higher-order network algorithm that is polynomial time. Neural Networks, 6, 997–1010.
Ring, M. (1993). Learning sequential tasks by incrementally adding higher orders. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 115–122). San Mateo, CA: Morgan Kaufmann.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Roy, A., & Mukhopadhyay, S. (1997). Iterative generation of higher-order nets in polynomial time using linear programming. IEEE Transactions on Neural Networks, 8, 402–412.
Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 45–76). Cambridge, MA: MIT Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Sakurai, A. (1993). Tighter bounds of the VC-dimension of three layer networks. In Proceedings of the World Congress on Neural Networks (Vol. 3, pp. 540–543). Hillsdale, NJ: Erlbaum.
Sakurai, A. (1999). Tight bounds for the VC-dimension of piecewise polynomial networks. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 323–329). Cambridge, MA: MIT Press.
Sakurai, A., & Yamasaki, M. (1992). On the capacity of n-h-s networks. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks (Vol. 2, pp. 237–240). Amsterdam: Elsevier.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proceedings of the National Academy of Sciences USA, 93, 11956–11961.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 211–229.
Schläfli, L. (1901). Theorie der vielfachen Kontinuität. Zürich: Zürcher & Furrer.
Schmidt, W. A. C., & Davis, J. P. (1993). Pattern recognition properties of various feature spaces for higher order neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 795–801.
Schmitt, M. (1999). On the sample complexity for nonoverlapping neural networks. Machine Learning, 37, 131–141.
Schmitt, M. (2000). Lower bounds on the complexity of approximating continuous functions by sigmoidal neural networks. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 328–334). Cambridge, MA: MIT Press.
Sejnowski, T. J. (1986). Higher-order Boltzmann machines. In J. S. Denker (Ed.), Neural networks for computing (pp. 398–403). New York: American Institute of Physics.
Shawe-Taylor, J., & Anthony, M. (1991). Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, 2, 107–117.
Shin, Y., & Ghosh, J. (1995). Ridge polynomial networks. IEEE Transactions on Neural Networks, 6, 610–622.
Siu, K.-Y., Roychowdhury, V., & Kailath, T. (1995). Discrete neural computation: A theoretical foundation. Englewood Cliffs, NJ: Prentice Hall.
Softky, W., & Koch, C. (1995). Single-cell models. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 879–884). Cambridge, MA: MIT Press.
Sontag, E. (1992). Feedforward nets for interpolation and classification. Journal of Computer and System Sciences, 45, 20–48.
Spirkovska, L., & Reid, M. B. (1994). Higher-order neural networks applied to 2D and 3D object recognition. Machine Learning, 15, 169–199.
Srinivasan, M. V., & Bernard, G. D. (1976). A proposed mechanism for multiplication of neural signals. Biological Cybernetics, 21, 227–236.
Suga, N. (1990). Cortical computational maps for auditory imaging. Neural Networks, 3, 3–21.
Suga, N., Olsen, J. F., & Butman, J. A. (1990). Specialized subsystems for processing biologically important complex sounds: Cross-correlation analysis for ranging in the bat's brain. In Cold Spring Harbor Symposia on Quantitative Biology (Vol. 55, pp. 585–597). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Tal, D., & Schwartz, E. L. (1997). Computing with the leaky integrate-and-fire neuron: Logarithmic computation and multiplication. Neural Computation, 9, 305–318.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
Venkatesh, S. S., & Baldi, P. (1991a). Programmed interactions in higher-order neural networks: Maximal capacity. Journal of Complexity, 7, 316–337.
Venkatesh, S. S., & Baldi, P. (1991b). Programmed interactions in higher-order neural networks: The outer-product algorithm. Journal of Complexity, 7, 443–479.
Warren, H. E. (1968). Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133, 167–178.
Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4, 406–414.
Williams, R. J. (1986). The logic of activation functions. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 423–443). Cambridge, MA: MIT Press.
Yoon, H., & Oh, J.-H. (1998). Learning of higher-order perceptrons with tunable complexities. Journal of Physics A: Math. Gen., 31, 7771–7784.

Received July 27, 2000; accepted April 9, 2001.
NOTE
Communicated by Eric Mjolsness
A Lagrange Multiplier and Hopfield-Type Barrier Function Method for the Traveling Salesman Problem

Chuangyin Dang
[email protected]
Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Kowloon, Hong Kong

Lei Xu
[email protected]
Department of Computer Science and Engineering, Chinese University of Hong Kong, New Territories, Hong Kong

A Lagrange multiplier and Hopfield-type barrier function method is proposed for approximating a solution of the traveling salesman problem. The method is derived from applications of Lagrange multipliers and a Hopfield-type barrier function and attempts to produce a solution of high quality by generating a minimum point of a barrier problem for a sequence of descending values of the barrier parameter. For any given value of the barrier parameter, the method searches for a minimum point of the barrier problem in a feasible descent direction, which has a desired property that lower and upper bounds on variables are always satisfied automatically if the step length is a number between zero and one. At each iteration, the feasible descent direction is found by updating Lagrange multipliers with a globally convergent iterative procedure. For any given value of the barrier parameter, the method converges to a stationary point of the barrier problem without any condition on the objective function. Theoretical and numerical results show that the method seems more effective and efficient than the softassign algorithm.

1 Introduction
The traveling salesman problem (TSP) is an NP-hard combinatorial optimization problem and has many important applications. In order to solve it, a number of classic algorithms and heuristics have been proposed. We refer to Lawler, Lenstra, Rinnooy Kan, and Shmoys (1985) for an excellent survey of techniques for solving the problem. Since Hopfield and Tank (1985), combinatorial optimization has become a popular topic in the literature of neural computation. Many neural computational models for combinatorial optimization have been developed. They include Aiyer, Niranjan, and Fallside (1990); van den Bout and Miller (1990);
Neural Computation 14, 303–324 (2001)
© 2001 Massachusetts Institute of Technology
304
Chuangyin Dang and Lei Xu
Durbin and Willshaw (1987); Gee, Aiyer, and Prager (1993); Gee and Prager (1994); Gold, Mjolsness, and Rangarajan (1994); Gold and Rangarajan (1996); Peterson and Soderberg (1989); Rangarajan, Gold, and Mjolsness (1996); Simic (1990); Urahama (1996); Wacholder, Han, and Mann (1989); Waugh and Westervelt (1993); Wolfe, Parry, and MacMillan (1994); Xu (1994); and Yuille and Kosowsky (1994). A systematic investigation of such neural computational models for combinatorial optimization can be found in van den Berg (1996) and Cichocki and Unbehaunen (1993). Most of these algorithms are of the deterministic annealing type, which is a heuristic continuation method that attempts to find the global minimum of the effective energy at high temperature and track it as the temperature decreases. There is no guarantee that the minimum at high temperature can always be tracked to the minimum at low temperature, but the experimental results are encouraging (Yuille & Kosowsky, 1994). We propose a Lagrange multiplier and a Hopfield-type barrier function method for approximating a solution of the TSP. The method is derived from applications of Lagrange multipliers to handle equality constraints and a Hopfield-type barrier function to deal with lower and upper bounds on variables. The method is a deterministic annealing algorithm that attempts to produce a high-quality solution by generating a minimum point of a barrier problem for a sequence of descending values of the barrier parameter. For any given value of the barrier parameter, the method searches for a minimum point of the barrier problem in a feasible descent direction, which has the desired property that the lower and upper bounds on variables are always satisfied automatically if the step length is a number between zero and one. At each iteration, the feasible descent direction is found by updating Lagrange multipliers with a globally convergent iterative procedure.
For any given value of the barrier parameter, the method converges to a stationary point of the barrier problem without any condition on the objective function. Theoretical and numerical results show that the method seems more effective and efficient than the softassign algorithm. The rest of this paper is organized as follows. We introduce the Hopfield-type barrier function and derive some properties in section 2. We present the method in section 3. We report some numerical results in section 4. We conclude in section 5.

2 Hopfield-Type Barrier Function
The problem we consider is as follows. Given $n$ cities, find a tour such that each city is visited exactly once and the total distance traveled is minimized. Let
$$v_{ik} = \begin{cases} 1 & \text{if city } i \text{ is the } k\text{th city to be visited in a tour,}\\ 0 & \text{otherwise,} \end{cases}$$
Traveling Salesman Problem
305
where $i = 1, 2, \dots, n$, $k = 1, 2, \dots, n$, and $v = (v_{11}, v_{12}, \dots, v_{1n}, \dots, v_{n1}, v_{n2}, \dots, v_{nn})^{\top}$. In Hopfield and Tank (1985), the problem was formulated as
$$\begin{array}{ll}
\min & \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} d_{ij} v_{ik} v_{j,k+1}\\[2pt]
\text{subject to} & \sum_{j=1}^{n} v_{ij} = 1, \quad i = 1, 2, \dots, n,\\[2pt]
& \sum_{i=1}^{n} v_{ij} = 1, \quad j = 1, 2, \dots, n,\\[2pt]
& v_{ij} \in \{0, 1\}, \quad i = 1, 2, \dots, n, \ j = 1, 2, \dots, n,
\end{array} \eqno(2.1)$$
where $d_{ij}$ denotes the distance from city $i$ to city $j$ and $v_{j,k+1} = v_{j1}$ for $k = n$. Clearly, for any given $r \ge 0$, equation 2.1 is equivalent to
$$\begin{array}{ll}
\min & e_0(v) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \sum_{k=1}^{n} d_{ij} v_{ik} v_{j,k+1} - \frac{1}{2} r v_{ij}^2 \right)\\[2pt]
\text{subject to} & \sum_{j=1}^{n} v_{ij} = 1, \quad i = 1, 2, \dots, n,\\[2pt]
& \sum_{i=1}^{n} v_{ij} = 1, \quad j = 1, 2, \dots, n,\\[2pt]
& v_{ij} \in \{0, 1\}, \quad i = 1, 2, \dots, n, \ j = 1, 2, \dots, n.
\end{array} \eqno(2.2)$$
The continuous relaxation of equation 2.2 yields
$$\begin{array}{ll}
\min & e_0(v) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( \sum_{k=1}^{n} d_{ij} v_{ik} v_{j,k+1} - \frac{1}{2} r v_{ij}^2 \right)\\[2pt]
\text{subject to} & \sum_{j=1}^{n} v_{ij} = 1, \quad i = 1, 2, \dots, n,\\[2pt]
& \sum_{i=1}^{n} v_{ij} = 1, \quad j = 1, 2, \dots, n,\\[2pt]
& 0 \le v_{ij} \le 1, \quad i = 1, 2, \dots, n, \ j = 1, 2, \dots, n.
\end{array} \eqno(2.3)$$
When $r$ is sufficiently large, one can see that an optimal solution of equation 2.3 is an integer solution. Thus, when $r$ is sufficiently large, equation 2.3 is equivalent to equation 2.1. The term $-\frac{1}{2} r \sum_{i=1}^{n} \sum_{j=1}^{n} v_{ij}^2$ was introduced in Rangarajan et al. (1996) to obtain a strictly concave function $e_0(v)$ on the null space of the constraint matrix for convergence of their softassign algorithm to a stationary point of a barrier problem. We note that the size of $r$ affects the quality of the solution produced by a deterministic annealing algorithm, and it should be as small as possible. However, when $r$ is a small positive number that still makes equation 2.3 equivalent to equation 2.1, the softassign algorithm may not converge to a stationary point of the barrier problem, since $e_0(v)$ may not be strictly concave on the null space of the constraint matrix. Numerical tests demonstrate that this indeed occurs with the softassign algorithm. Following Xu (1995), we introduce a Hopfield-type barrier term,
$$d(v_{ij}) = v_{ij} \ln v_{ij} + (1 - v_{ij}) \ln(1 - v_{ij}), \eqno(2.4)$$
to incorporate $0 \le v_{ij} \le 1$ into the objective function of equation 2.3 and obtain
$$\begin{array}{ll}
\min & e(v; \beta) = e_0(v) + \beta \sum_{i=1}^{n} \sum_{j=1}^{n} d(v_{ij})\\[2pt]
\text{subject to} & \sum_{j=1}^{n} v_{ij} = 1, \quad i = 1, 2, \dots, n,\\[2pt]
& \sum_{i=1}^{n} v_{ij} = 1, \quad j = 1, 2, \dots, n,
\end{array} \eqno(2.5)$$
where $\beta$ is a positive barrier parameter. The barrier term, equation 2.4, appeared first in an energy function given by Hopfield (1984) and has been extensively used in the literature. Instead of solving equation 2.3 directly, we consider a scheme that obtains a solution of it from the solution of equation 2.5 at the limit of $\beta \downarrow 0$.

Let $b(v) = \sum_{i=1}^{n} \sum_{j=1}^{n} d(v_{ij})$. Then $e(v; \beta) = e_0(v) + \beta b(v)$. Let
$$P = \left\{ v \;\middle|\; \begin{array}{l} \sum_{j=1}^{n} v_{ij} = 1, \quad i = 1, 2, \dots, n,\\ \sum_{i=1}^{n} v_{ij} = 1, \quad j = 1, 2, \dots, n,\\ 0 \le v_{ij} \le 1, \quad i = 1, 2, \dots, n, \ j = 1, 2, \dots, n \end{array} \right\}$$
and
$$B = \{ v \mid 0 \le v_{ij} \le 1, \ i = 1, 2, \dots, n, \ j = 1, 2, \dots, n \}.$$
Then $P$ is the feasible region of equation 2.3. Let us define $d(0) = d(1) = 0$. Since $\lim_{v_{ij} \to 0^+} d(v_{ij}) = \lim_{v_{ij} \to 1^-} d(v_{ij}) = 0$, the function $b(v)$ is continuous on $B$. From $b(v)$, we obtain
$$\frac{\partial b(v)}{\partial v_{ij}} = \ln v_{ij} - \ln(1 - v_{ij}) = \ln \frac{v_{ij}}{1 - v_{ij}}.$$
Then
$$\lim_{v_{ij} \to 0^+} \frac{\partial b(v)}{\partial v_{ij}} = -\infty \quad \text{and} \quad \lim_{v_{ij} \to 1^-} \frac{\partial b(v)}{\partial v_{ij}} = \infty.$$
Observe that
$$\frac{\partial e_0(v)}{\partial v_{ij}} = \sum_{k=1}^{n} (d_{ki} v_{k,j-1} + d_{ik} v_{k,j+1}) - r v_{ij},$$
where $v_{k,j-1} = v_{kn}$ for $j = 1$ and $v_{k,j+1} = v_{k1}$ for $j = n$. Thus, $\partial e_0(v) / \partial v_{ij}$ is bounded on $B$. From
$$\frac{\partial e(v; \beta)}{\partial v_{ij}} = \frac{\partial e_0(v)}{\partial v_{ij}} + \beta \frac{\partial b(v)}{\partial v_{ij}},$$
we obtain
$$\lim_{v_{ij} \to 0^+} \frac{\partial e(v; \beta)}{\partial v_{ij}} = -\infty \quad \text{and} \quad \lim_{v_{ij} \to 1^-} \frac{\partial e(v; \beta)}{\partial v_{ij}} = \infty.$$
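As a numerical sanity check (our own illustration; `e0`, `e0_grad`, and the finite-difference comparison are hypothetical helper names, not from the paper), the analytic expression for $\partial e_0(v)/\partial v_{ij}$ can be compared against central differences, and the divergence of $\partial b(v)/\partial v_{ij}$ near the box boundary is easy to observe:

```python
import numpy as np

def e0(v, d, r):
    n = v.shape[0]
    s = sum(d[i, j] * v[i, k] * v[j, (k + 1) % n]
            for i in range(n) for j in range(n) for k in range(n))
    return s - 0.5 * r * np.sum(v ** 2)

def e0_grad(v, d, r):
    """Analytic gradient: de0/dv_ij = sum_k (d_ki v_{k,j-1} + d_ik v_{k,j+1}) - r v_ij,
    with cyclic indexing in the position index j."""
    n = v.shape[0]
    g = np.zeros_like(v)
    for i in range(n):
        for j in range(n):
            g[i, j] = sum(d[k, i] * v[k, (j - 1) % n] + d[i, k] * v[k, (j + 1) % n]
                          for k in range(n)) - r * v[i, j]
    return g

rng = np.random.default_rng(0)
n, r = 4, 0.3
d = rng.random((n, n)); np.fill_diagonal(d, 0.0)
v = rng.uniform(0.1, 0.9, (n, n))

# Central finite differences agree with the analytic formula.
eps, fd = 1e-6, np.zeros((n, n))
for i in range(n):
    for j in range(n):
        vp, vm = v.copy(), v.copy()
        vp[i, j] += eps; vm[i, j] -= eps
        fd[i, j] = (e0(vp, d, r) - e0(vm, d, r)) / (2 * eps)
print(np.max(np.abs(fd - e0_grad(v, d, r))))   # small: formulas match

# The barrier derivative ln(v/(1-v)) diverges at the box boundary.
db = lambda t: np.log(t) - np.log(1 - t)
print(db(1e-9), db(1 - 1e-9))   # very negative, very positive
```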
Lemma 1. For any given $\beta > 0$, if $v^*$ is a local minimum point of equation 2.5, then $v^*$ is an interior point of $P$; that is, $0 < v^*_{ij} < 1$, $i = 1, 2, \dots, n$, $j = 1, 2, \dots, n$.¹
Let
$$L(v, \lambda^r, \lambda^c) = e(v; \beta) + \sum_{i=1}^{n} \lambda^r_i \left( \sum_{j=1}^{n} v_{ij} - 1 \right) + \sum_{j=1}^{n} \lambda^c_j \left( \sum_{i=1}^{n} v_{ij} - 1 \right).$$
Lemma 1 indicates that if $v^*$ is a local minimum point of equation 2.5, then there exist $\lambda^{r*}$ and $\lambda^{c*}$ satisfying
$$\begin{array}{l}
\nabla_v L(v^*, \lambda^{r*}, \lambda^{c*}) = 0,\\[2pt]
\sum_{j=1}^{n} v^*_{ij} = 1, \quad i = 1, 2, \dots, n,\\[2pt]
\sum_{i=1}^{n} v^*_{ij} = 1, \quad j = 1, 2, \dots, n,
\end{array}$$
where
$$\nabla_v L(v, \lambda^r, \lambda^c) = \left( \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{11}}, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{12}}, \dots, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{1n}}, \dots, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{n1}}, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{n2}}, \dots, \frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{nn}} \right)^{\top}$$
with
$$\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} = \frac{\partial e_0(v)}{\partial v_{ij}} + \lambda^r_i + \lambda^c_j + \beta \ln \frac{v_{ij}}{1 - v_{ij}},$$
$i = 1, 2, \dots, n$, $j = 1, 2, \dots, n$.

Let $\beta_k$, $k = 1, 2, \dots$, be a sequence of positive numbers satisfying $\beta_1 > \beta_2 > \cdots$ and $\lim_{k \to \infty} \beta_k = 0$. For $k = 1, 2, \dots$, let $v(\beta_k)$ denote a global minimum point of equation 2.5 with $\beta = \beta_k$.

Theorem 1. For $k = 1, 2, \dots$,
$$e_0(v(\beta_k)) \ge e_0(v(\beta_{k+1})),$$
and any limit point of $v(\beta_k)$, $k = 1, 2, \dots$, is a global minimum point of equation 2.3.

¹ All the proofs of lemmas and theorems in this article can be found on-line at www.cityu.edu.hk/meem/mecdang.
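The monotonicity in Theorem 1 is easy to see in a one-variable caricature (our own illustrative example, not from the paper): for $\min_{0 \le t \le 1} -t + \beta\, d(t)$ with the barrier $d(t) = t \ln t + (1-t)\ln(1-t)$, stationarity gives $-1 + \beta \ln(t/(1-t)) = 0$, so the minimizer is $t(\beta) = 1/(1 + e^{-1/\beta})$; hence $e_0(t(\beta)) = -t(\beta)$ decreases as $\beta \downarrow 0$, and $t(\beta) \to 1$, an integer point:

```python
import math

def t_star(beta):
    # Minimizer of -t + beta*(t ln t + (1-t) ln(1-t)) on (0, 1):
    # the objective is strictly convex, and stationarity gives
    # ln(t/(1-t)) = 1/beta, i.e. a sigmoid in 1/beta.
    return 1.0 / (1.0 + math.exp(-1.0 / beta))

betas = [2.0, 1.0, 0.5, 0.25, 0.1]
vals = [-t_star(b) for b in betas]   # e_0 at the minimizer for descending beta
print(vals)                          # strictly decreasing sequence
print(t_star(0.05))                  # close to 1: an integer solution in the limit
```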
This theorem indicates that a global minimum point of equation 2.3 can be obtained if we are able to generate a global minimum point of equation 2.5 for a sequence of descending values of the barrier parameter with zero limit.

Theorem 2. For $k = 1, 2, \dots$, let $v^k$ be a local minimum point of equation 2.5 with $\beta = \beta_k$. For any limit point $v^*$ of $v^k$, $k = 1, 2, \dots$, if there are no $\lambda^r = (\lambda^r_1, \lambda^r_2, \dots, \lambda^r_n)^{\top}$ and $\lambda^c = (\lambda^c_1, \lambda^c_2, \dots, \lambda^c_n)^{\top}$ satisfying
$$\frac{\partial e_0(v^*)}{\partial v_{ij}} + \lambda^r_i + \lambda^c_j = 0,$$
$i = 1, 2, \dots, n$, $j = 1, 2, \dots, n$, then $v^*$ is a local minimum point of equation 2.3.

This theorem indicates that at least a local minimum point of equation 2.3 can be obtained if we are able to generate a local minimum point of equation 2.5 for a sequence of descending values of the barrier parameter with zero limit.

3 The Method
Motivated by the results in the previous section, we propose in this section a method for approximating a solution of equation 2.3. The idea of the method is as follows: Choose $\beta_0$ to be a sufficiently large, positive number such that $e(v; \beta_0)$ is strictly convex. Let $\beta_q$, $q = 0, 1, \dots$, be a sequence of positive numbers satisfying $\beta_0 > \beta_1 > \cdots$ and $\lim_{q \to \infty} \beta_q = 0$. Choose $v^{*,0}$ to be the unique minimum point of equation 2.5 with $\beta = \beta_0$. For $q = 1, 2, \dots$, starting at $v^{*,q-1}$, we search for a minimum point $v^{*,q}$ of equation 2.5 with $\beta = \beta_q$.

Given any $\beta > 0$, consider the first-order necessary optimality condition for equation 2.5:
$$\nabla_v L(v, \lambda^r, \lambda^c) = 0, \quad \sum_{j=1}^n v_{ij} = 1, \; i = 1, 2, \ldots, n, \quad \sum_{i=1}^n v_{ij} = 1, \; j = 1, 2, \ldots, n.$$

From

$$\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} = \frac{\partial e_0(v)}{\partial v_{ij}} + \lambda^r_i + \lambda^c_j + \beta \ln\frac{v_{ij}}{1 - v_{ij}} = 0,$$

we obtain

$$v_{ij} = \frac{1}{1 + \exp\left(\left(\frac{\partial e_0(v)}{\partial v_{ij}} + \lambda^r_i + \lambda^c_j\right)/\beta\right)}.$$

Let $r_i = \exp(\lambda^r_i/\beta)$ and $c_j = \exp(\lambda^c_j/\beta)$. Then,

$$v_{ij} = \frac{1}{1 + r_i c_j \exp\left(\frac{\partial e_0(v)}{\partial v_{ij}}/\beta\right)}.$$

For convenience of the following discussions, let $a_{ij}(v) = \exp\left(\frac{\partial e_0(v)}{\partial v_{ij}}/\beta\right)$. Then,

$$v_{ij} = \frac{1}{1 + r_i c_j a_{ij}(v)}. \tag{3.1}$$
Substituting equation 3.1 into $\sum_{j=1}^n v_{ij} = 1$, $i = 1, 2, \ldots, n$, and $\sum_{i=1}^n v_{ij} = 1$, $j = 1, 2, \ldots, n$, we obtain

$$\sum_{j=1}^n \frac{1}{1 + r_i c_j a_{ij}(v)} = 1, \; i = 1, 2, \ldots, n, \quad \sum_{i=1}^n \frac{1}{1 + r_i c_j a_{ij}(v)} = 1, \; j = 1, 2, \ldots, n. \tag{3.2}$$

Based on the above notations, a conceptual algorithm was proposed in Xu (1995) for approximating a solution of equation 2.3, which is as follows: Fix $r$ and $c$, and use equation 3.1 to obtain $v$. Fix $v$, and solve equation 3.2 for $r$ and $c$. Let

$$h_{ij}(v, r, c) = \frac{1}{1 + r_i c_j a_{ij}(v)}$$

and

$$h(v, r, c) = (h_{11}(v, r, c), h_{12}(v, r, c), \ldots, h_{1n}(v, r, c), \ldots, h_{n1}(v, r, c), h_{n2}(v, r, c), \ldots, h_{nn}(v, r, c))^\top.$$
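The map $h(v, r, c)$ of equation 3.1 is straightforward to compute. As a minimal numerical sketch (assuming the gradient matrix $\partial e_0(v)/\partial v_{ij}$ is supplied by the caller; the objective $e_0$ itself comes from equation 2.1 and is not reproduced here):

```python
import numpy as np

def h_map(grad_e0, r, c, beta):
    """Equation 3.1: h_ij(v, r, c) = 1 / (1 + r_i * c_j * a_ij(v)),
    where a_ij(v) = exp((d e_0(v) / d v_ij) / beta)."""
    a = np.exp(grad_e0 / beta)               # a_ij(v)
    return 1.0 / (1.0 + np.outer(r, c) * a)  # n-by-n matrix with entries in (0, 1)
```

For instance, when the gradient is identically zero and $r = c = (1, \ldots, 1)^\top$, every entry of $h$ equals $1/2$, the analytic center of the box constraints.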
If $v$ is an interior point of $B$, the following lemma shows that $h(v, r, c) - v$ is a descent direction of $L(v, \lambda^r, \lambda^c)$.

Lemma 2. Assume $0 < v_{ij} < 1$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$.

1. $\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} > 0$ if $h_{ij}(v, r, c) - v_{ij} < 0$.

2. $\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} < 0$ if $h_{ij}(v, r, c) - v_{ij} > 0$.

3. $\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} = 0$ if $h_{ij}(v, r, c) - v_{ij} = 0$.

4. $(h(v, r, c) - v)^\top \nabla_v L(v, \lambda^r, \lambda^c) < 0$ if $h(v, r, c) - v \ne 0$.

5. $(h(v, r, c) - v)^\top \nabla_v e(v; \beta) < 0$ if $h(v, r, c) - v \ne 0$ and $\sum_{k=1}^n (h_{ik}(v, r, c) - v_{ik}) = \sum_{k=1}^n (h_{kj}(v, r, c) - v_{kj}) = 0$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$.
Proof. We only need to show that $\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} > 0$ if $h_{ij}(v, r, c) - v_{ij} < 0$. The rest can be obtained similarly or in a straightforward manner. From

$$h_{ij}(v, r, c) - v_{ij} = \frac{1}{1 + r_i c_j a_{ij}(v)} - v_{ij} < 0,$$

we obtain

$$1 < r_i c_j a_{ij}(v)\,\frac{v_{ij}}{1 - v_{ij}}. \tag{3.3}$$

Applying the natural logarithm, $\ln$, to both sides of equation 3.3, we get

$$0 < \ln\left(r_i c_j a_{ij}(v)\,\frac{v_{ij}}{1 - v_{ij}}\right) = \ln a_{ij}(v) + \ln r_i + \ln c_j + \ln\frac{v_{ij}}{1 - v_{ij}} = \frac{1}{\beta}\frac{\partial e_0(v)}{\partial v_{ij}} + \frac{1}{\beta}\lambda^r_i + \frac{1}{\beta}\lambda^c_j + \frac{1}{\beta}\ln\frac{v_{ij}}{1 - v_{ij}} = \frac{1}{\beta}\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}}.$$

Thus,

$$\frac{\partial L(v, \lambda^r, \lambda^c)}{\partial v_{ij}} > 0.$$
The lemma follows.

Since $0 < h_{ij}(v, r, c) < 1$, we note that the descent direction $h(v, r, c) - v$ has the desired property that any point generated along $h(v, r, c) - v$ automatically satisfies the lower and upper bounds if $v \in B$ and the step length is a number between zero and one. For any given point $v$, we use $(r(v), c(v))$ to denote a positive solution of equation 3.2. Let $v$ be an interior point of $P$. In order for $h(v, r, c) - v$ to become a feasible descent direction of equation 2.5, we need to compute a positive solution $(r(v), c(v))$ of equation 3.2. Let

$$f(r, c) = \frac{1}{2}\left(\sum_{i=1}^n \left(\sum_{j=1}^n \frac{1}{1 + r_i c_j a_{ij}(v)} - 1\right)^2 + \sum_{j=1}^n \left(\sum_{i=1}^n \frac{1}{1 + r_i c_j a_{ij}(v)} - 1\right)^2\right).$$
Observe that the value of $f(r, c)$ equals zero only at a solution of equation 3.2. For $i = 1, 2, \ldots, n$, let

$$x_i(r, c) = r_i\left(\sum_{j=1}^n \frac{1}{1 + r_i c_j a_{ij}(v)} - 1\right),$$

and for $j = 1, 2, \ldots, n$, let

$$y_j(r, c) = c_j\left(\sum_{i=1}^n \frac{1}{1 + r_i c_j a_{ij}(v)} - 1\right).$$

Let $x(r, c) = (x_1(r, c), x_2(r, c), \ldots, x_n(r, c))^\top$ and $y(r, c) = (y_1(r, c), y_2(r, c), \ldots, y_n(r, c))^\top$. One can easily prove that $(x(r, c)^\top, y(r, c)^\top)^\top$ is a descent direction of $f(r, c)$. For any given $v$, based on this descent direction, the following iterative procedure is proposed for computing a positive solution $(r(v), c(v))$ of equation 3.2. Take $(r^0, c^0)$ to be an arbitrary positive vector, and for $k = 0, 1, \ldots$, let

$$r^{k+1} = r^k + \mu_k x(r^k, c^k), \quad c^{k+1} = c^k + \mu_k y(r^k, c^k), \tag{3.4}$$

where $\mu_k$ is a number in $[0, 1]$ satisfying

$$f(r^{k+1}, c^{k+1}) = \min_{\mu \in [0,1]} f(r^k + \mu\, x(r^k, c^k),\; c^k + \mu\, y(r^k, c^k)).$$
Observe that $(r^k, c^k) > 0$, $k = 0, 1, \ldots$. There are many ways to determine $\mu_k$ (Minoux, 1986). For example, one can simply choose $\mu_k$ to be any number in $(0, 1]$ satisfying $\sum_{l=0}^k \mu_l \to \infty$ and $\mu_k \to 0$ as $k \to \infty$. We have found in our numerical tests that when $\mu_k$ is any fixed number in $(0, 1]$, the iterative procedure, equation 3.4, converges to a positive solution of equation 3.2.

Theorem 3. For any given $v$, every limit point of $(r^k, c^k)$, $k = 0, 1, \ldots$, generated by the iterative procedure, equation 3.4, is a positive solution of equation 3.2.
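The iterative procedure, equation 3.4, can be sketched as follows (a minimal illustration, not the authors' MATLAB implementation: the matrix `a` holds the fixed values $a_{ij}(v) > 0$, and the exact line search over $\mu$ is replaced by a fixed step $\mu \in (0, 1]$, a variant the text notes was also found to converge):

```python
import numpy as np

def solve_rc(a, r0, c0, mu=0.95, tol=1e-3, max_iter=10000):
    """Iterative procedure (3.4): given the positive matrix a_ij(v), find
    positive (r, c) solving equation 3.2, i.e., making the matrix
    h_ij = 1 / (1 + r_i c_j a_ij) doubly stochastic."""
    r, c = r0.copy(), c0.copy()
    for _ in range(max_iter):
        h = 1.0 / (1.0 + np.outer(r, c) * a)
        row_res = h.sum(axis=1) - 1.0        # row-sum residuals
        col_res = h.sum(axis=0) - 1.0        # column-sum residuals
        f = 0.5 * (row_res @ row_res + col_res @ col_res)  # merit function f(r, c)
        if np.sqrt(f) < tol:                 # stopping rule used in section 4
            break
        r = r + mu * r * row_res             # r^{k+1} = r^k + mu * x(r^k, c^k)
        c = c + mu * c * col_res             # c^{k+1} = c^k + mu * y(r^k, c^k)
    return r, c
```

When a row sum of $h$ exceeds one, $x_i(r, c) > 0$, so $r_i$ increases and that row sum decreases toward one; the column updates act symmetrically, which is why the residuals shrink.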
Based on the feasible descent direction, $h(v, r(v), c(v)) - v$, and the iterative procedure, equation 3.4, we have developed a method for approximating a solution of equation 2.3, which can be stated as follows:

Step 0: Let $\epsilon > 0$ be a given tolerance. Let $\beta_0$ be a sufficiently large positive number such that $e(v; \beta_0)$ is convex. Choose an arbitrary interior point $\bar{v} \in B$ and two arbitrary positive vectors, $r^0$ and $c^0$. Take an arbitrary positive number $\gamma \in (0, 1)$ (in general, $\gamma$ should be close to one). Given $v = \bar{v}$, use equation 3.4 to obtain a positive solution $(r(\bar{v}), c(\bar{v}))$ of equation 3.2. Let $r^0 = r(\bar{v})$ and $c^0 = c(\bar{v})$. Let

$$v^0 = (v^0_{11}, v^0_{12}, \ldots, v^0_{1n}, \ldots, v^0_{n1}, v^0_{n2}, \ldots, v^0_{nn})^\top \quad \text{with} \quad v^0_{ij} = \frac{1}{1 + r_i(\bar{v})\, c_j(\bar{v})\, a_{ij}(\bar{v})},$$

where $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$. Let $q = 0$ and $k = 0$, and go to step 1.

Step 1: Given $v = v^k$, use equation 3.4 to obtain a positive solution $(r(v^k), c(v^k))$ of equation 3.2. Let $r^0 = r(v^k)$ and $c^0 = c(v^k)$. Go to step 2.

Step 2: Let

$$h(v^k, r(v^k), c(v^k)) = (h_{11}(v^k, r(v^k), c(v^k)), h_{12}(v^k, r(v^k), c(v^k)), \ldots, h_{nn}(v^k, r(v^k), c(v^k)))^\top$$

with

$$h_{ij}(v^k, r(v^k), c(v^k)) = \frac{1}{1 + r_i(v^k)\, c_j(v^k)\, a_{ij}(v^k)},$$

where $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$.

If $\|h(v^k, r(v^k), c(v^k)) - v^k\| < \epsilon$, do as follows: If $\beta_q$ is sufficiently small, the method terminates. Otherwise, let $v^{*,q} = v^k$, $v^0 = v^k$, $\beta_{q+1} = \gamma\beta_q$, $q = q + 1$, and $k = 0$, and go to step 1.

If $\|h(v^k, r(v^k), c(v^k)) - v^k\| \ge \epsilon$, do as follows: Compute

$$v^{k+1} = v^k + \eta_k\,(h(v^k, r(v^k), c(v^k)) - v^k), \tag{3.5}$$

where $\eta_k$ is a number in $[0, 1]$ satisfying

$$e(v^{k+1}; \beta_q) = \min_{\eta \in [0,1]} e(v^k + \eta\,(h(v^k, r(v^k), c(v^k)) - v^k); \beta_q).$$

Let $k = k + 1$, and go to step 1.
Note that an exact positive solution $(r(v^k), c(v^k))$ of equation 3.2 for $v = v^k$ and an exact solution of $\min_{\eta \in [0,1]} e(v^k + \eta\,(h(v^k, r(v^k), c(v^k)) - v^k); \beta_q)$ are not required in the implementation of the method; their approximate solutions will do. There are many ways to determine $\eta_k$ (Minoux, 1986). For example, one can simply choose $\eta_k$ to be any number in $(0, 1]$ satisfying $\sum_{l=0}^k \eta_l \to \infty$ and $\eta_k \to 0$ as $k \to \infty$. The method is insensitive to the starting point since $e(v; \beta_0)$ is convex over $B$.

Theorem 4. For $\beta = \beta_q$, every limit point of $v^k$, $k = 0, 1, \ldots$, generated by equation 3.5 is a stationary point of equation 2.5.
Although it is difficult to prove that for any given $\beta > 0$, a limit point of $v^k$, $k = 0, 1, \ldots$, generated by equation 3.5 is at least a local minimum point of equation 2.5, in general it is indeed at least a local minimum point of equation 2.5. Theorem 2 implies that every limit point of $v^{*,q}$, $q = 0, 1, \ldots$, is at least a local minimum point of equation 2.3 if $v^{*,q}$ is a minimum point of equation 2.5 with $\beta = \beta_q$. For $\beta = \beta_q$, our method can be proved to converge to a stationary point of equation 2.5 for any given $\rho$; however, the softassign algorithm can be proved to converge to a stationary point of equation 2.5 only if $\rho$ is sufficiently large so that $e_0(v)$ is strictly concave on the null space of the constraint matrix (Rangarajan, Yuille, & Mjolsness, 1999). Numerical tests also show that the softassign algorithm does not converge to a stationary point of equation 2.5 if this condition is not satisfied. Thus, for the softassign algorithm to converge, one has to determine the size of $\rho$ by estimating the maximum eigenvalue of the matrix of the objective function of equation 2.1, which requires some extra computational work. As we pointed out, the size of $\rho$ affects the quality of a solution generated by a deterministic annealing algorithm, and it should be as small as possible. Since our method converges for any $\rho$, one can start with a smaller positive $\rho$ and then increase $\rho$ if the solution generated by the method is not a near-integer solution. In this respect, our method is better than the softassign algorithm. Numerical results support this argument.

4 Numerical Results
The method has been used to approximate solutions of a number of TSP instances, and it succeeds in finding an optimal or near-optimal tour for each of them. In our implementation of the method:

1. $\epsilon = 0.01$ and $\beta_0 = 200$.

2. We take $r^0 = (r^0_1, r^0_2, \ldots, r^0_n)^\top$ and $c^0 = (c^0_1, c^0_2, \ldots, c^0_n)^\top$ to be two random vectors satisfying $0 < r^0_i < 1$ and $0 < c^0_i < 1$, $i = 1, 2, \ldots, n$.
3. $\mu_k = 0.95$, and for any given $v$, the iterative procedure, equation 3.4, terminates as soon as $\sqrt{f(r^k, c^k)} < 0.001$.

4. We replace $e(v; \beta)$ with $L(v, \lambda^r, \lambda^c)$ in the method since $(r(v^k), c(v^k))$ is an approximate solution of equation 3.2.

5. $\eta_k$ is determined with the following Armijo-type line search: $\eta_k = \xi^{m_k}$, with $m_k$ being the smallest nonnegative integer satisfying

$$L(v^k + \xi^{m_k}(h(v^k, r(v^k), c(v^k)) - v^k), \lambda^{r,k}, \lambda^{c,k}) \le L(v^k, \lambda^{r,k}, \lambda^{c,k}) + \xi^{m_k}\zeta\,(h(v^k, r(v^k), c(v^k)) - v^k)^\top \nabla_v L(v^k, \lambda^{r,k}, \lambda^{c,k}),$$

where $\xi$ and $\zeta$ are any two numbers in $(0, 1)$ (we set $\xi = 0.6$ and $\zeta = 0.8$), $\lambda^{r,k} = \beta_q(\ln r_1(v^k), \ln r_2(v^k), \ldots, \ln r_n(v^k))^\top$, and $\lambda^{c,k} = \beta_q(\ln c_1(v^k), \ln c_2(v^k), \ldots, \ln c_n(v^k))^\top$.

The method terminates as soon as $\beta_q < 1$. To produce a solution of higher quality, the size of $\rho$ should be as small as possible. However, a small $\rho$ may lead to a fractional solution $v^{*,q}$. To make sure that an integer solution will be generated, we continue with the following procedure:

Step 0: Let $\beta = 1$, $v^0 = v^{*,q}$, and $k = 0$. Go to step 1.
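The Armijo-type rule in item 5 can be sketched generically as follows (a minimal illustration, not the authors' MATLAB code: `L` is any smooth objective supplied by the caller, `d` a descent direction at `v`, `grad_dot_d` the directional derivative $(\nabla L(v))^\top d < 0$, and 0.6 and 0.8 are the two backtracking constants used above):

```python
def armijo_step(L, v, d, grad_dot_d, shrink=0.6, slope=0.8, max_backtracks=50):
    """Return eta_k = shrink**m_k, with m_k the smallest nonnegative integer
    such that L(v + eta*d) <= L(v) + eta * slope * <grad L(v), d>."""
    L0 = L(v)
    eta = 1.0
    for _ in range(max_backtracks):
        if L(v + eta * d) <= L0 + eta * slope * grad_dot_d:
            return eta      # sufficient decrease achieved
        eta *= shrink       # try the next power of the shrink factor
    return eta
```

For example, for $L(v) = v^2$ at $v = 1$ with $d = -2$ (the negative gradient, so the directional derivative is $-4$), the rule backtracks until the sufficient-decrease condition holds.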
Step 1: Let $v^* = (v^*_{11}, v^*_{12}, \ldots, v^*_{1n}, \ldots, v^*_{n1}, v^*_{n2}, \ldots, v^*_{nn})^\top$ with

$$v^*_{ij} = \begin{cases} 1 & \text{if } v^k_{ij} \ge 0.9, \\ 0 & \text{if } v^k_{ij} < 0.9, \end{cases}$$

$i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$. If $v^* \in P$, the procedure terminates. Otherwise, let $\rho = \rho + 2$, and go to step 2.

Step 2: Given $v = v^k$, use equation 3.4 to obtain a positive solution $(r(v^k), c(v^k))$ of equation 3.2. Let $r^0 = r(v^k)$, $c^0 = c(v^k)$, $\lambda^{r,k} = (\ln r_1(v^k), \ln r_2(v^k), \ldots, \ln r_n(v^k))^\top$, and $\lambda^{c,k} = (\ln c_1(v^k), \ln c_2(v^k), \ldots, \ln c_n(v^k))^\top$. Go to step 3.

Step 3: Let

$$h(v^k, r(v^k), c(v^k)) = (h_{11}(v^k, r(v^k), c(v^k)), h_{12}(v^k, r(v^k), c(v^k)), \ldots, h_{nn}(v^k, r(v^k), c(v^k)))^\top$$

with

$$h_{ij}(v^k, r(v^k), c(v^k)) = \frac{1}{1 + r_i(v^k)\, c_j(v^k)\, a_{ij}(v^k)},$$

where $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$. If $\|h(v^k, r(v^k), c(v^k)) - v^k\| < \epsilon$, let $v^0 = v^k$ and $k = 0$, and go to step 1. Otherwise, compute

$$v^{k+1} = v^k + \eta_k\,(h(v^k, r(v^k), c(v^k)) - v^k),$$

where $\eta_k$ is a number in $[0, 1]$ satisfying

$$L(v^{k+1}, \lambda^{r,k}, \lambda^{c,k}) = \min_{\eta \in [0,1]} L(v^k + \eta\,(h(v^k, r(v^k), c(v^k)) - v^k), \lambda^{r,k}, \lambda^{c,k}).$$
Let $k = k + 1$, and go to step 2.

The method is programmed in MATLAB. To compare the method with the softassign algorithm proposed in Gold et al. (1994) and Rangarajan et al. (1996, 1999) and with the softassign algorithm modified by introducing a line search, the softassign algorithm and its modified version are also programmed in MATLAB. All our numerical tests are done on a PC. In the presentation of numerical results, DM stands for our method, SA for the softassign algorithm, MSA for the modified version of the softassign algorithm, CT for the computation time in seconds, OPT for the length of an optimal tour, OBJ for the length of a tour generated by an algorithm, OBJD for the length of the tour generated by our method, and OBJSA for the length of the tour generated by the softassign algorithm or its modified version; $\mathrm{RE} = (\mathrm{OBJ} - \mathrm{OPT})/\mathrm{OPT}$. Numerical results are as follows.

Example 1. These ten TSP instances are from a well-known web site, TSPLIB. We have used the method, the softassign algorithm, and the modified softassign algorithm to approximate solutions of these TSP instances. Numerical results are presented in Figures 1, 2, 3, and 4 and Table 1, where the softassign algorithm fails to converge when $\rho = 30$.
Example 2. These TSP instances have 100 cities and are generated randomly. Every city is a point in a square with integer coordinates $(x, y)$
Figure 1: Relative error to optimal tour. 1. bays29, 2. att48, 3. eil51, 4. berlin52, 5. st70, 6. eil76, 7. pr76, 8. rd100, 9. eil101, 10. lin105.
Figure 2: Computation time for different algorithms. 1. bays29, 2. att48, 3. eil51, 4. berlin52, 5. st70, 6. eil76, 7. pr76, 8. rd100, 9. eil101, 10. lin105.
Figure 3: Relative error to optimal tour. 1. bays29, 2. att48, 3. eil51, 4. berlin52, 5. st70, 6. eil76, 7. pr76, 8. rd100, 9. eil101, 10. lin105.
Figure 4: Computation time for different algorithms. 1. bays29, 2. att48, 3. eil51, 4. berlin52, 5. st70, 6. eil76, 7. pr76, 8. rd100, 9. eil101, 10. lin105.
Table 1: Numerical Results for TSP Instances from TSPLIB. [Table entries: computation time CT, tour length OBJ, and relative error RE(%) for DM ($\rho = 30$ and $\rho = 80$), SA ($\rho = 80$), and MSA ($\rho = 30$) on the instances bays29, att48, eil51, berlin52, st70, eil76, pr76, rd100, eil101, and lin105, at $\gamma = 0.9$ and $\gamma = 0.95$.]
satisfying $0 \le x \le 100$ and $0 \le y \le 100$. We have used the method, the softassign algorithm, and the modified softassign algorithm to approximate solutions of a number of TSP instances. Numerical results are presented in Table 2, where the softassign algorithm fails to converge when $\rho = 30$. From these numerical results, one can see that our method seems more effective and efficient than the softassign algorithm. Comparing our method with the softassign algorithm modified by introducing a line search, one can find that our method is significantly superior to the modified softassign algorithm in computation time, although the quality of the solutions generated by our method is on average only slightly better than that of the solutions generated by the modified softassign algorithm. The reason that our method is faster than the softassign algorithm and its modified version lies in the procedures for updating the Lagrange multipliers: our procedure is much more efficient than Sinkhorn's approach adopted in the softassign algorithm. Although our method has advantages over the softassign algorithm and its modified version, it still may not compete with the elastic net and nonneural algorithms for the TSP. The idea presented here for constructing a procedure to update Lagrange multipliers can also be applied to solving more complicated problems.

5 Conclusion
We have developed a Lagrange multiplier and Hopfield-type barrier function method for approximating a solution of the TSP, and some theoretical results have been derived. For any given barrier parameter, we have proved that the method converges to a stationary point of equation 2.5 without any condition on the objective function, which is stronger than the convergence result for the softassign algorithm. The numerical results show that the method seems more effective and efficient than the softassign algorithm. The method could be further improved with a faster iterative procedure for updating the Lagrange multipliers to obtain a feasible descent direction.

Acknowledgments
We thank the anonymous referees for their constructive comments and remarks, which have significantly improved the quality of this article. The preliminary version of this article was completed while C.D. was on leave at the Chinese University of Hong Kong from 1997 to 1998. The work was supported by SRG 7001061 of CityU, CUHK Direct Grant 220500680, Ho Sin-Hang Education Endowment Fund HSH 95/02, and the Research Fellow and Research Associate Scheme of the CUHK Research Committee.
Table 2: Numerical Results for TSP Instances Generated Randomly. [Table entries: computation time CT, tour length OBJ, and the ratio OBJD/OBJSA for DM ($\rho = 30$ and $\rho = 80$), SA ($\rho = 80$), and MSA ($\rho = 30$) on ten randomly generated 100-city instances, at $\gamma = 0.9$ and $\gamma = 0.95$.]
References

Aiyer, S., Niranjan, M., & Fallside, F. (1990). A theoretical investigation into the performance of the Hopfield model. IEEE Transactions on Neural Networks, 1, 204–215.

Cichocki, A., & Unbehaunen, R. (1993). Neural networks for optimization and signal processing. New York: Wiley.

Durbin, R., & Willshaw, D. (1987). An analogue approach to the traveling salesman problem using an elastic network method. Nature, 326, 689–691.

Gee, A., Aiyer, S., & Prager, R. (1993). An analytical framework for optimizing neural networks. Neural Networks, 6, 79–97.

Gee, A., & Prager, R. (1994). Polyhedral combinatorics and neural networks. Neural Computation, 6, 161–180.

Gold, S., Mjolsness, E., & Rangarajan, A. (1994). Clustering with a domain-specific distance measure. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 96–103). San Mateo, CA: Morgan Kaufmann.

Gold, S., & Rangarajan, A. (1996). Softassign versus softmax: Benchmarking in combinatorial optimization. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 626–632). Cambridge, MA: MIT Press.

Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences of the USA, 81, 3088–3092.

Hopfield, J., & Tank, D. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.

Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., & Shmoys, D. B. (1985). The traveling salesman problem. New York: Wiley.

Minoux, M. (1986). Mathematical programming: Theory and algorithms. New York: Wiley.

Peterson, C., & Soderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3–22.

Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8, 1041–1060.

Rangarajan, A., Yuille, A., & Mjolsness, E. (1999). Convergence properties of the softassign quadratic assignment algorithm. Neural Computation, 11, 1455–1474.

Simic, P. (1990). Statistical mechanics as the underlying theory of "elastic" and "neural" optimizations. Networks, 1, 89–103.

Urahama, K. (1996). Gradient projection network: Analog solver for linearly constrained nonlinear programming. Neural Computation, 6, 1061–1073.

van den Berg, J. (1996). Neural relaxation dynamics. Unpublished doctoral dissertation, Erasmus University of Rotterdam, Rotterdam, Netherlands.

van den Bout, D., & Miller, T., III (1990). Graph partitioning using annealed networks. IEEE Transactions on Neural Networks, 1, 192–203.
Wacholder, E., Han, J., & Mann, R. (1989). A neural network algorithm for the multiple traveling salesman problem. Biological Cybernetics, 61, 11–19.

Waugh, F., & Westervelt, R. (1993). Analog neural networks with local competition: I. Dynamics and stability. Physical Review E, 47, 4524–4536.

Wolfe, W., Parry, M., & MacMillan, J. (1994). Hopfield-style neural networks and the TSP. In Proceedings of the IEEE International Conference on Neural Networks, 7 (pp. 4577–4582). Piscataway, NJ: IEEE Press.

Xu, L. (1994). Combinatorial optimization neural nets based on a hybrid of Lagrange and transformation approaches. In Proceedings of the World Congress on Neural Networks (pp. 399–404). San Diego, CA.

Xu, L. (1995). On the hybrid LT combinatorial optimization: New U-shape barrier, sigmoid activation, least leaking energy and maximum entropy. In Proceedings of the International Conference on Neural Information Processing (ICONIP'95) (pp. 309–312). Beijing: Publishing House of Electronics Industry.

Yuille, A., & Kosowsky, J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356.
Received March 29, 1999; accepted April 27, 2001.
NOTE
Communicated by Jonathan Victor
The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis Emery N. Brown
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A., and Division of Health Sciences and Technology, Harvard Medical School / Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A. Riccardo Barbieri
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A. Valérie Ventura
[email protected] Robert E. Kass
[email protected] Department of Statistics, Carnegie Mellon University, Center for the Neural Basis of Cognition, Pittsburgh, PA 15213, U.S.A. Loren M. Frank
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A.

Measuring agreement between a statistical model and a spike train data series, that is, evaluating goodness of fit, is crucial for establishing the model's validity prior to using it to make inferences about a particular neural system. Assessing goodness of fit is a challenging problem for point process neural spike train models, especially for histogram-based models such as peristimulus time histograms (PSTH) and rate functions estimated by spike train smoothing. The time-rescaling theorem is a well-known result in probability theory, which states that any point process with an integrable conditional intensity function may be transformed into a Poisson process with unit rate. We describe how the theorem may be used to develop goodness-of-fit tests for both parametric and histogram-based point process models of neural spike trains. We apply these tests in two examples: a comparison of PSTH, inhomogeneous Poisson, and inhomogeneous Markov interval models of neural spike trains from the supplementary eye field of a macaque monkey and a comparison of temporal and spatial smoothers, inhomogeneous Poisson, inhomogeneous gamma, and inhomogeneous inverse gaussian models of rat hippocampal place cell spiking activity. To help make the logic behind the time-rescaling theorem more accessible to researchers in neuroscience, we present a proof using only elementary probability theory arguments. We also show how the theorem may be used to simulate a general point process model of a spike train. Our paradigm makes it possible to compare parametric and histogram-based neural spike train models directly. These results suggest that the time-rescaling theorem can be a valuable tool for neural spike train data analysis.

Neural Computation 14, 325–346 (2001)
© 2001 Massachusetts Institute of Technology
E. N. Brown, R. Barbieri, V. Ventura, R. E. Kass, and L. M. Frank
The development of statistical models that accurately describe the stochastic structure of neural spike trains is a growing area of quantitative research in neuroscience. Evaluating model goodness of fit, that is, measuring quantitatively the agreement between a proposed model and a spike train data series, is crucial for establishing the model's validity prior to using it to make inferences about the neural system being studied. Assessing goodness of fit for point process neural spike train models is a more challenging problem than for models of continuous-valued processes. This is because typical distance discrepancy measures applied in continuous data analyses, such as the average sum of squared deviations between recorded data values and estimated values from the model, cannot be directly computed for point process data. Goodness-of-fit assessments are even more challenging for histogram-based models, such as peristimulus time histograms (PSTH) and rate functions estimated by spike train smoothing, because the probability assumptions needed to evaluate model properties are often implicit. Berman (1983) and Ogata (1988) developed transformations that, under a given model, convert point processes like spike trains into continuous measures in order to assess model goodness of fit. One of the theoretical results used to construct these transformations is the time-rescaling theorem. A form of the time-rescaling theorem is well known in elementary probability theory. It states that any inhomogeneous Poisson process may be rescaled or transformed into a homogeneous Poisson process with a unit rate (Taylor & Karlin, 1994). The inverse transformation is a standard method for simulating an inhomogeneous Poisson process from a constant-rate (homogeneous) Poisson process. Meyer (1969) and Papangelou (1972) established the general time-rescaling theorem, which states that any point process with an integrable rate function may be rescaled into a Poisson process with a unit rate.
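As an illustration of the inverse transformation just mentioned, the following sketch simulates an inhomogeneous Poisson process on $(0, T]$ from unit-rate (exponential, mean 1) waiting times by inverting $\Lambda(t) = \int_0^t \lambda(u)\,du$ on a grid (a minimal numerical sketch, not code from this article; the grid step `dt` and the example rate are arbitrary choices):

```python
import numpy as np

def simulate_inhom_poisson(lam, T, rng, dt=1e-3):
    """Simulate an inhomogeneous Poisson process on (0, T] by inverse
    time-rescaling: unit-rate event times tau_k (cumulative Exp(1) draws)
    are mapped through Lambda^{-1}, where Lambda(t) = int_0^t lam(u) du."""
    grid = np.arange(0.0, T + dt, dt)
    # Riemann approximation of Lambda on the grid (Lambda[0] = 0).
    Lam = np.concatenate(([0.0], np.cumsum(lam(grid[:-1]) * dt)))
    spikes, tau = [], 0.0
    while True:
        tau += rng.exponential(1.0)       # next event of the unit-rate process
        if tau > Lam[-1]:
            break
        spikes.append(np.interp(tau, Lam, grid))  # Lambda^{-1}(tau)
    return np.array(spikes)
```

Mapping the simulated spike times back through $\Lambda$ recovers, by the theorem, independent unit-mean exponential intervals, which is the basis of the goodness-of-fit tests developed below.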
Berman and Ogata derived their transformations by applying the general form of the theorem. While the more general time-rescaling theorem is well known among researchers in point process theory (Brémaud, 1981; Jacobsen, 1982; Daley & Vere-Jones, 1988; Ogata, 1988; Karr, 1991), the
The Time-Rescaling Theorem
327
theorem is less familiar to neuroscience researchers. The technical nature of the proof, which relies on the martingale representation of a point process, may have prevented its significance from being more broadly appreciated. The time-rescaling theorem has important theoretical and practical implications for the application of point process models in neural spike train data analysis. To help make this result more accessible to researchers in neuroscience, we present a proof that uses only elementary probability theory arguments. We describe how the theorem may be used to develop goodness-of-fit tests for both parametric and histogram-based point process models of neural spike trains. We apply these tests in two examples: a comparison of PSTH, inhomogeneous Poisson, and inhomogeneous Markov interval models of neural spike trains from the supplementary eye field of a macaque monkey and a comparison of temporal and spatial smoothers, inhomogeneous Poisson, inhomogeneous gamma, and inhomogeneous inverse gaussian models of rat hippocampal place cell spiking activity. We also demonstrate how the time-rescaling theorem may be used to simulate a general point process.

2 Theory

2.1 The Conditional Intensity Function and the Joint Probability Density of the Spike Train. Define an interval $(0, T]$, and let $0 < u_1 < u_2 < \cdots < u_{n-1} < u_n \le T$ be a set of event (spike) times from a point process.
For $t \in (0, T]$, let $N(t)$ be the sample path of the associated counting process. The sample path is a right-continuous function that jumps 1 at the event times and is constant otherwise (Snyder & Miller, 1991). In this way, $N(t)$ counts the number and location of spikes in the interval $(0, t]$. Therefore, it contains all the information in the sequence of events or spike times. For $t \in (0, T]$, we define the conditional or stochastic intensity function as

$$\lambda(t \mid H_t) = \lim_{\Delta t \to 0} \frac{\Pr(N(t + \Delta t) - N(t) = 1 \mid H_t)}{\Delta t}, \tag{2.1}$$
where $H_t = \{0 < u_1 < u_2 < \cdots < u_{N(t)} < t\}$ is the history of the process up to time $t$ and $u_{N(t)}$ is the time of the last spike prior to $t$. If the point process is an inhomogeneous Poisson process, then $\lambda(t \mid H_t) = \lambda(t)$ is simply the Poisson rate function. Otherwise, it depends on the history of the process. Hence, the conditional intensity generalizes the definition of the Poisson rate. It is well known that $\lambda(t \mid H_t)$ can be defined in terms of the event (spike) time probability density, $f(t \mid H_t)$, as

$$\lambda(t \mid H_t) = \frac{f(t \mid H_t)}{1 - \int_{u_{N(t)}}^{t} f(u \mid H_t)\, du}, \tag{2.2}$$
for t > uN (t ) (Daley & Vere-Jones, 1988; Barbieri, Quirk, Frank, Wilson, & Brown, 2001). We may gain insight into equation 2.2 and the meaning of the
328
E. N. Brown, R. Barbieri, V. Ventura, R. E. Kass, and L. M. Frank
conditional intensity function by computing explicitly the probability that given $H_t$, a spike $u_k$ occurs in $[t, t + \Delta t)$, where $k = N(t) + 1$. To do this, we note that the events $\{N(t + \Delta t) - N(t) = 1 \mid H_t\}$ and $\{u_k \in [t, t + \Delta t) \mid H_t\}$ are equivalent and that therefore,

$$\begin{aligned}
\Pr(N(t + \Delta t) - N(t) = 1 \mid H_t) &= \Pr(u_k \in [t, t + \Delta t) \mid u_k > t, H_t) \\
&= \frac{\Pr(u_k \in [t, t + \Delta t) \mid H_t)}{\Pr(u_k > t \mid H_t)} \\
&= \frac{\int_t^{t + \Delta t} f(u \mid H_t)\, du}{1 - \int_{u_{N(t)}}^{t} f(u \mid H_t)\, du} \\
&\approx \frac{f(t \mid H_t)\, \Delta t}{1 - \int_{u_{N(t)}}^{t} f(u \mid H_t)\, du} \\
&= \lambda(t \mid H_t)\, \Delta t.
\end{aligned} \tag{2.3}$$

Dividing by $\Delta t$ and taking the limit gives

$$\lim_{\Delta t \to 0} \frac{\Pr(u_k \in [t, t + \Delta t) \mid H_t)}{\Delta t} = \frac{f(t \mid H_t)}{1 - \int_{u_{N(t)}}^{t} f(u \mid H_t)\, du} = \lambda(t \mid H_t), \tag{2.4}$$
which is equation 2.1. Therefore, $\lambda(t \mid H_t)\, \Delta t$ is the probability of a spike in $[t, t + \Delta t)$ when there is history dependence in the spike train. In survival analysis, the conditional intensity is termed the hazard function because in this case, $\lambda(t \mid H_t)\, \Delta t$ measures the probability of a failure or death in $[t, t + \Delta t)$ given that the process has survived up to time $t$ (Kalbfleisch & Prentice, 1980). Because we would like to apply the time-rescaling theorem to spike train data series, we require the joint probability density of exactly $n$ event times in $(0, T]$. This joint probability density is (Daley & Vere-Jones, 1988; Barbieri, Quirk, et al., 2001)

$$\begin{aligned}
f(u_1, u_2, \ldots, u_n \cap N(T) = n) &= f(u_1, u_2, \ldots, u_n \cap u_{n+1} > T) \\
&= f(u_1, u_2, \ldots, u_n \cap N(u_n) = n) \Pr(u_{n+1} > T \mid u_1, u_2, \ldots, u_n) \\
&= \prod_{k=1}^{n} \lambda(u_k \mid H_{u_k}) \exp\left\{-\int_{u_{k-1}}^{u_k} \lambda(u \mid H_u)\, du\right\} \\
&\quad \cdot \exp\left\{-\int_{u_n}^{T} \lambda(u \mid H_u)\, du\right\},
\end{aligned} \tag{2.5}$$
The Time-Rescaling Theorem
329
where

$$f(u_1, u_2, \ldots, u_n \cap N(u_n) = n) = \prod_{k=1}^{n} \lambda(u_k \mid H_{u_k}) \exp\left\{-\int_{u_{k-1}}^{u_k} \lambda(u \mid H_u)\, du\right\}, \tag{2.6}$$

$$\Pr(u_{n+1} > T \mid u_1, u_2, \ldots, u_n) = \exp\left\{-\int_{u_n}^{T} \lambda(u \mid H_u)\, du\right\}, \tag{2.7}$$
and $u_0 = 0$. Equation 2.6 is the joint probability density of exactly $n$ events in $(0, u_n]$, whereas equation 2.7 is the probability that the $n+1$st event occurs after $T$. The conditional intensity function provides a succinct way to represent the joint probability density of the spike times. We can now state and prove the time-rescaling theorem.

Time-Rescaling Theorem. Let $0 < u_1 < u_2 < \cdots < u_n < T$ be a realization from a point process with a conditional intensity function $\lambda(t \mid H_t)$ satisfying $0 < \lambda(t \mid H_t)$ for all $t \in (0, T]$. Define the transformation

$$\Lambda(u_k) = \int_0^{u_k} \lambda(u \mid H_u)\, du, \tag{2.8}$$

for $k = 1, \ldots, n$, and assume $\Lambda(t) < \infty$ with probability one for all $t \in (0, T]$. Then the $\Lambda(u_k)$'s are a Poisson process with unit rate.

Proof. Let $\tau_k = \Lambda(u_k) - \Lambda(u_{k-1})$ for $k = 1, \ldots, n$ and set $\tau_T = \int_{u_n}^{T} \lambda(u \mid H_u)\, du$. To establish the result, it suffices to show that the $\tau_k$'s are independent and identically distributed exponential random variables with mean one. Because the $\tau_k$ transformation is one-to-one and $\tau_{n+1} > \tau_T$ if and only if $u_{n+1} > T$, the joint probability density of the $\tau_k$'s is
$$f(\tau_1, \tau_2, \ldots, \tau_n \cap \tau_{n+1} > \tau_T) = f(\tau_1, \ldots, \tau_n) \Pr(\tau_{n+1} > \tau_T \mid \tau_1, \ldots, \tau_n). \tag{2.9}$$

We evaluate each of the two terms on the right side of equation 2.9. The following two events are equivalent:

$$\{\tau_{n+1} > \tau_T \mid \tau_1, \ldots, \tau_n\} = \{u_{n+1} > T \mid u_1, u_2, \ldots, u_n\}. \tag{2.10}$$

Hence

$$\begin{aligned}
\Pr(\tau_{n+1} > \tau_T \mid \tau_1, \tau_2, \ldots, \tau_n) &= \Pr(u_{n+1} > T \mid u_1, u_2, \ldots, u_n) \\
&= \exp\left\{-\int_{u_n}^{T} \lambda(u \mid H_{u_n})\, du\right\} \\
&= \exp\{-\tau_T\},
\end{aligned} \tag{2.11}$$
where the last equality follows from the definition of $\tau_T$. By the multivariate change-of-variable formula (Port, 1994),

$$f(\tau_1, \tau_2, \ldots, \tau_n) = |J| f(u_1, u_2, \ldots, u_n \cap N(u_n) = n), \tag{2.12}$$

where $J$ is the Jacobian of the transformation between $u_j$, $j = 1, \ldots, n$, and $\tau_k$, $k = 1, \ldots, n$. Because $\tau_k$ is a function of $u_1, \ldots, u_k$, $J$ is a lower triangular matrix, and its determinant is the product of its diagonal elements, defined as $|J| = |\prod_{k=1}^{n} J_{kk}|$. By assumption $0 < \lambda(t \mid H_t)$, and by equation 2.8 and the definition of $\tau_k$, the mapping of $u$ into $\tau$ is one-to-one. Therefore, by the inverse differentiation theorem (Protter & Morrey, 1991), the diagonal elements of $J$ are

$$J_{kk} = \frac{\partial u_k}{\partial \tau_k} = \lambda(u_k \mid H_{u_k})^{-1}. \tag{2.13}$$
Substituting $|J|$ and equation 2.6 into equation 2.12 yields

$$\begin{aligned}
f(\tau_1, \tau_2, \ldots, \tau_n) &= \prod_{k=1}^{n} \lambda(u_k \mid H_{u_k})^{-1} \cdot \prod_{k=1}^{n} \lambda(u_k \mid H_{u_k}) \exp\left\{-\int_{u_{k-1}}^{u_k} \lambda(u \mid H_u)\, du\right\} \\
&= \prod_{k=1}^{n} \exp\{-[\Lambda(u_k) - \Lambda(u_{k-1})]\} \\
&= \prod_{k=1}^{n} \exp\{-\tau_k\}.
\end{aligned} \tag{2.14}$$

Substituting equations 2.11 and 2.14 into 2.9 yields

$$\begin{aligned}
f(\tau_1, \tau_2, \ldots, \tau_n \cap \tau_{n+1} > \tau_T) &= f(\tau_1, \ldots, \tau_n) \Pr(\tau_{n+1} > \tau_T \mid \tau_1, \ldots, \tau_n) \\
&= \left(\prod_{k=1}^{n} \exp\{-\tau_k\}\right) \exp\{-\tau_T\},
\end{aligned} \tag{2.15}$$
which establishes the result. The time-rescaling theorem generates a history-dependent rescaling of the time axis that converts a point process into a Poisson process with a unit rate.

2.2 Assessing Model Goodness of Fit. We may use the time-rescaling theorem to construct goodness-of-fit tests for a spike data model. Once a
model has been fit to a spike train data series, we can compute from its estimated conditional intensity the rescaled times

$$\tau_k = \Lambda(u_k) - \Lambda(u_{k-1}). \tag{2.16}$$

If the model is correct, then, according to the theorem, the $\tau_k$'s are independent exponential random variables with mean 1. If we make the further transformation

$$z_k = 1 - \exp(-\tau_k), \tag{2.17}$$

then the $z_k$'s are independent uniform random variables on the interval $(0, 1)$. Because the transformations in equations 2.16 and 2.17 are both one-to-one, any statistical assessment that measures agreement between the $z_k$'s and a uniform distribution directly evaluates how well the original model agrees with the spike train data. Here we present two methods: Kolmogorov-Smirnov tests and quantile-quantile plots.

To construct the Kolmogorov-Smirnov test, we first order the $z_k$'s from smallest to largest, denoting the ordered values as $z_{(k)}$'s. We then plot the values of the cumulative distribution function of the uniform density, defined as $b_k = (k - \frac{1}{2})/n$ for $k = 1, \ldots, n$, against the $z_{(k)}$'s. If the model is correct, then the points should lie on a 45-degree line (Johnson & Kotz, 1970). Confidence bounds for the degree of agreement between the models and the data may be constructed using the distribution of the Kolmogorov-Smirnov statistic. For moderate to large sample sizes, the 95% (99%) confidence bounds are well approximated as $b_k \pm 1.36/n^{1/2}$ ($b_k \pm 1.63/n^{1/2}$) (Johnson & Kotz, 1970). We term such a plot a Kolmogorov-Smirnov (KS) plot.

Another approach to measuring agreement between the uniform probability density and the $z_k$'s is to construct a quantile-quantile (Q-Q) plot (Ventura, Carta, Kass, Gettner, & Olson, 2001; Barbieri, Quirk, et al., 2001; Hogg & Tanis, 2001). In this display, we plot the quantiles of the uniform distribution, denoted here also as the $b_k$'s, against the $z_{(k)}$'s. As in the case of the KS plots, exact agreement occurs between the point process model and the experimental data if the points lie on a 45-degree line. Pointwise confidence bands can be constructed to measure the magnitude of the departure of the plot from the 45-degree line relative to chance. To construct pointwise bands, we note that if the $\tau_k$'s are independent exponential random variables with mean 1 and the $z_k$'s are thus uniform on the interval $(0, 1)$, then each $z_{(k)}$ has a beta probability density with parameters $k$ and $n - k + 1$, defined as

$$f(z \mid k, n - k + 1) = \frac{n!}{(n - k)!\,(k - 1)!}\, z^{k-1} (1 - z)^{n-k}, \tag{2.18}$$

for $0 < z < 1$ (Johnson & Kotz, 1970). We set the 95% confidence bounds by finding the 2.5th and 97.5th quantiles of the cumulative distribution
associated with equation 2.18 for $k = 1, \ldots, n$. These exact quantiles are readily available in many statistical software packages. For moderate to large spike train data series, a reasonable approximation to the 95% (99%) confidence bounds is given by the gaussian approximation to the binomial probability distribution as $z_{(k)} \pm 1.96[z_{(k)}(1 - z_{(k)})/n]^{1/2}$ ($z_{(k)} \pm 2.575[z_{(k)}(1 - z_{(k)})/n]^{1/2}$). To our knowledge, these local confidence bounds for the Q-Q plots based on the beta distribution and the gaussian approximation are new.

In general, the KS confidence intervals will be wider than the corresponding Q-Q plot intervals. To see this, it suffices to compare the widths of the two intervals using their approximate formulas for large $n$. From the gaussian approximation to the binomial, the maximum width of the 95% confidence interval for the Q-Q plots occurs at the median, $z_{(k)} = 0.50$, and is $2[1.96/(4n)^{1/2}] = 1.96 n^{-1/2}$. For $n$ large, the width of the 95% confidence intervals for the KS plots is $2.72 n^{-1/2}$ at all quantiles. The KS confidence bounds consider the maximum discrepancy from the 45-degree line along all quantiles; the 95% bands show the discrepancy that would be exceeded 5% of the time by chance if the plotted data were truly uniformly distributed. The Q-Q plot confidence bounds consider the maximum discrepancy from the 45-degree line for each quantile separately. These pointwise 95% confidence bounds mark the amount by which each value $z_{(k)}$ would deviate from the true quantile 5% of the time purely by chance. The KS bounds are broad because they are based on the joint distribution of all $n$ deviations, and they consider the distribution of the largest of these deviations. The Q-Q plot bounds are narrower because they measure the deviation at each quantile separately. Used together, the two plots help approximate upper and lower limits on the discrepancy between a proposed model and a spike train data series.
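The construction in equations 2.16 through 2.18 can be sketched in a few lines of Python. This is a sketch, not the authors' code; NumPy is assumed, and the example rescaled times are drawn from a unit-mean exponential, i.e., a correctly specified model:

```python
import numpy as np

def ks_qq_points(taus):
    """From rescaled intervals tau_k (eq. 2.16), return the ordered uniforms
    z_(k) (eq. 2.17), the uniform quantiles b_k, and the half-widths of the
    95% KS band and the pointwise 95% Q-Q bands."""
    z = np.sort(1.0 - np.exp(-np.asarray(taus)))     # eq. 2.17, then order
    n = z.size
    b = (np.arange(1, n + 1) - 0.5) / n              # b_k = (k - 1/2)/n
    ks_halfwidth = 1.36 / np.sqrt(n)                 # KS band, moderate-to-large n
    qq_halfwidth = 1.96 * np.sqrt(z * (1 - z) / n)   # gaussian approx. to the beta
    return z, b, ks_halfwidth, qq_halfwidth

# A correctly specified model yields unit-mean exponential taus; the
# (b_k, z_(k)) points should then hug the 45-degree line.
rng = np.random.default_rng(0)
z, b, ks_w, qq_w = ks_qq_points(rng.exponential(1.0, size=500))
print("fraction inside 95% KS band:", np.mean(np.abs(z - b) < ks_w))
```

Plotting `z` against `b`, with the bands `b ± ks_w` and `z ± qq_w`, reproduces the KS and Q-Q displays described above.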
3 Applications

3.1 An Analysis of Supplementary Eye Field Recordings. For the first application of the time-rescaling theorem to a goodness-of-fit analysis, we analyze a spike train recorded from the supplementary eye field (SEF) of a macaque monkey. Neurons in the SEF play a role in oculomotor processes (Olson, Gettner, Ventura, Carta, & Kass, 2000). A standard paradigm for studying the spiking properties of these neurons is a delayed eye movement task. In this task, the monkey fixates, is shown locations of potential target sites, and is then cued to the specific target to which it must saccade. Next, a preparatory cue is given, followed a random time later by a go signal. Upon receiving the go signal, the animal must saccade to the specific target and hold fixation for a defined amount of time in order to receive a reward. Beginning from the point of the specific target cue, neural activity is recorded for a fixed interval of time beyond the presentation of the go signal. After a brief rest period, the trial is repeated. Multiple trials from an experiment
such as this are jointly analyzed using a PSTH to estimate firing rate for a finite interval following a fixed initiation point. That is, the trials are time-aligned with respect to a fixed initial point, such as the target cue. The data across trials are binned in time intervals of a fixed length, and the rate in each bin is estimated as the average number of spikes in the fixed time interval.

Kass and Ventura (2001) recently presented inhomogeneous Markov interval (IMI) models as an alternative to the PSTH for analyzing multiple-trial neural spike train data. These models use a Markov representation for the conditional intensity function. One form of the IMI conditional intensity function they considered is

$$\lambda(t \mid H_t) = \lambda(t \mid u_{N(t)}, \theta) = \lambda_1(t \mid \theta)\, \lambda_2(t - u_{N(t)} \mid \theta), \tag{3.1}$$

where $u_{N(t)}$ is the time of the last spike prior to $t$, $\lambda_1(t \mid \theta)$ modulates firing as a function of the experimental clock time, $\lambda_2(t - u_{N(t)} \mid \theta)$ represents Markov dependence in the spike train, and $\theta$ is a vector of model parameters to be estimated. Kass and Ventura modeled $\log \lambda_1(t \mid \theta)$ and $\log \lambda_2(t - u_{N(t)} \mid \theta)$ as separate piecewise cubic splines in their respective arguments $t$ and $t - u_{N(t)}$. The cubic pieces were joined at knots so that the resulting functions were twice continuously differentiable. The number and positions of the knots were chosen in a preliminary data analysis. In the special case $\lambda(t \mid u_{N(t)}, \theta) = \lambda_1(t \mid \theta)$, the conditional intensity function in equation 3.1 corresponds to an inhomogeneous Poisson (IP) model because this assumes no temporal dependence among the spike times.

Kass and Ventura used their models to analyze SEF data that consisted of 486 spikes recorded during the 400 msec following the target cue signal in 15 trials of a delayed eye movement task (neuron PK166a from Olson et al., 2000, using the pattern condition). The IP and IMI models were fit by maximum likelihood, and statistical significance tests on the spline coefficients were used to compare goodness of fit. Included in the analysis were spline models with higher-order dependence among the spike times than the first-order Markov dependence in equation 3.1. They found that the IMI model gave a statistically significant improvement in the fit relative to the IP and that adding higher-order dependence to the model gave no further improvements. The fit of the IMI model was not improved by including terms to model between-trial differences in spike rate. The authors concluded that there was strong first-order Markov dependence in the firing pattern of this SEF neuron. Kass and Ventura did not provide an overall assessment of model goodness of fit or evaluate how much the IMI and the IP models improved over the histogram-based rate model estimated by the PSTH.
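The factorization in equation 3.1 is easy to make concrete in code. The sketch below is not Kass and Ventura's spline model: the sinusoidal clock-time modulation standing in for $\lambda_1$, the exponential-recovery refractory factor standing in for $\lambda_2$, and all parameter values are hypothetical:

```python
import math

def lambda_imi(t, last_spike, base=20.0, mod=0.5, period=0.4, tau_r=0.01):
    """Toy IMI conditional intensity, lambda1(t) * lambda2(t - u_N(t)).
    lambda1: hypothetical sinusoidal clock-time modulation (spikes/s).
    lambda2: relative-refractory recovery with time constant tau_r (s)."""
    lam1 = base * (1.0 + mod * math.sin(2.0 * math.pi * t / period))
    lam2 = 1.0 - math.exp(-(t - last_spike) / tau_r) if t > last_spike else 0.0
    return lam1 * lam2

# Immediately after a spike the intensity is suppressed; well after a spike
# it recovers toward the clock-time rate lambda1(t).
print(lambda_imi(0.1001, 0.1))  # suppressed, shortly after a spike at t = 0.1
print(lambda_imi(0.2, 0.1))     # recovered, approximately lambda1(0.2)
```

Setting the second factor identically to 1 recovers the IP special case noted in the text.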
Using the KS and Q-Q plots derived from the time-rescaling theorem, it is possible to compare directly the fits of the IMI, IP, and PSTH models and to determine which gives the most accurate description of the SEF spike train structure. The equations for the IP rate function, $\lambda_{IP}(t \mid \theta_{IP}) = \lambda_1(t \mid \theta_{IP})$,
and for the IMI conditional intensity function, $\lambda_{IMI}(t \mid u_{N(t)}, \theta_{IMI}) = \lambda_1(t \mid \theta_{IMI})\, \lambda_2(t - u_{N(t)} \mid \theta_{IMI})$, are given in the appendix, along with a discussion of the maximum likelihood procedure used to estimate the coefficients of the spline basis elements. The estimated conditional intensity functions for the IP and IMI models are, respectively, $\lambda_{IP}(t \mid \hat\theta_{IP})$ and $\lambda_{IMI}(t \mid u_{N(t)}, \hat\theta_{IMI})$, where $\hat\theta$ is the maximum likelihood estimate of the specific spline coefficients. For the PSTH model, the conditional intensity estimate is the PSTH computed by averaging the number of spikes in each of 40 10-msec bins (the 400 msec following the target cue signal) across the 15 trials. The PSTH is the fit of another inhomogeneous Poisson model because it assumes no dependence among the spike times.

The results of the IMI, IP, and PSTH model fits compared by KS and Q-Q plots are shown in Figure 1. For the IP model, there is lack of fit at lower quantiles (below 0.25) because in that range, its KS plot lies just outside the 95% confidence bounds (see Figure 1A). From quantile 0.25 and beyond, the IP model is within the 95% confidence bounds, although beyond the quantile 0.75, it slightly underpredicts the true probability model of the data. The KS plot of the PSTH model is similar to that of the IP except that it lies entirely outside the 95% confidence bands below quantile 0.50. Beyond this quantile, it is within the 95% confidence bounds. The KS plot of the PSTH underpredicts the probability model of the data to a greater degree in the upper range than the IP model. The IMI model is completely within the 95% confidence bounds and lies almost exactly on the 45-degree line of complete agreement between the model and the data. The Q-Q plot analyses (see Figure 1B) agree with the KS plot analyses, with a few exceptions. The Q-Q plot analyses show that the lack of fit of the IP and PSTH models is greater at the lower quantiles (0-0.50) than suggested by the KS plots. The Q-Q plots for the IP and PSTH models also show that the deviations of these two models near quantiles 0.80 to 0.90 are statistically significant. With the exception of a small deviation below quantile 0.10, the Q-Q plot of the IMI lies almost exactly on the 45-degree line.

Figure 1: Facing page. (A) Kolmogorov-Smirnov (KS) plots of the inhomogeneous Markov interval (IMI) model, inhomogeneous Poisson (IP), and peristimulus time histogram (PSTH) model fits to the SEF spike train data. The solid 45-degree line represents exact agreement between the model and the data. The dashed 45-degree lines are the 95% confidence bounds for exact agreement between the model and experimental data based on the distribution of the Kolmogorov-Smirnov statistic. The 95% confidence bounds are $b_k \pm 1.36 n^{-1/2}$, where $b_k = (k - \frac{1}{2})/n$ for $k = 1, \ldots, n$ and $n$ is the total number of spikes. The IMI model (thick, solid line) is completely within the 95% confidence bounds and lies almost exactly on the 45-degree line. The IP model (thin, solid line) has lack of fit at lower quantiles ($< 0.25$). From the quantile 0.25 and beyond, the IP model is within the 95% confidence bounds.
The KS plot of the PSTH model (dotted line) is similar to that of the IP model except that it lies outside the 95% confidence bands below quantile 0.50. Beyond this quantile, it is within the 95% confidence bounds, yet it underpredicts the probability model of the data to a greater degree in this range than the IP model does. The IMI model agrees more closely with the spike train data than either the IP or the PSTH models. (B) Quantile-quantile (Q-Q) plots of the IMI (thick, solid line), IP (thin, solid line), and PSTH (dotted line) models. The dashed lines are the local 95% confidence bounds for the individual quantiles computed from the beta probability density defined in equation 2.18. The solid 45-degree line represents exact agreement between the model and the data. The Q-Q plots suggest that the lack of fit of the IP and PSTH models is greater at the lower quantiles (0-0.50) than suggested by the KS plots. The Q-Q plots for the IP and PSTH models also show that the deviations of these two models near quantiles 0.80 to 0.90 are statistically significant. With the exception of a small deviation below quantile 0.10, the Q-Q plot of the IMI lies almost exactly on the 45-degree line.
We conclude that the IMI model gives the best description of these SEF spike trains. In agreement with the report of Kass and Ventura (2001), this analysis supports a first-order Markov dependence among the spike times and not a Poisson structure, as would be suggested by either the IP or the PSTH models. This analysis extends the findings of Kass and Ventura by showing that of the three models, the PSTH gives the poorest description of the SEF spike train. The IP model gives a better fit to the SEF data than the PSTH model because the maximum likelihood analysis of the parametric IP model is more statistically efficient than the histogram (method of moments) estimate obtained from the PSTH (Casella & Berger, 1990). That is, the IP model fit by maximum likelihood uses all the data to estimate the conditional intensity function at all time points, whereas the PSTH analysis uses only spikes in a specified time bin to estimate the firing rate in that bin. The additional improvement of the IMI model over the IP is due to the fact that the former represents temporal dependence in the spike train.

3.2 An Analysis of Hippocampal Place Cell Recordings. As a second example of using the time-rescaling theorem to develop goodness-of-fit tests, we analyze the spiking activity of a pyramidal cell in the CA1 region of the rat hippocampus recorded from an animal running back and forth on a linear track. Hippocampal pyramidal neurons have place-specific firing (O'Keefe & Dostrovsky, 1971); a given neuron fires only when the animal is in a certain subregion of the environment termed the neuron's place field. On a linear track, these fields approximately resemble one-dimensional gaussian surfaces. The neuron's spiking activity correlates most closely with the animal's position on the track (Wilson & McNaughton, 1993). The data series we analyze consists of 691 spikes from a place cell in the CA1 region of the hippocampus recorded from a rat running back and forth for 20 minutes on a 300 cm U-shaped track.
The track was linearized for the purposes of this analysis (Frank, Brown, & Wilson, 2000). There are two approaches to estimating the place-specific firing maps of a hippocampal neuron. One approach is to use maximum likelihood to fit a specific parametric model of the spike times to the place cell data, as in Brown, Frank, Tang, Quirk, and Wilson (1998) and Barbieri, Quirk, et al. (2001). If $x(t)$ is the animal's position at time $t$, we define the spatial function for the one-dimensional place field model as the gaussian surface

$$s(t) = \exp\left\{\alpha - \frac{\beta (x(t) - \mu)^2}{2}\right\}, \tag{3.2}$$

where $\mu$ is the center of the place field, $\beta$ is a scale factor, and $\exp\{\alpha\}$ is the maximum height of the place field at its center. We represent the spike time probability density of the neuron as either an inhomogeneous gamma (IG)
model, defined as

$$f(u_k \mid u_{k-1}, \theta) = \frac{\psi s(u_k)}{\Gamma(\psi)} \left[\int_{u_{k-1}}^{u_k} \psi s(u)\, du\right]^{\psi - 1} \exp\left\{-\int_{u_{k-1}}^{u_k} \psi s(u)\, du\right\}, \tag{3.3}$$

or as an inhomogeneous inverse gaussian (IIG) model, defined as

$$f(u_k \mid u_{k-1}, \theta) = \frac{s(u_k)}{\left[2\pi \left(\int_{u_{k-1}}^{u_k} s(u)\, du\right)^3\right]^{1/2}} \exp\left\{-\frac{1}{2} \frac{\left(\int_{u_{k-1}}^{u_k} s(u)\, du - \psi\right)^2}{\psi^2 \int_{u_{k-1}}^{u_k} s(u)\, du}\right\}, \tag{3.4}$$

where $\psi > 0$ is a location parameter for both models and $\theta = (\mu, \alpha, \beta, \psi)$ is the set of model parameters to be estimated from the spike train. If we set $\psi = 1$ in equation 3.3, we obtain the IP model as a special case of the IG model. The parameters for all three models (the IP, IG, and the IIG) can be estimated from the spike train data by maximum likelihood (Barbieri, Quirk, et al., 2001). The models in equations 3.3 and 3.4 are Markov so that the current value of either the spike time probability density or the conditional intensity (rate) function depends on only the time of the previous spike. Because of equation 2.2, specifying the spike time probability density is equivalent to specifying the conditional intensity function. If we let $\hat\theta$ denote the maximum likelihood estimate of $\theta$, then the maximum likelihood estimate of the conditional intensity function for each model can be computed from equation 2.2 as

$$\lambda(t \mid H_t, \hat\theta) = \frac{f(t \mid u_{N(t)}, \hat\theta)}{1 - \int_{u_{N(t)}}^{t} f(u \mid u_{N(t)}, \hat\theta)\, du}, \tag{3.5}$$
for $t > u_{N(t)}$. The estimated conditional intensity from each model may be used in the time-rescaling theorem to assess model goodness of fit as described in Section 2.2. The second approach is to compute a histogram-based estimate of the conditional intensity function by using either spatial smoothing (Muller & Kubie, 1987; Frank et al., 2000) or temporal smoothing (Wood, Dudchenko, & Eichenbaum, 1999) of the spike train. To compute the spatial smoothing estimate of the conditional intensity function, we followed Frank et al. (2000) and divided the 300 cm track into 4.2 cm bins, counted the number of spikes per bin, and divided the count by the amount of time the animal spends in the bin. We smooth the binned firing rate with a six-point gaussian window with a standard deviation of one bin to reduce the effect of
running velocity. The spatial conditional intensity estimate is the smoothed spatial rate function. To compute the temporal rate function, we followed Wood et al. (1999) and divided the experiment into time bins of 200 msec and computed the rate as the number of spikes per 200 msec. These two smoothing procedures produce histogram-based estimates of $\lambda(t)$. Both are histogram-based estimates of Poisson rate functions because neither the estimated spatial nor the temporal rate functions make any history-dependence assumption about the spike train.

As in the analysis of the SEF spike trains, we again use the KS and Q-Q plots to compare directly the goodness of fit of the five spike train models for the hippocampal place cells. The IP, IG, and IIG models were fit to the spike train data by maximum likelihood. The spatial and temporal rate models were computed as described. The KS and Q-Q plot goodness-of-fit comparisons are in Figure 2. The IG model overestimates at lower quantiles, underestimates at intermediate quantiles, and overestimates at the upper quantiles (see Figure 2A).
The IP model underestimates the lower and intermediate quantiles and overestimates the upper quantiles. The KS plot of the spatial rate model is similar to that of the IP model yet closer to the 45-degree line. The temporal rate model overestimates the quantiles of the true probability model of the data. This analysis suggests that the IG, IP, and spatial rate models are most likely oversmoothing this spike train, whereas the temporal rate model undersmooths it. Of the five models, the one that is closest to the 45-degree line and lies almost entirely within the confidence bounds is the IIG. This model disagrees only with the experimental data near quantile 0.80. Because all of the models with the exception of the IIG have appreciable lack of fit in terms of the KS plots, the findings in the Q-Q plot analyses are almost identical (see Figure 2B). As in the KS plot, the Q-Q plot for the IIG model is close to the 45-degree line and within the 95% confidence bounds, with the exception of quantiles near 0.80. These analyses suggest that the IIG rate model gives the best agreement with the spiking activity of this pyramidal neuron. The spatial and temporal rate function models and the IP model have the greatest lack of fit.

3.3 Simulating a General Point Process by Time Rescaling. As a second application of the time-rescaling theorem, we describe how the theorem may be used to simulate a general point process. We stated in the introduction that the time-rescaling theorem provides a standard approach for simulating an inhomogeneous Poisson process from a simple Poisson

Figure 2: Facing page. (A) KS plots of the parametric and histogram-based model fits to the hippocampal place cell spike train activity. The parametric models are the inhomogeneous Poisson (IP) (dotted line), the inhomogeneous gamma (IG) (thin, solid line), and the inhomogeneous inverse gaussian (IIG) (thick, solid line).
The histogram-based models are the spatial rate model (dashed line) based on a 4.2 cm spatial bin width and the temporal rate model (dashed and dotted line) based on a 200 msec time bin width. The 45-degree solid line represents exact agreement between the model and the experimental data, and the 45-degree thin, solid lines are the 95% confidence bounds based on the KS statistic, as in Figure 1. The IIG model KS plot lies almost entirely within the confidence bounds, whereas all the other models show significant lack of fit. This suggests that the IIG model gives the best description of this place cell data series, and the histogram-based models agree least with the data. (B) Q-Q plots of the IP (dotted line), IG (thin, solid line), IIG (thick, solid line), spatial (dashed line), and temporal (dotted and dashed line) models. The 95% local confidence bounds (thin dashed lines) are computed as described in Figure 1. The findings in the Q-Q plot analyses are almost identical to those in the KS plots. The Q-Q plot for the IIG model is close to the 45-degree line and within the 95% confidence bounds, with the exception of quantiles near 0.80. This analysis also suggests that the IIG rate model gives the best agreement with the spiking activity of this pyramidal neuron. The spatial and temporal rate function models and the IP model have the greatest lack of fit.
process. The general form of the time-rescaling theorem suggests that any point process with an integrable conditional intensity function may be simulated from a Poisson process with unit rate by rescaling time with respect to the conditional intensity (rate) function. Given an interval $(0, T]$, the simulation algorithm proceeds as follows:

1. Set $u_0 = 0$; set $k = 1$.
2. Draw $\tau_k$, an exponential random variable with mean 1.
3. Find $u_k$ as the solution to $\tau_k = \int_{u_{k-1}}^{u_k} \lambda(u \mid u_0, u_1, \ldots, u_{k-1})\, du$.
4. If $u_k > T$, then stop.
5. $k = k + 1$.
6. Go to 2.
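The steps above can be sketched directly in Python. The intensity used here, $\lambda(u) = 10(1 + \sin u)$, is an illustrative inhomogeneous Poisson choice (so the history argument is ignored), and step 3 is solved by bisection on a trapezoid-rule integral rather than analytically:

```python
import math
import random

def simulate_by_time_rescaling(lam, T, tol=1e-6):
    """Steps 1-6 above: draw unit-mean exponentials (step 2) and invert the
    integrated conditional intensity (step 3) to recover each spike time.
    `lam(u, history)` is the conditional intensity function."""
    def integral(a, b, history, n=100):   # trapezoid rule for the step-3 integral
        h = (b - a) / n
        s = 0.5 * (lam(a, history) + lam(b, history))
        s += sum(lam(a + i * h, history) for i in range(1, n))
        return s * h

    spikes, u_prev = [], 0.0              # step 1
    while True:
        tau = random.expovariate(1.0)     # step 2
        if integral(u_prev, T, spikes) < tau:
            return spikes                 # step 4: next spike would fall past T
        lo, hi = u_prev, T                # step 3: solve for u_k by bisection
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if integral(u_prev, mid, spikes) < tau:
                lo = mid
            else:
                hi = mid
        u_prev = 0.5 * (lo + hi)
        spikes.append(u_prev)             # steps 5-6: increment k and repeat

random.seed(1)
spikes = simulate_by_time_rescaling(lambda u, hist: 10.0 * (1.0 + math.sin(u)), T=5.0)
print(len(spikes), "spikes simulated in (0, 5]")
```

Because bisection only requires that the integrated intensity be increasing, the same root-find applies unchanged when `lam` actually depends on the spike history.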
By using equation 2.3, a discrete version of the algorithm can be constructed as follows. Choose $J$ large, and divide the interval $(0, T]$ into $J$ bins, each of width $\Delta = T/J$. For $k = 1, \ldots, J$, draw a Bernoulli random variable $u_k^*$ with probability $\lambda(k\Delta \mid u_1^*, \ldots, u_{k-1}^*)\, \Delta$, and assign a spike to bin $k$ if $u_k^* = 1$ and no spike if $u_k^* = 0$.

While in many instances there will be faster, more computationally efficient algorithms for simulating a point process, such as model-based methods for specific renewal processes (Ripley, 1987) and thinning algorithms (Lewis & Shedler, 1978; Ogata, 1981; Ross, 1993), the algorithm above is simple to implement given a specification of the conditional intensity function. For an example of where this algorithm is crucial for point process simulation, we consider the IG model in equation 3.3. Its conditional intensity function is infinite immediately following a spike if $\psi < 1$. If, in addition, $\psi$ is time varying ($\psi = \psi(t) < 1$ for all $t$), then neither thinning nor standard algorithms for making draws from a gamma probability distribution may be used to simulate data from this model. The thinning algorithm fails because the conditional intensity function is not bounded, and the standard algorithms for simulating a gamma model cannot be applied because $\psi$ is time varying. In this case, the time-rescaling simulation algorithm may be applied as long as the conditional intensity function remains integrable as $\psi$ varies temporally.

4 Discussion
Measuring how well a point process model describes a neural spike train data series is imperative prior to using the model for making inferences. The time-rescaling theorem states that any point process with an integrable conditional intensity function may be transformed into a Poisson process with a unit rate. Berman (1983) and Ogata (1988) showed that this theorem may be used to develop goodness-of-fit tests for point process models of seismologic data. Goodness-of-fit methods for neural spike train models based on
time-rescaling transformations but not the time-rescaling theorem have also been reported (Reich, Victor, & Knight, 1998; Barbieri, Frank, Quirk, Wilson, & Brown, 2001). Here, we have described how the time-rescaling theorem may be used to develop goodness-of-fit tests for point process models of neural spike trains. To illustrate our approach, we analyzed two types of commonly recorded spike train data. The SEF data are a set of multiple short (400 msec) series of spike times, each measured under identical experimental conditions. These data are typically analyzed by a PSTH. The hippocampus data are a long (20 minutes) series of spike time recordings that are typically analyzed with either spatial or temporal histogram models. To each type of data we fit both parametric and histogram-based models. Histogram-based models are popular neural data analysis tools because of the ease with which they can be computed and interpreted. These apparent advantages do not override the need to evaluate the goodness of fit of these models. We previously used the time-rescaling theorem to assess goodness of fit for parametric spike train models (Olson et al., 2000; Barbieri, Quirk, et al., 2001; Ventura et al., 2001). Our main result in this article is that the time-rescaling theorem can be used to evaluate goodness of fit of parametric and histogram-based models and to compare directly the accuracy of models from the two classes. We recommend that before making an inference based on either type of model, a goodness-of-fit analysis should be performed to establish how well the model describes the spike train data. If the model and data agree closely, then the inference is more credible than when there is significant lack of fit. The KS and Q-Q plots provide assessments of overall goodness of fit.
342
E. N. Brown, R. Barbieri, V. Ventura, R. E. Kass, and L. M. Frank

For the models fit by maximum likelihood, these assessments can be applied along with methods that measure the marginal value and marginal costs of using more complex models, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), in order to gain a more complete evaluation of model agreement with experimental data (Barbieri, Quirk, et al., 2001). We assessed goodness of fit by using the time-rescaling theorem to construct KS and Q-Q plots having, respectively, liberal and conservative confidence bounds. Together, the two sets of confidence bounds help characterize the range of agreement between the model and the data. For example, a model whose KS plot lies consistently outside the 95% KS confidence bounds (the IP model for the hippocampal data) agrees poorly with the data. On the other hand, a model that is within all the 95% confidence bounds of the Q-Q plots (the IMI model for the SEF data) agrees closely with the data. A model such as the IP model for the SEF data, that is, within nearly all the KS bounds, may lie outside the Q-Q plot intervals. In this case, if the lack of fit with respect to the Q-Q plot intervals is systematic (i.e., is over a set of contiguous quantiles), this suggests that the model does not fit the data well. As a second application of the time-rescaling theorem, we presented an algorithm for simulating spike trains from a point process given its conditional intensity (rate) function. This algorithm generalizes the well-known technique of simulating an inhomogeneous Poisson process by rescaling a Poisson process with a constant rate. Finally, to make the reasoning behind the time-rescaling theorem more accessible to neuroscience researchers, we proved its general form using elementary probability arguments. While this elementary proof is most certainly apparent to experts in probability and point process theory (D. Brillinger, personal communication; Guttorp, 1995), its details, to our knowledge, have not been previously presented. The original proofs of this theorem use measure theory and are based on the martingale representation of point processes (Meyer, 1969; Papangelou, 1972; Brémaud, 1981; Jacobsen, 1982). The conditional intensity function (see equation 2.1) is defined in terms of the martingale representation. Our proof uses elementary arguments because it is based on the fact that the joint probability density of a set of point process observations (a spike train) has a canonical representation in terms of the conditional intensity function. When the joint probability density is represented in this way, the Jacobian in the change of variables between the original spike times and the rescaled interspike intervals simplifies to a product of the reciprocals of the conditional intensity functions evaluated at the spike times. The proof also highlights the significance of the conditional intensity function in spike train modeling; its specification completely defines the stochastic structure of the point process. This is because in a small time interval, the product of the conditional intensity function and the time interval defines the probability of a spike in that interval given the history of the spike train up to that time (see equation 2.3). When there is no history dependence, the conditional intensity function is simply the Poisson rate function.
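The simulation algorithm just described can be sketched by discretizing the running integral of the conditional intensity and emitting a spike whenever it exceeds a unit-rate exponential waiting time. This is our own simplified rendering (the names and the fixed-step accumulation are ours), not the article's algorithm verbatim:

```python
import random

def simulate_point_process(intensity, t_end, dt=1e-3, seed=1):
    """Simulate spike times from a conditional intensity function by
    time-rescaling: accumulate the integral of lambda(t) until it crosses
    a unit-rate exponential waiting time, then record a spike."""
    rng = random.Random(seed)
    spikes, t = [], 0.0
    acc, target = 0.0, rng.expovariate(1.0)
    while t < t_end:
        # the intensity may depend on time and on the spiking history
        acc += intensity(t, spikes) * dt
        if acc >= target:
            spikes.append(t)
            acc, target = 0.0, rng.expovariate(1.0)
        t += dt
    return spikes

# inhomogeneous Poisson example: the rate ramps from 0 to 40 Hz over 10 s,
# so the expected spike count is the integral of 4t over (0, 10), about 200
spikes = simulate_point_process(lambda t, history: 4.0 * t, 10.0)
```

A history-dependent model simply reads `history` (e.g., returning 0 for a few milliseconds after the last spike to impose refractoriness); with a constant intensity, the routine reduces to rescaling a unit-rate Poisson process.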
An important consequence of this simplification is that unless history dependence is specifically included, then histogram-based models, such as the PSTH, and the spatial and temporal smoothers are implicit Poisson models. In both the SEF and hippocampus examples, the histogram-based models gave poor fits to the spike train. These poor fits arose because these models used few data points to estimate many parameters and because they do not model history dependence in the spike train. Our parametric models used fewer parameters and represented temporal dependence explicitly as Markov. Point process models with higher-order temporal dependence have been studied by Ogata (1981, 1988), Brillinger (1988), and Kass and Ventura (2001) and will be considered further in our future work. Parametric conditional intensity functions may be estimated from neural spike train data in any experiments where there are enough data to estimate reliably a histogram-based model. This is because if there are enough data to estimate many parameters using an inefficient procedure (histogram / method of moments), then there should be enough data to estimate a smaller number of parameters using an efficient one (maximum likelihood). Using the KS and Q-Q plots derived from the time-rescaling theorem, it is possible to devise a systematic approach to the use of histogram-based
models. That is, it is possible to determine when a histogram-based model accurately describes a spike train and when a different model class, temporal dependence, or the effects of covariates (e.g., the theta rhythm and the rat running velocity in the case of the place cells) should be considered and the degree of improvement the alternative models provide. In summary, we have illustrated how the time-rescaling theorem may be used to compare directly goodness of fit of parametric and histogram-based point process models of neural spiking activity and to simulate spike train data. These results suggest that the time-rescaling theorem can be a valuable tool for neural spike train data analysis.

Appendix

A.1 Maximum Likelihood Estimation of the IMI Model. To fit the IMI model, we choose J large and divide the interval (0, T] into J bins of width \Delta = T/J. We choose J so that there is at most one spike in any bin. In this way, we convert the spike times 0 < u_1 < u_2 < \cdots < u_{n-1} < u_n \le T into a binary sequence u_j^*, where u_j^* = 1 if there is a spike in bin j and 0 otherwise, for j = 1, \ldots, J. By the definition of the conditional intensity function for the IMI model in equation 3.1, it follows that each u_j^* is a Bernoulli random variable with the probability of a spike at j\Delta defined as \lambda_1(j\Delta \mid \theta)\,\lambda_2(j\Delta - u^*_{N(j\Delta)} \mid \theta)\,\Delta. We note that this discretization is identical to the one used to construct the discretized version of the simulation algorithm in Section 3.3. The log probability of a spike is thus

\log \lambda_1(j\Delta \mid \theta) + \log\bigl(\lambda_2(j\Delta - u^*_{N(j\Delta)} \mid \theta)\,\Delta\bigr),   (A.1)

and the cubic spline models for \log \lambda_1(j\Delta \mid \theta) and \log \lambda_2(j\Delta - u^*_{N(j\Delta)} \mid \theta) are, respectively,

\log \lambda_1(j\Delta \mid \theta) = \sum_{l=1}^{3} \theta_l \,(j\Delta - \xi_1)_+^l + \theta_4 \,(j\Delta - \xi_2)_+^3 + \theta_5 \,(j\Delta - \xi_3)_+^3,   (A.2)

\log \lambda_2(j\Delta - u^*_{N(j\Delta)} \mid \theta) = \sum_{l=1}^{3} \theta_{l+5} \,(j\Delta - u^*_{N(j\Delta)} - \gamma_1)_+^l + \theta_9 \,(j\Delta - u^*_{N(j\Delta)} - \gamma_2)_+^3,   (A.3)

where \theta = (\theta_1, \ldots, \theta_9), the knots are defined as \xi_l = lT/4 for l = 1, 2, 3, and \gamma_l is the observed 100l/3 quantile of the interspike interval distribution of the trial spike trains for l = 1, 2. We have for Section 3.1 that \theta_{IP} = (\theta_1, \ldots, \theta_5)
and \theta_{IMI} = (\theta_1, \ldots, \theta_9) are the coefficients of the spline basis elements for the IP and IMI models, respectively. The parameters are estimated by maximum likelihood with \Delta = 1 msec using the gam function in S-PLUS (MathSoft, Seattle) as described in Chapter 11 of Venables and Ripley (1999). The inputs to gam are the u_j^*'s, the time arguments j\Delta and j\Delta - u^*_{N(j\Delta)}, the symbolic specification of the spline models, and the knots. Further discussion on spline-based regression methods for analyzing neural data may be found in Olson et al. (2000) and Ventura et al. (2001).

Acknowledgments
We are grateful to Carl Olson for allowing us to use the supplementary eye field data and Matthew Wilson for allowing us to use the hippocampal place cell data. We thank David Brillinger and Victor Solo for helpful discussions and the two anonymous referees whose suggestions helped improve the exposition. This research was supported in part by NIH grants CA54652, MH59733, and MH61637 and NSF grants DMS 9803433 and IBN 0081548.

References

Barbieri, R., Frank, L. M., Quirk, M. C., Wilson, M. A., & Brown, E. N. (2001). Diagnostic methods for statistical models of place cell spiking activity. Neurocomputing, 38–40, 1087–1093.
Barbieri, R., Quirk, M. C., Frank, L. M., Wilson, M. A., & Brown, E. N. (2001). Construction and analysis of non-Poisson stimulus-response models of neural spike train activity. J. Neurosci. Meth., 105, 25–37.
Berman, M. (1983). Comment on "Likelihood analysis of point processes and its applications to seismological data" by Ogata. Bulletin Internatl. Stat. Instit., 50, 412–418.
Brémaud, P. (1981). Point processes and queues: Martingale dynamics. Berlin: Springer-Verlag.
Brillinger, D. R. (1988). Maximum likelihood analysis of spike trains of interacting nerve cells. Biol. Cyber., 59, 189–200.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18, 7411–7425.
Casella, G., & Berger, R. L. (1990). Statistical inference. Belmont, CA: Duxbury.
Daley, D., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer-Verlag.
Frank, L. M., Brown, E. N., & Wilson, M. A. (2000). Trajectory encoding in the hippocampus and entorhinal cortex. Neuron, 27, 169–178.
Guttorp, P. (1995). Stochastic modeling of scientific data. London: Chapman & Hall.
Hogg, R. V., & Tanis, E. A. (2001). Probability and statistical inference (6th ed.). Englewood Cliffs, NJ: Prentice Hall.
Jacobsen, M. (1982). Statistical analysis of counting processes. New York: Springer-Verlag.
Johnson, A., & Kotz, S. (1970). Distributions in statistics: Continuous univariate distributions—2. New York: Wiley.
Kalbfleisch, J., & Prentice, R. (1980). The statistical analysis of failure time data. New York: Wiley.
Karr, A. (1991). Point processes and their statistical inference (2nd ed.). New York: Marcel Dekker.
Kass, R. E., & Ventura, V. (2001). A spike train probability model. Neural Comput., 13, 1713–1720.
Lewis, P. A. W., & Shedler, G. S. (1978). Simulation of non-homogeneous Poisson processes by thinning. Naval Res. Logistics Quart., 26, 403–413.
Meyer, P. (1969). Démonstration simplifiée d'un théorème de Knight. In Séminaire de probabilités V (pp. 191–195). New York: Springer-Verlag.
Muller, R. U., & Kubie, J. L. (1987). The effects of changes in the environment on the spatial firing of hippocampal complex-spike cells. Journal of Neuroscience, 7, 1951–1968.
Ogata, Y. (1981). On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, IT-27, 23–31.
Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association, 83, 9–27.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Res., 34, 171–175.
Olson, C. R., Gettner, S. N., Ventura, V., Carta, R., & Kass, R. E. (2000). Neuronal activity in macaque supplementary eye field during planning of saccades in response to pattern and spatial cues. Journal of Neurophysiology, 84, 1369–1384.
Papangelou, F. (1972). Integrability of expected increments of point processes and a related random change of scale. Trans. Amer. Math. Soc., 165, 483–506.
Port, S. (1994). Theoretical probability for applications.
New York: Wiley.
Protter, M. H., & Morrey, C. B. (1991). A first course in real analysis (2nd ed.). New York: Springer-Verlag.
Reich, D., Victor, J., & Knight, B. (1998). The power ratio and the interval map: Spiking models and extracellular recordings. Journal of Neuroscience, 18, 10090–10104.
Ripley, B. (1987). Stochastic simulation. New York: Wiley.
Ross, S. (1993). Introduction to probability models (5th ed.). San Diego, CA: Academic Press.
Snyder, D., & Miller, M. (1991). Random point processes in time and space (2nd ed.). New York: Springer-Verlag.
Taylor, H. M., & Karlin, S. (1994). An introduction to stochastic modeling (rev. ed.). San Diego, CA: Academic Press.
Venables, W., & Ripley, B. (1999). Modern applied statistics with S-PLUS (3rd ed.). New York: Springer-Verlag.
Ventura, V., Carta, R., Kass, R., Gettner, S., & Olson, C. (2001). Statistical analysis of temporal evolution in single-neuron firing rate. Biostatistics. In press.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Wood, E. R., Dudchenko, P. A., & Eichenbaum, H. (1999). The global record of memory in hippocampal neuronal activity. Nature, 397, 613–616.
Received October 27, 2000; accepted May 2, 2001.
LETTER
Communicated by Carson Chow
The Impact of Spike Timing Variability on the Signal-Encoding Performance of Neural Spiking Models Amit Manwani
[email protected] Peter N. Steinmetz
[email protected] Christof Koch
[email protected] Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125, U.S.A. It remains unclear whether the variability of neuronal spike trains in vivo arises due to biological noise sources or represents highly precise encoding of temporally varying synaptic input signals. Determining the variability of spike timing can provide fundamental insights into the nature of strategies used in the brain to represent and transmit information in the form of discrete spike trains. In this study, we employ a signal estimation paradigm to determine how variability in spike timing affects encoding of random time-varying signals. We assess this for two types of spiking models: an integrate-and-fire model with random threshold and a more biophysically realistic stochastic ion channel model. Using the coding fraction and mutual information as information-theoretic measures, we quantify the efficacy of optimal linear decoding of random inputs from the model outputs and study the relationship between efficacy and variability in the output spike train. Our findings suggest that variability does not necessarily hinder signal decoding for the biophysically plausible encoders examined and that the functional role of spiking variability depends intimately on the nature of the encoder and the signal processing task; variability can either enhance or impede decoding performance. 1 Introduction
Deciphering the neural code remains an essential and yet elusive key to understanding how brains work. Unraveling the nature of representation of information in the brain requires an understanding of the biophysical constraints that limit the temporal precision of neural spike trains, the dominant mode of communication in the brain (Mainen and Sejnowski, 1995; van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997). The representation used by the nervous system depends on the precision with which neurons

Neural Computation 14, 347–367 (2001)
© 2001 Massachusetts Institute of Technology
348
A. Manwani, P. N. Steinmetz, and C. Koch
respond to their synaptic inputs (Theunissen & Miller, 1995), which in turn is influenced by noise present at the single cell level (Koch, 1999). Neuronal hardware inherently behaves in a probabilistic manner, and thus the encoding of information in the form of spike trains is noisy and may result in irregular timing of individual action potentials in response to identical inputs (Schneidman, Freedman, & Segev, 1998). As in other physical systems, noise has a direct bearing on how information is represented, transmitted, and decoded in biological information processing systems (Cecchi et al., 2000), and a quantitative understanding of neuronal noise sources and their effect on the variability of spike timing reveals the constraints under which neuronal codes must operate. The variability of spike timing observed in vivo can arise due to a variety of factors. One possibility is that it originates at the level of the single neuron, due to either a delicate balance of the excitatory and inhibitory synaptic inputs it receives (Shadlen & Newsome, 1998) or sources of biological noise intrinsic to it (Schneidman et al., 1998). The other possibility is that variability is an emergent property of large, recurrent networks of spiking neurons connected in a certain fashion, representing faithful encoding of nonlinear or chaotic network dynamics (van Vreeswijk & Sompolinsky, 1996, 1998). Such faithful encoding argues in favor of the hypothesis that single neurons are capable of very precise signaling; high spike timing reliability is intuitively appealing since it provides a substrate for efficient temporal coding (Abeles, 1990; Bialek, Rieke, van Steveninck, & Warland, 1991; Softky & Koch, 1993). In this study, we eschew the debate regarding the origin of variability and instead assess the functional role of spike timing variability in a specific instance of a neural coding problem: signal estimation.
We quantify the ability of two types of neural spiking models, integrate-and-fire models and stochastic ion channel models, to encode information about their random time-varying inputs. The goal in signal estimation is to estimate a random time-varying current injected into a spike-encoding model from the corresponding spike train output. The efficacy of encoding is estimated by the ability to reconstruct the inputs from the output spike trains using optimal least-mean-square estimation; we use information-theoretic measures to quantify the fraction of variability in the output spike train that conveys information about the input. We have previously reported the coding fraction for the Hodgkin-Huxley dynamics for a limited number of input bandwidths (Steinmetz, Manwani, & Koch, in press); here we report the results for more biophysically realistic encoders and for a more complete range of model parameters. 2 Methods
In the following, we consider spiking models that transform continuous, time-varying input signals into sequences of action potentials or spikes
Signal Encoding in Noisy Spiking Models
349
Figure 1: Noisy models of spike timing variability. (A) For an adapting integrate-and-fire model with random threshold, the time-varying input current m(t) is integrated by a combination of the passive membrane resistance and capacitance (the RC circuit) to give rise to the membrane voltage V_m. When V_m exceeds a threshold V_th drawn from a random distribution p(V_th), a spike is generated and the integrator is reset for a duration equal to the refractory period t_ref. The output spike train of the model in response to the input is represented as a point process s(t). p(V_th) is modeled as an nth-order gamma distribution, where n determines the variability in spike timing (the inset shows gamma distributions for n = 1, 2, and 10). Each spike increases the amplitude of a conductance g_adapt by an amount G_inc. g_adapt corresponds to a calcium-dependent potassium conductance responsible for firing-rate adaptation and decays exponentially to zero between spikes with a time constant τ_adapt. (B) A time-varying current input m(t) is injected into a membrane patch containing stochastic voltage-gated ion channels, which are capable of generating action potentials in response to adequately strong current inputs. When the membrane voltage exceeds an arbitrarily chosen reference value above resting potential (+10 mV, in this case), a spike is recorded in the output spike train s(t). Parameters correspond to the kinetic model for regular-spiking cortical neurons derived by Golomb and Amitai (1997). (C, D) Sample traces of the input m(t), the membrane voltage V_m(t), and the spike train s(t) for the models in A and B, respectively.
(shown in Figure 1). The spike train output of a model in response to the injection of an input current i(t) is denoted by s(t), which is assumed to be a point process and is mathematically modeled as a sequence of delta
functions at time instants \{t_i\}:

s(t) = \sum_i \delta(t - t_i).
The models we consider generate irregular spike trains in response to repeated presentations of the same input and can be regarded as representations of irregular spiking behavior in real biological neurons. We use a specific signal processing task (signal estimation) to study the effects of spike timing variability on the encoding of time-varying input modulations by these models. 2.1 Measurement of Spike Timing Variability. Spike timing variability has previously been examined using two methods. The first, introduced by Mainen and Sejnowski (1995), measures the precision and reliability of spike times generated by one neuron in response to repeated presentations of the same input current. The second method, measurement of the coefficient of variation (CV) of the interspike interval distribution, examines variability of firing when a neuron is responding to natural inputs. Thus, these two measures represent two different paradigms of neuronal variability. CV is the measure used here, since determining the efficacy of information transfer during signal encoding requires the presentation of a large group of randomly selected stimuli. Although there is no general relationship between precision and reliability as measured by Mainen and Sejnowski (1995) for the encoding models examined in this study, the two measures roughly correspond, as shown in Figure 2C. We will return to this point in the Discussion. 2.2 Measurement of Coding Efficiency. In order to compute coding efficiency, we construct the optimal linear estimator of the input current i(t). The difference, m(t), from the mean current is a zero-mean random time-varying input signal that is encoded in the form of a spike train s(t). For the purposes of this article, we assume that the current i(t) injected into the model has the form i(t) = I + m(t), where I is the constant component and m(t) is the fluctuating component of the injected current.
We assume that m(t) and s(t) are (real-valued) jointly weak-sense stationary (WSS) processes with finite variances, ⟨m²(t)⟩ = σ_m² < ∞ and ⟨|s(t) − λ|²⟩ < ∞, where λ = ⟨s(t)⟩ is the mean firing rate of the neuron. The operator ⟨·⟩ denotes an average over the joint input and spike train ensemble. The objective in signal estimation is to reconstruct the input m(t) from the spike train s(t) such that the mean square error (MSE) between m(t) and its estimate is minimized. In general, the optimal MSE estimator is mathematically intractable to derive, so we shall restrict ourselves to the optimal linear estimator of the input (denoted by m̂(t)), which can be written as
\hat{m}(t) = (g * s)(t).   (2.1)
Figure 2: Block diagram of the signal estimation paradigm. (A) A noisy spike-encoding mechanism transforms a random time-varying input m(t) drawn from a probability distribution into a spike train s(t). Techniques from statistical estimation theory are used to derive the optimal linear estimate, m̂(t), of the input m(t) from the spike train s(t). m(t) is a gaussian, band-limited, wide-sense stationary (WSS) stochastic process with a power spectrum S_mm(f) that is flat over a bandwidth B_m and whose standard deviation is denoted by σ_m. (B) Variability of the spike train is characterized by CV, the ratio of the standard deviation σ_T of the interspike intervals (T_i = t_{i+1} − t_i) to the mean interspike interval μ_T. The estimation performance is characterized by the coding fraction ξ = 1 − E/σ_m², where E is the mean-square error between the time-varying input m(t) and its optimal linear reconstruction m̂(t) from the spike train s(t). (C) Correspondence between CV and other measures of spike irregularity. Using the procedure described in Mainen and Sejnowski (1995), reliability and precision are estimated from responses of the model to repeated presentations of the same input. The spike sequences are used to obtain the poststimulus time histogram (PSTH) shown in the lowest trace. Instances when the PSTH exceeds a chosen threshold (dotted line) are termed events. Reliability is defined as the fraction of spikes occurring during these events, and precision is defined as the mean length of the events. The inverse relationship between reliability and CV (as the input bandwidth is varied) validates our use of CV as a representative measure of spike variability. A similar relationship exists between precision and CV (data not shown).
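The CV statistic of Figure 2B is elementary to compute from a list of spike times; the helper below (with our own naming) illustrates its two limiting cases:

```python
import random

def coefficient_of_variation(spike_times):
    """CV = standard deviation of the interspike intervals divided by
    their mean."""
    isis = [b - a for a, b in zip(spike_times, spike_times[1:])]
    mean = sum(isis) / len(isis)
    var = sum((x - mean) ** 2 for x in isis) / len(isis)
    return var ** 0.5 / mean

# clock-like train (CV -> 0) versus Poisson train (exponential ISIs, CV -> 1)
regular = [0.01 * k for k in range(1, 1001)]
rng = random.Random(4)
t, poisson = 0.0, []
for _ in range(1000):
    t += rng.expovariate(100.0)
    poisson.append(t)
```

For an nth-order gamma interval distribution (as in the random-threshold model below), CV falls off as 1/√n, interpolating between these two extremes.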
The reconstruction noise n̂(t) for the estimation task is given as the difference between the input and the optimal estimate,

\hat{n}(t) = \hat{m}(t) - m(t),   (2.2)
and the reconstruction error E is equal to the variance of n̂(t), E = ⟨n̂²⟩. As shown in Gabbiani (1996) and Gabbiani and Koch (1996), ⟨n̂²⟩ ≤ ⟨m²⟩ for the optimal linear estimator. Following their conventions, we define a normalized dimensionless quantity, called the coding fraction ξ, as follows:

\xi = 1 - \frac{E}{\sigma_m^2}, \qquad 0 \le \xi \le 1.   (2.3)
ξ can be regarded as a quantitative measure of estimation performance; ξ = 0 implies that the input modulations cannot be reconstructed at all, denoting estimation performance at chance, whereas ξ = 1 implies that the input modulations can be perfectly reconstructed. Another measure of estimation performance is the mutual information rate, denoted by I[m(t); s(t)], between the random processes m(t) and s(t) (Cover & Thomas, 1991). The data processing inequality (Cover & Thomas, 1991) maintains that the mutual information rate between the input m(t) and the spike train s(t) is greater than the mutual information between m(t) and its optimal estimate m̂(t), I[m(t); s(t)] ≥ I[m(t); m̂(t)]. When the input m(t) is gaussian, it can be shown (Gabbiani, 1996; Gabbiani & Koch, 1996) that I[m(t); m̂(t)] (and thus I[m(t); s(t)]) is bounded below by

I[m(t); \hat{m}(t)] \ge I_{LB} = \frac{1}{2} \int_{-\infty}^{\infty} df \, \log_2\bigl[\mathrm{SNR}(f)\bigr] \quad (\text{bits sec}^{-1}),   (2.4)
where SNR(f) is the signal-to-noise ratio defined as

\mathrm{SNR}(f) = \frac{S_{mm}(f)}{S_{\hat{n}\hat{n}}(f)}.   (2.5)
S_mm(f) and S_n̂n̂(f) are the power spectral densities of the input m(t) and the noise n̂(t). In our simulations, SNR(f) may be greater than one since we are separately adjusting the mean and variance of the input signal and the bandwidth of the noise. The lower bound on the information rate, I_LB, lies in the interval [0, ∞). The lower limit, I_LB = 0, corresponds to chance performance (ξ = 0), whereas the upper limit, I_LB = ∞, corresponds to perfect estimation (ξ = 1). The information rate denotes the amount of information about the input (measured in units of bits) that can be reliably transmitted per second in
the form of spike trains. Clearly, it depends on the rate at which spikes are generated: the higher the mean firing rate, the higher is the maximum amount of information that can be transmitted per second. Thus, in order to eliminate this extrinsic dependence on the mean firing rate, we define a quantity, I_S = I_LB/λ, which measures the amount of information communicated per spike on average. I_S is measured in units of bits per spike. Thus, the coding fraction (ξ) and the information rates (I_LB, I_S) can be used to assess the ability of the spiking models to encode time-varying inputs in the specific context of signal estimation (Schneidman, Segev, & Tishby, 2000). 2.3 Models of Spike Encoding. We have previously reported the coding efficiency for a noisy nonadapting integrate-and-fire model, as well as for a stochastic version of the Hodgkin-Huxley kinetic scheme (Steinmetz et al., in press). A major goal of this work was to expand this analysis to use more biophysically realistic encoders.
2.3.1 Integrate-and-Fire Model. Integrate-and-fire models (I&F) are simplified, phenomenological descriptions of spiking behavior in biological neurons (Tuckwell, 1988). They retain two important aspects of neuronal firing: a subthreshold regime, where the input to the neuron is passively integrated, and a voltage threshold, which, when exceeded, leads to the generation of stereotypical spikes. Although I&F models are physiologically inaccurate, they are often used to model biological spike trains because of their analytical tractability. Real neurons show evidence of firing-rate adaptation; their firing rate decreases with time in response to constant, steady inputs. Such adaptation can be caused by processes like the release of neurotransmitters and neuromodulators and the presence of specific ionic currents (Ca²⁺-dependent, slow K⁺), among others. Wehmeier, Dong, Koch, & van Essen (1989) introduced an I&F model with a purely time-dependent shunting conductance, g_adapt, with a reversal potential equal to the resting potential to account for short-term (10–50 millisecond) adaptation. Each spike increases g_adapt by a fixed amount G_inc. Between spikes, g_adapt decreases exponentially with a time constant τ_adapt. This models the effect of a membrane Ca²⁺-dependent potassium conductance, reproducing the effect of a relative refractory period following spike generation. We refer to this model as an adapting integrate-and-fire model. An absolute refractory period t_ref is required in order to mimic very short-term adaptation. In the subthreshold domain, the membrane voltage V_m is given by

C \frac{dV_m}{dt} + \frac{V_m (1 + R\,g_{adapt})}{R} = i(t),   (2.6)

\frac{dg_{adapt}}{dt} + \frac{g_{adapt}}{\tau_{adapt}} = G_{inc} \sum_i \delta(t - t_i).   (2.7)
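A forward-Euler discretization of equations 2.6 and 2.7 makes the model concrete. The sketch below is our own (it keeps the threshold deterministic for brevity; the parameter values follow Table 1, in SI units):

```python
def simulate_adapting_iaf(i_input, t_end, dt=0.5e-3,
                          v_th=16.4e-3, cap=0.207e-9, res=38.3e6,
                          t_ref=2.68e-3, g_inc=20.4e-9, tau_adapt=52.3e-3):
    """Forward-Euler integration of the adapting I&F model
    (equations 2.6 and 2.7); returns the list of spike times."""
    vm, g, spikes, last = 0.0, 0.0, [], -1.0
    t = 0.0
    while t < t_end:
        if t - last <= t_ref:
            vm = 0.0  # integrator held at reset during the refractory period
        else:
            # eq. 2.6: C dVm/dt = i(t) - Vm (1 + R g_adapt) / R
            vm += dt * (i_input(t) - vm * (1.0 + res * g) / res) / cap
        # eq. 2.7 between spikes: dg/dt = -g / tau_adapt
        g -= dt * g / tau_adapt
        if vm >= v_th:
            spikes.append(t)
            last = t
            vm = 0.0
            g += g_inc  # each spike increments g_adapt by G_inc
        t += dt
    return spikes

# constant 1 nA drive for one second
spikes = simulate_adapting_iaf(lambda t: 1.0e-9, 1.0)
```

Because g_adapt accumulates, the interspike intervals lengthen over the first ~100 ms before settling — the firing-rate adaptation described above. Drawing the threshold from a gamma distribution at each reset turns this into the random-threshold variant described below.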
Table 1: Parameters for the Leaky Integrate-and-Fire Model.

  V̄_th = 16.4 mV       C = 0.207 nF        R = 38.3 MΩ
  t_ref = 2.68 msec    G_inc = 20.4 nS     τ_adapt = 52.3 msec
  Δt = 0.5 msec
When V_m reaches the voltage threshold V_th at time t_i, a spike is generated if t_i − t_{i−1} > t_ref, where t_{i−1} is the time of the previous spike. When a spike is generated, g_adapt(t_i) is also incremented by G_inc. The model is completely characterized by six parameters: V_th, C, R, t_ref, G_inc, and τ_adapt. The values used for our simulations are adapted from Koch (1999) and are given in Table 1. Biological neurons in vivo show a substantial variability in the exact timing of action potentials to identical stimulus presentations (Calvin & Stevens, 1968; Softky & Koch, 1993; Shadlen & Newsome, 1998). A simple modification to reproduce the random nature of biological spike trains is to regard the voltage threshold V_th as a random variable drawn from some arbitrary probability distribution p(V_th) (Holden, 1976). We refer to this class as integrate-and-fire models with random threshold. In general, p(V_th) can be arbitrary, but here we assume that it is an nth-order gamma distribution,

p_n(V_{th}) = c_n \left( \frac{V_{th}}{\bar{V}_{th}} \right)^{n-1} \exp\left( \frac{-n V_{th}}{\bar{V}_{th}} \right),   (2.8)

with

c_n = \frac{1}{(n-1)!} \, \frac{n^n}{\bar{V}_{th}},
where V̄_th denotes the mean voltage threshold. The order of the distribution, n, determines the variability of spike trains in response to the injection of a constant current. Thus, one can obtain spike trains of varying regularity by modifying n. For constant current injection and in the absence of a refractory period, the CV varies from CV = 1 to CV = 0 as n is increased from n = 1 to n = ∞ (corresponding to a deterministic threshold). A schematic diagram of the adapting I&F model with random threshold used in this article is shown in Figure 1A. 2.3.2 Stochastic Ion Channel Model. While a proper adjustment of model parameters allows I&F models to provide a fairly accurate description of
the firing properties of some cortical neurons (Stevens & Zador, 1998; Koch, 1999), many neurons cannot be modeled by I&F models. Nerve membranes contain several voltage- and ligand-gated ionic currents, which are responsible for a variety of physiological properties that phenomenological models fail to capture. The successful elucidation of the ionic basis underlying neuronal excitability in the squid giant axon by Hodgkin and Huxley (1952) led to the development of more sophisticated mathematical models that described the initiation and propagation of action potentials by explicitly modeling the different ionic currents flowing across a neuronal membrane. In the original Hodgkin and Huxley model, membrane currents were expressed in terms of macroscopic deterministic conductances representing the selective permeabilities of the membrane to different ionic species. However, it is now known that the macroscopic currents arise as a result of the summation of stochastic microscopic currents flowing through a multitude of ion channels in the membrane. Ion channels have been modeled as finite-state Markov chains with state transition probabilities proportional to the kinetic transition rates between different conformational states (Skaugen & Walløe, 1979; Clay & DeFelice, 1983; Strassberg & DeFelice, 1993). In earlier research, we studied the influence of the stochastic nature of voltage-gated ion channels in excitable neuronal membranes on subthreshold membrane voltage fluctuations (Steinmetz, Manwani, Koch, London, & Segev, 2000; Manwani, Steinmetz, & Koch, 2000). Here we are interested in assessing the influence of variability in spike timing on the ability of noisy spiking mechanisms to encode time-varying inputs, in the context of stochastic ion channel models. For voltage-gated ion channels, the kinetic transition rates (and other parameters determined by them) are functions of the membrane voltage V_m.
As in the case of the I&F model, a band-limited white noise current μ(t) is injected into a patch of membrane containing stochastic voltage-gated ion channels, and Monte Carlo simulations are carried out to determine the response of the model to random suprathreshold stimuli (see Figure 1B). The ion channel model we consider here is a stochastic counterpart of the single-compartment model of a regular-spiking cortical neuron developed in Golomb and Amitai (1997). The original version consists of a fast sodium current, a persistent sodium current, a delayed-rectifier potassium current, an A-type potassium current (for adaptation), a slow potassium current, a passive leak current, and excitatory synaptic (AMPA- and NMDA-type) currents. We are interested in the variability due to the stochastic nature of ion channels, and so here we assume that the synaptic currents are absent. In order to simulate stochastic Markov models of the ion channel kinetics associated with five voltage-dependent ionic currents, we performed Monte Carlo simulations of single-compartment models of membrane patches of area A. The Markov models correspond to equations A1 through A20 of Golomb and Amitai (1997) and were constructed using methods given in Skaugen and Walløe (1979), Skaugen (1980), and Steinmetz et al. (2000). The particular parameters used for these simulations are given in Table 2.

Table 2: Parameters for the Stochastic Ion Channel Model.

  Current                                 Gating model     g (nS/μm²)   γ (pS)
  Na+ current (I_Na)                      2 states for h   0.24         0.18
  Persistent Na+ current (I_NaP)          Deterministic    --           --
  Delayed-rectifier K+ current (I_Kdr)    5 states for n   0.24         0.21
  A-type K+ current (I_KA)                2 states for b   0.24         0.020
  Slow K+ current (I_K-slow)              2 states for z   0.24         0.20
  Leakage current (I_L)                   Deterministic    --           --

We chose a sufficiently small time step for the simulation so that the membrane voltage can be assumed to be relatively constant over the duration of the step. The voltage-dependent state transition probabilities are computed for each time step.¹ Knowledge of the transition probabilities between states is used to determine the modified channel populations occupying different states. This is done by drawing random numbers specifying the number of channels making transitions between any two states from multinomial distributions parameterized by the transition probabilities. The membrane conductance due to a specific ion channel is determined by tracking the number of members of the given channel type that are in the open state. The membrane current is integrated over the time step to compute the membrane voltage for the next time step. This procedure is applied iteratively to obtain the membrane voltage trajectory in response to the input current waveform. For a detailed description of the Monte Carlo simulations, see Schneidman et al. (1998) and Steinmetz et al. (2000). The voltage trajectory is transformed into a point process by considering the instance of the voltage crossing a threshold (here, 10 mV with respect to resting potential) as a spike occurrence. Thresholding allows us to treat the output of the model as a sequence of spike times rather than as membrane voltage modulations. This simple recipe for detecting spikes works well for the model we consider here.

¹ The transition probabilities are computed by multiplying the corresponding rates by the length of the time step, assuming the product is much smaller than one.
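The multinomial update described above reduces, for a hypothetical two-state (closed/open) channel, to drawing binomial transition counts at each time step. The following sketch uses rate values and a population size chosen purely for illustration:

```python
import numpy as np

def mc_channel_step(n_closed, n_open, alpha, beta, dt, rng):
    """One Monte Carlo step for a population of two-state ion channels.

    alpha, beta: voltage-dependent opening/closing rates (1/s) at the
    current membrane voltage. Transition counts are drawn from binomial
    distributions (the two-state special case of the multinomial sampling
    described in the text); alpha*dt and beta*dt must be << 1."""
    n_co = rng.binomial(n_closed, alpha * dt)  # closed -> open transitions
    n_oc = rng.binomial(n_open, beta * dt)     # open -> closed transitions
    return n_closed - n_co + n_oc, n_open + n_co - n_oc
```

Iterating this step while tracking the open count gives the fluctuating conductance g(t) = n_open·γ; the open fraction relaxes toward α/(α+β), with fluctuations that shrink as the channel population (i.e., the patch area) grows.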
Signal Encoding in Noisy Spiking Models
3 Results
We carried out simulations for the I&F model and the stochastic ion channel model and recorded the output spike times in response to the injection of a current equal to a mean current, I, plus pseudo-random, gaussian, band-limited noise (with power spectrum Sμμ(f) over the bandwidth f ∈ (0, Bm]). We then computed the CV of the interspike interval distribution and the coding fraction, ξ, for the signal estimation task. CV measures the variability of the spike train in response to the input, whereas the coding fraction quantifies the fraction of the variability in the spike train that is functionally useful to reconstruct the input modulations. Once again, our goal is to understand how spike timing variability influences performance in a specific biological information processing task (here, a signal estimation task).

3.1 Dependence of Variability on Firing Rate and Bandwidth. First, we explored the dependence of the coefficient of variability of the interspike interval, CV, on the mean firing rate λ of the spiking models. Figure 3A shows the CV as a function of λ for an input bandwidth of Bm = 50 Hz. The mean firing rate λ for the I&F model depends only on the mean injected current I, whereas for a membrane patch containing stochastic ion channels, λ depends on a variety of additional parameters, such as the area of the patch A, the standard deviation of the stochastic input noise current σm, and the bandwidth of the input Bm. For both models, we varied λ while maintaining the contrast of the input, defined as c = σm/I, constant at c = 1/3. In both cases, CV increased monotonically with mean firing rate. When the contrast is kept constant, an increase in I (to increase the mean firing rate) requires a corresponding increase in the magnitude of the fluctuations σm. This results in an increase in the amplitude of firing-rate modulations, allowing the input to be estimated more accurately. Figure 3B shows the CV for the two models as a function of the bandwidth of the input Bm.
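The input current used throughout (gaussian, white over (0, Bm], with contrast c = σm/I) can be synthesized in the frequency domain. The following is a sketch with function and parameter names of our own choosing, not the authors' code:

```python
import numpy as np

def bandlimited_noise(n, dt, bandwidth, mean_i, contrast, rng):
    """Gaussian, white, band-limited current: flat spectrum on (0, Bm],
    mean mean_i, standard deviation contrast * mean_i (c = sigma_m / I).
    Illustrative sketch of the stimulus described in the text."""
    freqs = np.fft.rfftfreq(n, dt)
    spec = (rng.standard_normal(freqs.size)
            + 1j * rng.standard_normal(freqs.size))
    spec[(freqs == 0) | (freqs > bandwidth)] = 0.0  # band-limit to (0, Bm]
    x = np.fft.irfft(spec, n)
    x *= contrast * mean_i / x.std()                # set sigma_m = c * I
    return mean_i + x
```

Injecting such a current into either encoder and thresholding the response yields the spike trains analyzed below.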
For both models, the total noise power was held constant at all bandwidths, and I was adjusted so that the mean firing rate λ was approximately equal to 50 Hz. In both cases, CV decreases with increasing Bm, in qualitative agreement with earlier experimental (Mainen & Sejnowski, 1995) and computational (Schneidman et al., 1998) findings demonstrating an inverse relationship between spike timing precision (using a measure different from CV) and the temporal bandwidth of the input. Strong temporal dynamics in the input to a neuron dwarf the effect of the inherent noise in the spiking mechanism and regularize the spike train. Within the class of I&F models, models with higher n fire more regularly and have lower CV values, as expected.

Figure 3: Variability and coding efficiency of spiking models. (A) CV of the interspike interval distribution of the spike train as a function of the mean firing rate λ of the spike train. The input to the model is a gaussian, white, band-limited (bandwidth Bm = 50 Hz) input with mean I and standard deviation σm. The mean firing rate of the model is varied by changing the mean current I while maintaining the contrast of the input, defined as c = σm/I, constant (c = 1/3). The solid curves correspond to the adapting I&F model for different values of the order n of the gamma-distributed voltage threshold distribution (n = ∞ corresponds to a deterministic threshold). The dotted curve corresponds to a 1000 μm² membrane patch containing stochastic ion channels. (B) CV as a function of input bandwidth Bm. λ for both models was maintained at 50 Hz. (C, D) The dependence of the coding fraction ξ in the signal estimation task for the two types of spiking models on the mean firing rate λ (for Bm = 50 Hz) and the input bandwidth Bm (for λ = 50 Hz), respectively. Model parameters are summarized in the caption of Figure 1.

3.2 Dependence of Encoding Performance on Firing Rate and Bandwidth. Next, we explored the dependence of coding efficiency on the mean firing rate and the input bandwidth. In Figures 3C and 3D, the coding fraction ξ is plotted as a function of λ (for Bm = 50 Hz) and Bm (for λ = 50 Hz), respectively. For both spike-encoding mechanisms, ξ increases with mean firing rate and decreases with input bandwidth. One interpretation of the previously observed decrease in variability of spike timing with input bandwidth (see Figure 3B) is that it suggests an improvement in coding: that, in a sense, the neuron prefers higher bandwidths (Schneidman, Freedman, & Segev, 1997). By contrast, we find that the noisy spike-encoding models considered here encode slowly varying stimuli more effectively than rapid ones. We believe that in the context of a signal estimation task, this is a generic property of all models that encode continuous signals as firing-rate modulations of discrete spike trains.

3.3 Dependence of Mutual Information on Firing Rate and Bandwidth.
Next, we explored the dependence of the mutual information rates on the mean firing rate and input bandwidth. Figures 4A and 4B, respectively, show that the lower bound of the mutual information rate, I_LB, increases with λ and Bm. This behavior can be better understood in the light of the phenomenological expression I_LB = Bm log2(1 + κλ/Bm), where κ is a constant that depends on the details of the encoding scheme. The above expression is exact when the instantaneous firing rate of the model is a linear function of the input (as in the case of the perfect I&F model) and the input is a white, band-limited gaussian signal with bandwidth Bm. For a Poisson model without adaptation, κ = c²/2. (Details of the derivation of this expression are provided in Manwani and Koch, 2001.) From the above expression, one can deduce that for low firing rates, I_LB increases linearly with λ, but at higher rates, I_LB grows only logarithmically with λ. One can also conclude that I_LB increases with Bm for small bandwidths but quickly saturates at high bandwidths at the value κλ/ln 2. This agrees qualitatively with Figures 4A and 4B. The dependence of the information rate per spike, I_S, on λ and Bm can be similarly explored. The expression for I_LB is sublinear in λ, so I_S should decrease monotonically with firing rate when the bandwidth Bm is held fixed. In fact, its maximum value, I_S = κ, occurs at λ = 0. On the other hand, when λ is held fixed, I_S should increase with Bm initially but saturate at high bandwidths at κ/ln 2. Once again, Figures 4C and 4D agree qualitatively with these predictions.

3.4 Relationship Between Variability and Encoding Performance. In order to understand further the role of variability in the context of signal estimation, we plot measures of coding efficiency (ξ and I_LB/Bm) versus the corresponding CV values for the two models as different parameters are varied.
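The limiting behavior of the phenomenological expression for I_LB above is easy to check numerically; the sketch below sets κ to the Poisson value c²/2 with c = 1/3 purely for illustration:

```python
import numpy as np

def info_rate_lb(rate, bandwidth, kappa):
    """Phenomenological lower bound I_LB = Bm * log2(1 + kappa*rate/Bm),
    in bits per second; kappa depends on the encoding scheme."""
    return bandwidth * np.log2(1.0 + kappa * rate / bandwidth)

kappa = (1.0 / 3.0) ** 2 / 2.0   # Poisson model without adaptation, c = 1/3
```

Evaluating this function confirms the trends read off from the formula: near-linear growth of I_LB with λ at low rates, monotonic decrease of I_S = I_LB/λ with λ at fixed Bm, and saturation at κλ/ln 2 as Bm grows.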
Figures 5A and 5B show the dependence of coding performance on the variability of spike timing for the I&F model, and Figures 5C and 5D show the corresponding behaviors for the stochastic ion channel model. For both models, estimation performance improved with variability when the mean firing rate was increased or the input bandwidth was decreased. This implies that the variability in the output spike train represents faithful encoding of the input modulations, and thus greater variability leads to better signal estimation. On the other hand, when either the order of the gamma distribution for the I&F model or the area of the membrane patch for the stochastic ion channel model was decreased, coding performance decreased with variability. This suggests that here the variability is due to noise (randomness of the spiking threshold or of channel gating). Thus, we find that the effect of spike timing variability on the coding efficiency of spiking models may be beneficial or detrimental; the direction of its influence depends on the specific nature of the signal processing task the neuron is expected to perform (signal estimation here) and the parameter that is varied.

Figure 4: Information rates in signal estimation for spiking models. (A) Lower bound of the information rate, I_LB, for the two spiking model classes considered in this article. The solid curves correspond to the adapting I&F model for different values of n, and the dotted curve corresponds to the stochastic ion channel model. As in Figure 3, the input is a band-limited gaussian process with bandwidth Bm = 50 Hz. (B) I_LB as a function of the input bandwidth Bm for λ = 50 Hz. (C) The mutual information transmitted per spike on average, I_S = I_LB/λ, as a function of λ (Bm = 50 Hz). (D) I_S as a function of input bandwidth Bm for λ = 50 Hz. Model parameters are summarized in the caption of Figure 1.
Figure 5: Is it signal or is it noise? Parametric relationships between measures of coding efficiency and the variability of the spike train as different parameters were varied for the two spiking models. (A) Coding fraction ξ and (B) mutual information transmitted per input time constant, I_LB/Bm, for the I&F model as a function of the CV of the spike train. Squares plot results as the input bandwidth was varied between 10 and 150 Hz. Open circles plot results as the mean input was varied to change the firing rate from 40 to 92 Hz. Filled circles show results as the order of the gamma distribution of thresholds was varied from 2 to infinity. The increase in estimation performance with CV when the mean firing rate λ was increased or the input bandwidth Bm was decreased (with n = ∞ for the I&F model) suggests that the variability arises as a result of faithful encoding of the input and thus represents signal, whereas a decrease with CV when the order n of the threshold distribution was decreased suggests that the variability impedes encoding and thus represents noise. (C) Coding fraction ξ and (D) mutual information transmitted per input time constant, I_LB/Bm, for the stochastic ion channel model as a function of the CV of the spike train as the mean firing rate λ (A = 1000 μm², Bm = 50 Hz, open circles), input bandwidth Bm (A = 1000 μm², λ = 50 Hz, open squares), and the area of the patch A (Bm = 50 Hz, λ = 50 Hz, filled circles) were varied.
3.5 Mean Rate Code in Signal Estimation. Figures 6A and 6B demonstrate that for the spiking models we have considered here, performance in the signal estimation task is determined by the ratio λ/Bm and not by the absolute values of λ and Bm. The quantity λ/Bm represents the number of spikes observed during an input time constant, a time interval over which the input is relatively constant. Thus, the larger the number of spikes available for the estimation task, the better the estimate of the neuron's instantaneous firing rate λ(t) and, consequently, the better the estimate of the instantaneous value of the input μ(t). This suggests that the relevant variable that encodes the input modulations is the neuron's (instantaneous) firing rate computed over time intervals of length 1/Bm. Furthermore, the more action potentials available per input time constant, the more efficiently the input can be estimated using a mean rate code.

Figure 6: Coding efficiency as a function of λ/Bm. (A) Coding fraction ξ and (B) mutual information transmitted per input time constant, I_LB/Bm, for the two spiking models as a function of the mean number of spikes available per input time constant, λ/Bm, for different combinations of Bm and λ (empty symbols: Bm varied; filled symbols: λ varied). The solid curves correspond to the adapting I&F model (different symbols represent different values of the order n of the voltage threshold gamma distribution), whereas the dotted curve corresponds to a 1000 μm² membrane patch containing stochastic ion channels. The contrast of the input, c, was maintained at one-third.

4 Discussion
In this article, we use two types of noisy spiking models to study the influence of spike timing variability on the ability of single neurons to encode time-varying signals in their output spike trains. Here we extend the preliminary results for simplified channel models reported previously (Steinmetz et al., in press) to more biophysically realistic models of spike encoding. For both I&F models with noisy thresholds and stochastic ion channel encoders, we find that decreased spike timing variability, as assayed using the CV of the interspike interval distribution, does not necessarily translate into an increase in performance for all signal processing tasks. These results show that although the variability of spike timing decreases for these encoders as the bandwidth of the input is increased, the ability to estimate random continuous signals drops. Conversely, an increase in variability with firing rate is accompanied by an increase in estimation performance.

A similar connection between increased variability and increased signal detection performance is observed in systems that exhibit stochastic resonance (Chialvo, Dykman, & Millonas, 1995; Chialvo, Longtin, & Muller-Gerking, 1997; Collins, Chow, & Imhoff, 1995a, 1995b; Henry, 1999; Russell, Wilkens, & Moss, 1999). In these systems, the addition of noise both increases output variability and improves signal transduction. In the encoding models studied here, noise is added by the stochastic nature of the ion channels responsible for action potential production, which could evoke stochastic resonance for specific encoding tasks. For hippocampal CA1 cells, the addition of synaptic noise has recently been shown to evoke stochastic resonance when detecting periodic pulse trains (Stacey & Durand, 2000).

The observed trends of CV as a function of firing rate (cf. Figure 3) are in agreement with those previously reported (Christodoulou & Bugmann, 2000; Tiesinga, Jose, & Sejnowski, 2000) and correspond to an encoder driven by gaussian noise, but not to a system driven by a Poisson process (Tiesinga et al., 2000). We have previously shown that for subthreshold voltages, noise generated by stochastic channel models is well approximated by a gaussian distribution (Steinmetz et al., 2000); the combination of these observations suggests that a gaussian distribution may serve as a good approximation for suprathreshold effects as well. The trends of coding fraction as a function of input bandwidth are also in qualitative agreement with experimental measurements of the coding fraction in cortical pyramidal cells reported by Fellous et al. (2001), although there are differences in the input signal, which was sinusoidal in those experiments.

While interpreting these results, a few limitations must be borne in mind. We measure coding performance for gaussian, white, band-limited inputs in the context of a specific signal estimation paradigm. Generally, the computation performed by real neurons in the brain and the statistical properties of the milieu of signals in which they operate are difficult to determine. Estimation performance using white noise does not shed light on the operation of neural systems highly specialized to detect specific input patterns or those optimized to process natural and ecologically relevant signals with specific statistical properties. However, in the absence of knowledge regarding the role of a single neuron, the coding fraction for white noise stimuli represents a convenient metric to quantify its behavior. The second general limitation is that we employ simple linear decoding to recover the input from the spike train, which is inferior in performance to general nonlinear decoding mechanisms. However, it has been argued that when the mean rate or some function of it is the relevant encoding variable, the difference in performance between linear and nonlinear estimators is marginal (Rieke, Warland, van Steveninck, & Bialek, 1997), and the coding fraction is then a good indicator of coding efficiency. Finally, this study assumes that information is encoded at the level of single neurons. The investigation of the role of variability in the ability of a population of neurons to encode information is a significantly more complicated and interesting problem, since temporal synchrony between groups of neurons in a population may enhance or decrease coding efficiency. We are currently investigating this issue.

Earlier studies have shown that neurons fire more predictably and precisely when their inputs have richer temporal structure (Mainen & Sejnowski, 1995; Schneidman et al., 1998).
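The linear decoding step can be illustrated with a frequency-domain (Wiener-filter) estimate of the coding fraction. This is our own minimal sketch, not the authors' code; it averages spectra over segments so the filter cannot trivially fit the noise:

```python
import numpy as np

def coding_fraction(stim, resp, seg=512):
    """Fraction of stimulus variance captured by the optimal linear
    estimator of stim from resp: xi = 1 - E[(s - s_est)^2] / Var[s].
    Cross- and auto-spectra are averaged over segments; the optimal
    linear filter is H(f) = S_sr(f) / S_rr(f)."""
    n = (len(stim) // seg) * seg
    s = (stim[:n] - stim[:n].mean()).reshape(-1, seg)
    r = (resp[:n] - resp[:n].mean()).reshape(-1, seg)
    S, R = np.fft.rfft(s, axis=1), np.fft.rfft(r, axis=1)
    Ssr = (S * np.conj(R)).mean(axis=0)           # cross-spectrum estimate
    Srr = (R * np.conj(R)).real.mean(axis=0)      # response power spectrum
    Sss = (S * np.conj(S)).real.mean(axis=0)      # stimulus power spectrum
    captured = np.abs(Ssr) ** 2 / (Srr + 1e-300)  # power explained per bin
    return 1.0 - (Sss - captured).sum() / Sss.sum()
```

For a response equal to the stimulus plus independent equal-power noise, this estimator returns ξ near 1/2, while an unrelated response yields ξ near 0, matching the interpretation of the coding fraction used in the text.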
It has also been argued that the ability to alter the accuracy of their representation depending on the nature of their inputs may enable neurons to adapt to the statistical structure of their ecological environment and act as "smart encoders" (Schneidman et al., 1998; Brenner, Bialek, & de Ruyter van Steveninck, 2000; Cecchi et al., 2000). One example would be using a coarse rate code to encode slowly varying input signals but a fine temporal code to encode rapidly varying inputs. While we agree with the basic premise of the argument, viewing variability from a different paradigm, namely signal estimation, and assaying variability with the CV suggests that variability can represent the input signal in biophysically plausible encoders; thus, encoding efficiency increases with increasing variability for several of the encoders examined here. For those encoders, spike timing reliability decreases with increasing CV, as shown in Figure 2, so in these cases reliability decreases as coding efficiency increases, providing one counterexample in which an increase in reliability does not lead to better performance in a signal reconstruction task. This leads us to argue that the role of variability depends intimately on the nature of the information processing task and the nature of the spike encoder. These results also highlight the need to further measure and understand biophysical noise sources and the mechanisms of computation in cortical neurons.

Acknowledgments
This work was funded by the NSF Center for Neuromorphic Systems Engineering at Caltech, the NIMH, and the Sloan-Swartz Center for Theoretical Neuroscience. We thank Idan Segev, Michael London, and Elad Schneidman for their invaluable suggestions.

References

Abeles, M. (1990). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Bialek, W., Rieke, F., van Steveninck, R. R. D., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Calvin, W. H., & Stevens, C. F. (1968). Synaptic noise and other sources of randomness in motoneuron interspike intervals. J. Neurophysiol., 31, 574–587.
Cecchi, G. A., Mariano, S., Alonso, J. M., Martinez, L., Chialvo, D. R., & Magnasco, M. O. (2000). Noise in neurons is message-dependent. PNAS, 97(10), 5557–5561.
Chialvo, D. R., Dykman, M. I., & Millonas, M. M. (1995). Fluctuation-induced transport in a periodic potential: Noise versus chaos. Physical Review Letters, 78(8), 1605.
Chialvo, D. R., Longtin, A., & Muller-Gerking, J. (1997). Stochastic resonance in models of neuronal ensembles. Physical Review E, 55(2), 1798–1808.
Christodoulou, C., & Bugmann, G. (2000). Near Poisson-type firing produced by concurrent excitation and inhibition. Biosystems, 58, 41–48.
Clay, J. R., & DeFelice, L. J. (1983). Relationship between membrane excitability and single channel open-close kinetics. Biophys. J., 42, 151–157.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995a). Aperiodic stochastic resonance in excitable systems. Physical Review E, 52(4), R3321–R3324.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995b). Stochastic resonance without tuning. Nature, 376(6537), 236–238.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Fellous, J. M., Houweling, A. R., Modi, R. H., Rao, R. P. N., Tiesinga, P. H. E., & Sejnowski, T. J. (2001). Frequency dependence of spike timing reliability in cortical pyramidal cells and interneurons. Journal of Neurophysiology, 85, 1782–1787.
Gabbiani, F. (1996). Coding of time-varying signals in spike trains of linear and half-wave rectifying neurons. Network: Comput. Neural Syst., 7, 61–85.
Gabbiani, F., & Koch, C. (1996). Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Comput., 8, 44–66.
Golomb, D., & Amitai, Y. (1997). Propagating neuronal discharges in neocortical slices: Computational and experimental study. J. Neurophysiol., 78, 1199–1211.
Henry, K. R. (1999). Noise improves transfer of near-threshold, phase-locked activity of the cochlear nerve: Evidence for stochastic resonance? J. Comp. Physiol. (A), 184(6), 577–584.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Holden, A. V. (1976). Models of the stochastic activity of neurones. New York: Springer-Verlag.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Manwani, A., & Koch, C. (2001). Detecting and estimating signals over noisy and unreliable synapses: Information-theoretic analysis. Neural Comput., 13, 1–33.
Manwani, A., Steinmetz, P. N., & Koch, C. (2000). Channel noise in excitable neuronal membranes. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 143–149). Cambridge, MA: MIT Press.
Rieke, F., Warland, D., van Steveninck, R. D. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Russell, D. F., Wilkens, L. A., & Moss, F. (1999). Use of behavioural stochastic resonance by paddlefish for feeding. Nature, 402(6759), 291–294.
Schneidman, E., Freedman, B., & Segev, I. (1997). Ion-channel stochasticity may be a critical factor in determining the reliability of spike timing. Neurosci-L Supplement, 48, 543–544.
Schneidman, E., Freedman, B., & Segev, I. (1998). Ion-channel stochasticity may be critical in determining the reliability and precision of spike timing. Neural Comput., 10, 1679–1703.
Schneidman, E., Segev, I., & Tishby, N. (2000). Information capacity and robustness of stochastic neuron models. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Skaugen, E. (1980). Firing behavior in stochastic nerve membrane models with different pore densities. Acta Physiol. Scand., 108, 49–60.
Skaugen, E., & Walløe, L. (1979). Firing behavior in a stochastic nerve membrane model based upon the Hodgkin-Huxley equations. Acta Physiol. Scand., 107, 343–363.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stacey, W. C., & Durand, D. M. (2000). Stochastic resonance improves signal detection in hippocampal CA1 neurons. J. Neurophysiol., 83(3), 1394–1402.
Steinmetz, P. N., Manwani, A., & Koch, C. (in press). Variability and coding efficiency of noisy neural spike encoders. Biosystems.
Steinmetz, P. N., Manwani, A., Koch, C., London, M., & Segev, I. (2000). Subthreshold voltage noise due to channel fluctuations in active neuronal membranes. J. Comput. Neurosci., 9, 133–148.
Stevens, C. F., & Zador, A. M. (1998). Novel integrate-and-fire-like model of repetitive firing in cortical neurons. In Proceedings of the 5th Joint Symposium on Neural Computation (La Jolla, CA).
Strassberg, A. F., & DeFelice, L. J. (1993). Limitations of the Hodgkin-Huxley formalism: Effect of single channel kinetics on transmembrane voltage dynamics. Neural Comput., 5, 843–855.
Theunissen, F., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comput. Neurosci., 2, 149–162.
Tiesinga, P. H. E., Jose, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Physical Review E, 62(6), 8413–8419.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology II: Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
van Steveninck, R. R. D., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Wehmeier, U., Dong, D., Koch, C., & van Essen, D. (1989). Modeling the mammalian visual system. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.

Received June 20, 2000; accepted May 2, 2001.
LETTER
Communicated by Wulfram Gerstner
Temporal Correlations in Stochastic Networks of Spiking Neurons

Carsten Meyer
[email protected]
Carl van Vreeswijk
[email protected]
Racah Institute of Physics and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

The determination of temporal and spatial correlations in neuronal activity is one of the most important neurophysiological tools to gain insight into the mechanisms of information processing in the brain. Its interpretation is complicated by the difficulty of disambiguating the effects of architecture, single-neuron properties, and network dynamics. We present a theory that describes the contribution of the network dynamics in a network of "spiking" neurons. For a simple neuron model including refractory properties, we calculate the temporal cross-correlations in a completely homogeneous, excitatory, fully connected network in a stable, stationary state, for stochastic dynamics in both discrete and continuous time. We show that even for this simple network architecture, the cross-correlations exhibit a large variety of qualitatively different properties, strongly dependent on the level of noise, the decay constant of the refractory function, and the network activity. At the critical point, the cross-correlations oscillate with a frequency that depends on the refractory properties or decay exponentially with a diverging damping constant (for "weak" refractory properties). We also investigate the effect of the synaptic time constants. It is shown that these time constants may, apart from their influence on the asymmetric peak arising from the direct synaptic connection, also affect the long-term properties of the cross-correlations.
1 Introduction
Neural Computation 14, 369–404 (2001) © 2001 Massachusetts Institute of Technology

One of the most important tools in current neurophysiological brain research is the determination of temporal and spatial correlations in neuronal activity. In 1967, Perkel, Gerstein, and Moore (1967a, 1967b) showed that the characteristics of the interactions between two neurons are reflected in their temporal correlations. This work proposed that the spike correlations are a physiological indication for anatomical connections between the neurons. Many studies have used this approach to study the anatomy of different cortical areas (Toyama, Kimura, & Tanaka, 1981a, 1981b; Ts'o, Gilbert, & Wiesel, 1986; Reid & Alonso, 1995). However, it should be recognized that the spike correlations between neurons depend not only on the network architecture, but also on the properties of the individual neurons, as well as the dynamical state of the network. Indeed, in prefrontal cortex, the correlation in activity can vary with the task that is to be performed (Vaadia et al., 1995). And in sensory cortices, the level of correlation is strongly stimulus dependent (Gray, König, Engel, & Singer, 1989; deCharms & Merzenich, 1996; for a comprehensive overview, see Gray, 1999). These variations in spike correlation presumably reflect differences in the dynamical state of the cortical network. Unfortunately, the interplay among the factors in generating the correlations is highly complex, making it difficult to interpret experimentally obtained correlations. Therefore, a theoretical framework is needed to analyze cross-correlations in different functional, anatomical, and dynamical contexts, enabling comparisons to experimental measurements and facilitating the interpretation of observed results.

The importance of the network dynamics for the cross-correlations has been stressed by Ginzburg and Sompolinsky (1994), who showed that the temporal correlations between a pair of neurons may, even in an "asynchronous state," be dominated by the cooperative dynamics of the network the neuron pair belongs to. In this analysis, the network consists of neurons described by asynchronously updated Ising spins. The irregularity of the neuronal activity is described by thermal noise, and it is shown that for noise levels close to the bifurcation point, the cross-correlations become large. A major limitation of this theory is the absence of refractoriness of the neurons. This leads to autocorrelations of the neuron that decay exponentially.
Viewed on a timescale of single spikes (i.e., some milliseconds), this does not even qualitatively correspond to experimental results. Thus, the biological interpretation of their theory in terms of single spikes is not clear. Another problem is that within their model, due to the lack of refractoriness, a single neuron without noise always approaches a state of either maximal or minimal activity. Noise has to be added to induce an intermediate activity level. The aim of this article is to extend the theory of Ginzburg and Sompolinsky to networks of spiking neurons. In our approach, a single neuron is described by the spike response model introduced by Gerstner and van Hemmen (1992), which we call a renewal neuron. We calculate and analyze the cross-correlations in a large but finite stochastic, homogeneous, fully connected excitatory network in a stable stationary state. We show that the temporal correlations depend crucially and in a complex manner on the refractory properties of the neurons, apart from the dependence on the synaptic transmission and the network dynamics. In section 2, we define the network model for dynamics in discrete and continuous time, based on the spike response model. In section 3 we analyze the basic properties of a single renewal neuron, following Gerstner
Stochastic Networks of Spiking Neurons
371
and van Hemmen (1992). In section 4, we summarize the basic properties of a stochastic, homogeneous, fully connected excitatory network of renewal neurons within mean-field theory. Section 5 presents the analysis of the temporal cross-correlations in a large but finite network in a stable, stationary state, for dynamics in both discrete and continuous time. Finally, in section 6, we summarize and discuss our results. The mathematical theory is presented in the appendix.

2 Model
A simple way to introduce relative refractory properties into a neuron model is given within renewal theory (Cox, 1962; Perkel et al., 1967a, 1967b; Stein, 1967a, 1967b; Tuckwell, 1988; Gerstner, 1995). The basic assumption is that the internal state of the neuron, that is, its refractory properties, depends only on the time since the neuron emitted its most recent spike; previous spikes do not contribute. This assumption can be justified if the range of the refractory properties is smaller than the average interspike interval. Due to the renewal assumption, the state of neuron i is completely characterized by two quantities: the neuron's input at time t and a variable \mu_i(t), denoting the time interval that has passed since the neuron emitted its last spike. If \mu_i(t) = \nu, then neuron i emitted its last spike at time t - \nu; \mu_i(t) = 0 is equivalent to the emission of a spike at time t. We investigate two different models, one using discrete time and the other continuous time. The discrete-time model is introduced as the conceptually and theoretically simpler model. It has the advantage of offering more analytical insight and is numerically the more tractable. For reasons of simplicity, a synaptic transmission function is not included in the discrete-time model. The continuous-time model, on the other hand, is biologically more realistic. In the discrete-time model, we assume that the probability that a neuron will fire within the time interval \Delta t is given by P(\text{fire}) = g[h^{\text{in}}(t) - J_r(\mu(t)) - \vartheta]\,\Delta t (Gerstner & van Hemmen, 1992). Here h^{\text{in}} is the part of the potential due to the various inputs into the neuron, \vartheta is the threshold of a neuron at rest, and J_r describes the effects of the last spike, such as an afterhyperpolarizing (AHP) current and a temporary increase in the threshold. We refer to the latter as the refractory properties of the neuron and to J_r as the refractory function. The function g(h) describes the probability of firing, given the net membrane potential h = h^{\text{in}} - J_r(\mu) - \vartheta. It is a sigmoidal function that subsumes the various sources of noise in the system. This sigmoid is an increasing function that goes from 0 for h \to -\infty to 1/\Delta t for h \to \infty. Stochastic dynamics in discrete time (Gerstner & van Hemmen, 1992) is given by

P(\mu_i(t + \Delta t) = 0 \mid \{\mu(t)\}) = g[h_i^{\text{in}}(t) - J_r(\mu_i(t)) - \vartheta_i]\,\Delta t    (2.1)
and for \nu > 0

P(\mu_i(t + \Delta t) = \nu \mid \{\mu(t)\}) = \left\{ 1 - g[h_i^{\text{in}}(t) - J_r(\mu_i(t)) - \vartheta_i]\,\Delta t \right\} \delta_{\nu,\, \mu_i(t) + \Delta t},    (2.2)
where \Delta t is the discrete time step. Equation 2.1 gives the probability of neuron i to emit a spike at time t + \Delta t, given the refractory states \{\mu(t)\} of all the neurons in the network at time t. As explained below, the dependence of this probability on \mu_j for j \neq i is through the network feedback h_i^{\text{in}}(t). We define a continuous model as the limit \Delta t \to 0 of the discrete model. Introducing a probability density \mathcal{P}(\mu) by \mathcal{P}(\mu)\,d\mu = P(\mu) and a transition rate w(h) into the active state, the dynamical equations of the continuous model are in integrated form given by

\mathcal{P}(\mu_i(t) = 0) = \int_0^\infty d\{\mu(t)\}\; w[h_i^{\text{in}}(t) - J_r(\mu_i(t)) - \vartheta_i]\,\mathcal{P}(\{\mu(t)\})    (2.3)
and, for \mu > 0,

\frac{\partial \mathcal{P}(\mu_i(t) = \mu)}{\partial t} = -\frac{\partial \mathcal{P}(\mu_i(t) = \mu)}{\partial \mu} - \int_0^\infty d\{\mu(t)\}\; w[h_i^{\text{in}}(t) - J_r(\mu_i(t)) - \vartheta_i]\,\mathcal{P}(\{\mu(t)\}),    (2.4)
where d\{\mu(t)\} refers to integration over all neurons in the network. Notice that for the transition rate, we use the notation w in the continuous model rather than g. This is to emphasize the difference between these two functions: w(h) is the instantaneous transition rate, and g(h)\,\Delta t is the probability of transition averaged over a finite time window \Delta t. The two functions are related by the requirement that for constant h, the portion of neurons remaining inactive for the time step \Delta t should be the same in the discrete and the continuous model: \exp(-\Delta t\, w) = 1 - g\,\Delta t, yielding

w(h) = -\frac{1}{\Delta t} \ln\left[1 - g(h)\,\Delta t\right].    (2.5)
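Equation 2.5 is easy to verify numerically: by construction, the one-step survival probability is identical in the two descriptions, and for \Delta t \to 0 the rate w(h) approaches g(h). A minimal sketch (the function name `w_from_g` and the parameter values are ours, for illustration only):

```python
import math

def w_from_g(g_of_h, h, dt):
    """Instantaneous rate w(h) matching a discrete-time firing
    probability g(h)*dt over one step (equation 2.5)."""
    return -math.log(1.0 - g_of_h(h) * dt) / dt

g = lambda h: 0.5 * (1.0 + math.tanh(10.0 * h))  # eq. 2.10 with b = 10 (our choice)
dt = 0.001

w = w_from_g(g, 0.0, dt)
# By construction, the one-step survival probabilities agree:
#   exp(-dt * w(h)) == 1 - g(h) * dt
assert abs(math.exp(-dt * w) - (1.0 - g(0.0) * dt)) < 1e-12
# For dt -> 0, w(h) -> g(h), since -ln(1 - x) ~ x for small x:
assert abs(w_from_g(g, 0.0, 1e-8) - g(0.0)) < 1e-6
```
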
The postsynaptic potential h_i^{\text{in}}(t) consists of two parts:

h_i^{\text{in}}(t) = I_i^{\text{ext}}(t) + h_i^{\text{syn}}(t).    (2.6)
The external inputs, I_i^{\text{ext}}, describe the inputs from other parts of the brain. The second term, h_i^{\text{syn}}, describes the feedback input from the other neurons in the network and is given by

h_i^{\text{syn}}(t) = \sum_{j=1,\, j \neq i}^{N} J_{ij}\, \varepsilon(\mu_j(t)),    (2.7)
where J_{ij} are the synaptic efficacies, and the postsynaptic response function \varepsilon(\mu) \geq 0 describes the time course of the synaptic transmission (Gerstner & van Hemmen, 1992); it is chosen to satisfy \varepsilon(\mu) = 0 for \mu < 0 and \int_0^\infty d\mu\, \varepsilon(\mu) = 1, or \sum_\mu \varepsilon(\mu) = 1, respectively, for the continuous- and discrete-time model. Note that we have made the further simplification that only the most recent spike from each presynaptic neuron is taken into account for synaptic transmission; if a given presynaptic neuron emits more than one spike in a short time period, the influence of spikes other than the last one is neglected. With these definitions, we have a model in which the probability that neuron i emits an action potential depends on h_i and \mu_i, while h_i is given by I_i^{\text{ext}} and \{\mu\}, the refractory state of all neurons in the network. Thus, in the discrete-time model, one can construct the transition matrix P(\{\mu(t + \Delta t)\} \mid \{\mu(t)\}, \{I^{\text{ext}}(t)\}), which describes the probability of being in the state \{\mu(t + \Delta t)\} at time t + \Delta t, given that at time t, the system is in state \{\mu(t)\} and the inputs are given by \{I(t)\}. In the continuous-time model, one finds an expression for the evolution of \mathcal{P}(\{\mu(t)\} \mid \{I^{\text{ext}}(t)\}). (For details, see the appendix.) For the refractory function J_r, we choose

J_r(\mu) = \begin{cases} \infty & \text{for } \mu < \tau_{\text{abs}} \\ c/\mu & \text{for } \mu \geq \tau_{\text{abs}} \end{cases}    (2.8)
for dynamics in discrete time and

J_r(\mu) = \begin{cases} \infty & \text{for } \mu \leq \tau_{\text{abs}} \\ c/(\mu - \tau_{\text{abs}}) & \text{for } \mu > \tau_{\text{abs}} \end{cases}    (2.9)

for continuous time. The parameter c > 0 measures the strength of the refractory function. Using the respective dynamical equations, 2.1 and 2.2 for discrete time and 2.3 and 2.4 for continuous time, both models yield nearly identical results for the autocorrelation function and the interspike interval distribution, provided that the activity is not too high. During the absolute refractory period \tau_{\text{abs}}, the neuron cannot emit another spike. For \mu > \tau_{\text{abs}}, J_r(\mu) can be viewed as enhancing the effective threshold \vartheta of the neuron to become active again (Horn & Usher, 1989; Gerstner & van Hemmen, 1992) ("relative refractoriness"). The inclusion of refractory properties allows an interpretation of the active state \mu = 0 as the emission of a single spike, whereby the emission of a spike is described as a binary event in time without modeling the detailed time course of the neuron's membrane potential. Real neurons have an absolute refractory period of at least 1 millisecond. It is convenient to rescale time to this period. Accordingly, throughout this study, we will use \tau_{\text{abs}} = 1, corresponding to 1 ms. In the discrete-time model, we will
for simplicity also set the time step to \Delta t = 1. For the spike emission probability g, we will use

g(h) = \frac{1}{2}\left(1 + \tanh(b h)\right)    (2.10)

in the discrete model. Here, the noise parameter b measures the degree of stochasticity (b \to \infty leads to the deterministic case). For this choice of g and \Delta t, the transition rate w in the continuous model is, according to equation 2.5, given by

w(h) = -\ln\left(\frac{1}{2}\left(1 - \tanh(b h)\right)\right).    (2.11)

This rate leads to a linear dependence on h for large h and should be contrasted to the exponential rate w(h) = \exp(h)/\tau_0 suggested by Gerstner and van Hemmen (1992). For the time course, \varepsilon, of the synapses, we will, for simplicity, use \varepsilon(\mu) = \delta_{\mu,0} in the discrete-time model, and in the continuous-time model we will use the "alpha function" (Jack, Noble, & Tsien, 1975; Brown & Johnston, 1983),

\varepsilon(\mu) = \alpha^2 \mu\, e^{-\alpha\mu}\, H(\mu).    (2.12)
Here, \alpha is the synaptic rate constant, and H(\mu) is the Heaviside function, H(\mu) = 0 for \mu < 0 and H(\mu) = 1 for \mu \geq 0. Because in this study we want to concentrate on the effects of the refractoriness, we will keep the synaptic coupling as simple as possible. We therefore consider only networks with equal all-to-all coupling:

J_{ij} = \frac{J_0}{N-1}\left(1 - \delta_{i,j}\right).    (2.13)
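The alpha function of equation 2.12 integrates to 1, as required of \varepsilon(\mu), and peaks at \mu = 1/\alpha. A quick numerical check (the helper `alpha_fn` and the grid parameters are ours, not the authors'):

```python
import math

def alpha_fn(mu, a):
    """Alpha-function synaptic response, eq. 2.12: a^2 * mu * exp(-a*mu) for mu >= 0."""
    return a * a * mu * math.exp(-a * mu) if mu >= 0 else 0.0

a = 2.0
dmu = 1e-4
# Trapezoidal integral over [0, 40/a]; the tail beyond is negligible.
n = int(40.0 / a / dmu)
total = sum(alpha_fn(i * dmu, a) for i in range(1, n)) * dmu
total += 0.5 * (alpha_fn(0.0, a) + alpha_fn(n * dmu, a)) * dmu
assert abs(total - 1.0) < 1e-3   # normalization of eps(mu)
# The response peaks at mu = 1/a:
peak = max(range(1, n), key=lambda i: alpha_fn(i * dmu, a)) * dmu
assert abs(peak - 1.0 / a) < 1e-3
```
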
We also assume that the external input into all neurons is identical, I_i^{\text{ext}}(t) = I(t), and we assume that the threshold is identical for all cells, \vartheta_i = \vartheta. This homogeneity assumption leads to the fact that all neurons have the same mean firing rate and autocorrelations. The cross-correlations are also identical for each pair of neurons.

3 Single Renewal Neuron
Before we consider the properties of networks of renewal neurons, it is useful to summarize some basic properties of a single renewal neuron, following the analysis by Gerstner and van Hemmen (1992). The input of the neuron is assumed to be stationary and is given by

h_i^{\text{in}}(t) = I_i^{\text{ext}} = I    (3.1)

(in the rest of this section, the index i will be dropped).
In stochastic dynamics, the time between two consecutive spikes is characterized by the interspike interval distribution D(\tau). In continuous time, D(\tau) is given by

D(\tau) = \begin{cases} 0 & \text{for } 0 \leq \tau < \tau_{\text{abs}}, \\ w_\tau \exp\left(-\int_{\tau_{\text{abs}}}^{\tau} d\nu\, w_\nu\right) & \text{for } \tau \geq \tau_{\text{abs}} \end{cases}    (3.2)

with

w_\mu := w[I - J_r(\mu) - \vartheta].    (3.3)
The frequency of the neuron is given by the mean interspike interval \langle\tau\rangle (Gerstner & van Hemmen, 1992):

\nu(I) = \frac{1}{\langle\tau\rangle} = \frac{1}{\tau_{\text{abs}} + \int_{\tau_{\text{abs}}}^{\infty} d\mu\, \exp\left\{-\int_{\tau_{\text{abs}}}^{\mu} d\nu\, w_\nu(I)\right\}}.    (3.4)
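Equation 3.4 can be evaluated on a grid in \mu: accumulate the running integral of w_\nu, integrate the resulting survivor function, and invert. A sketch in continuous time, using J_r from equation 2.9 and w from equation 2.11; the parameter values and the small regularization of J_r at \mu = \tau_{\text{abs}} are our choices, not the authors':

```python
import math

def w_rate(h, b):
    """Transition rate, eq. 2.11: w(h) = -ln((1 - tanh(b*h)) / 2)."""
    return -math.log(0.5 * (1.0 - math.tanh(b * h)))

def gain(I, b=5.0, c=1.0, theta=0.1, tau_abs=1.0, dmu=0.01, mu_max=200.0):
    """Mean firing rate nu(I), eq. 3.4: 1 / (tau_abs + integral over mu of
    the survivor function exp(-int_{tau_abs}^{mu} w_nu dnu))."""
    survivor_integral = 0.0
    cum_w = 0.0   # running integral of w over [tau_abs, mu]
    mu = tau_abs
    while mu < mu_max:
        # J_r from eq. 2.9; dmu added to avoid the divergence at mu = tau_abs
        h = I - c / (mu - tau_abs + dmu) - theta
        cum_w += w_rate(h, b) * dmu
        survivor_integral += math.exp(-cum_w) * dmu
        mu += dmu
    return 1.0 / (tau_abs + survivor_integral)

nu_low = gain(0.05)
nu_high = gain(0.2)
assert 0.0 < nu_low < 1.0     # the rate is bounded by 1/tau_abs
assert nu_high > nu_low       # the gain function increases with the input
```
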
Some examples of gain functions \nu(I) are presented in Figure 1. In addition to the interspike interval distribution and the gain function, the firing pattern of the neuron is conveniently described by the autocorrelation function. The (reduced) autocorrelation A_i is defined as the joint probability that neuron i emits action potentials at times t and t + \tau, minus the uncorrelated expectation value:

A_i(t, t+\tau) = P(\mu_i(t) = 0,\ \mu_i(t+\tau) = 0) - P(\mu_i(t) = 0)\, P(\mu_i(t+\tau) = 0).    (3.5)
In the stationary states, renewal theory provides a simple relation between the interspike interval distribution and the autocorrelation:

A(\tau) = \int_0^{\tau} d\mu\, D(\tau - \mu)\, A(\mu) + P_0\, D(\tau) + P_0^2 \left[\int_0^{\tau} d\mu\, D(\mu) - 1\right].    (3.6)
Here we have used P_0 to denote P(\mu_i(t+\tau) = 0). Note that equation 3.6 describes only the bounded part of the autocorrelation. It is easily seen from definition 3.5 that A(\tau) diverges for \tau = 0; the central peak is given by P_0\, \delta(\tau). This peak is not included in equation 3.6. In Figure 2, the interspike interval distribution D(\tau) (A) and the autocorrelation A(\tau) (without central peak, B) are shown for various parameters c and b. For small c (weak refractoriness) and small b (high degree of stochasticity), D(\tau) has an asymmetric shape with a long tail. The autocorrelation A(\tau) increases monotonically after the minimum, which is caused by the refractory properties. For increasing c (i.e., stronger refractoriness) and increasing b (lower degree of stochasticity), D(\tau) becomes more and
Figure 1: Gain function (frequency) \nu(I) of a single renewal neuron with stationary input I. The lines labeled "determ." show results for deterministic dynamics, for the (a) discrete-time and (b) continuous-time model. The other lines are obtained using stochastic dynamics with noise parameter b. The line labeled "discr." shows a gain function in discrete time, for a refractory function J_r, equation 2.8, with c = 1.0, b = 10.3, and \vartheta = 0.1020. The other lines show gain functions in continuous time (J_r from equation 2.9, threshold \vartheta = 0.1). The intersection of the gain function \nu(I) with the identity I gives the fixed points in a network with synaptic efficacy J_0 = 1.0. For the example in discrete time, the fixed point P_0 = 0.05 is unstable (see Figure 4B). \nu is given in units of 1000 Hz; \tau_{\text{abs}} = 1.0.
more symmetric around the average interspike interval \langle\tau\rangle, and the autocorrelation becomes oscillatory with a damping constant that increases with increasing c and b. From renewal theory, similar equations can be derived for dynamics in discrete time. Choosing the refractory function, equation 2.8, and the transition probability, equation 2.10, in the discrete-time model (instead of equations 2.9 and 2.11, respectively, in the continuous model), the numerical results for the interspike interval distribution and the autocorrelation are almost identical, provided the activity is not too large (\leq 50 Hz).
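These single-neuron statistics are easy to reproduce by direct simulation of the discrete-time dynamics, equations 2.1 and 2.2: in refractory state \mu, the neuron fires in the next time step with probability g(I - J_r(\mu) - \vartheta) and is then reset, otherwise \mu grows by one. A sketch with illustrative parameter values (our choices):

```python
import math
import random

def simulate_isis(I=0.05, b=8.0, c=1.0, theta=0.1, tau_abs=1, n_spikes=5000, seed=0):
    """Simulate one discrete-time renewal neuron (eqs. 2.1, 2.2, with J_r
    from eq. 2.8) and collect interspike intervals."""
    rng = random.Random(seed)
    g = lambda h: 0.5 * (1.0 + math.tanh(b * h))              # eq. 2.10
    Jr = lambda mu: float('inf') if mu < tau_abs else c / mu  # eq. 2.8
    isis, mu = [], 0
    while len(isis) < n_spikes:
        fire = rng.random() < g(I - Jr(mu) - theta)  # spike in the next step?
        mu += 1                                      # one time step elapses
        if fire:
            isis.append(mu)                          # interval since the last spike
            mu = 0
    return isis

isis = simulate_isis()
assert min(isis) > 1           # no interval inside the absolute refractory period
mean_isi = sum(isis) / len(isis)
assert mean_isi > 2.0          # relative refractoriness stretches intervals further
```

A histogram of `isis` approximates D(\tau); at these (illustrative) parameters it shows the asymmetric, long-tailed shape described for small b in the text.
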
Figure 2: (A) Interspike interval distribution D(\tau) and (B) autocorrelation A(\tau) (without central peak) of a single renewal neuron with stationary input I, for various noise parameters b and various strengths c of the refractory function J_r. Dynamics either in discrete time with J_r from equation 2.8 or in continuous time with J_r from equation 2.9 give the same results. The activity P_0 is kept constant by adjusting the threshold \vartheta. P_0 = 0.05, corresponding to 50 Hz; I = 0.05, \tau_{\text{abs}} = 1.0.
Figure 2 demonstrates that the interspike interval distribution and especially the autocorrelation of a renewal neuron are qualitatively similar to experimental results. Thus, the renewal neuron appears to be a suitable model to describe a "regular" (nonadaptive, nonbursting) neuron in terms of the generation of single spikes (Gerstner & van Hemmen, 1992). Yet the model is simple enough to allow the calculation of the cross-correlations in stationary states, as will be shown in section 5. Therefore, it is a convenient model to study cross-correlations on the timescale of single spikes.

4 Mean-Field Theory
Before we consider the cross-correlations, we first summarize the mean properties of homogeneous networks of renewal neurons. The mean-field theory of such networks has been studied intensively (for a review, see Gerstner, 1995). The aim of this section is to introduce the notation used for calculating the cross-correlations (section 5) and to summarize some basic properties of the network that are important for an interpretation of the cross-correlations. In this section, we focus on dynamics in discrete time, equations 2.1 and 2.2; the model with continuous time behaves qualitatively similarly. For simplicity, we assume that the network receives no external input, I_i^{\text{ext}}(t) = 0, so that the postsynaptic potential h_i^{\text{in}}(t) satisfies

h_i^{\text{in}}(t) = \frac{J_0}{N-1} \sum_{j=1,\, j \neq i}^{N} \delta_{\mu_j(t),\, 0}.    (4.1)
In mean-field theory (N \to \infty), the postsynaptic potential can be replaced by its (noise) average \langle h_i^{\text{in}}(t) \rangle. In a homogeneous network, the homogeneous solution P_\mu(t) := P(\mu(t) = \mu) (independent of i) is a consistent solution of the network dynamics. Thus, h_i^{\text{in}}(t) can be written h_i^{\text{in}}(t) = J_0 P_0(t), and the network dynamics is given by \mathbf{P}(t+1) = T(t) \cdot \mathbf{P}(t), with

\mathbf{P}(t) := \begin{pmatrix} P_0(t) \\ P_1(t) \\ P_2(t) \\ P_3(t) \\ \vdots \end{pmatrix},    (4.2)

T(t) := \begin{pmatrix} g_0^t & g_1^t & g_2^t & \cdots \\ 1 - g_0^t & 0 & 0 & \cdots \\ 0 & 1 - g_1^t & 0 & \cdots \\ 0 & 0 & 1 - g_2^t & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}    (4.3)

and

g_\mu^t := g(J_0 P_0(t) - J_r(\mu) - \vartheta).    (4.4)
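Truncating the refractory state space at some finite \mu_{\max} (the last state simply retains its mass if no spike occurs), the mean-field map of equations 4.2 to 4.4 can be iterated directly. A sketch; the truncation and the parameter values (taken roughly from the Figure 5B regime) are our choices:

```python
import math

def meanfield_step(P, J0, b, c, theta, tau_abs=1):
    """One step of P(t+1) = T(t) . P(t), eqs. 4.2-4.4, with the state space
    truncated at mu_max = len(P) - 1 (the last state absorbs the overflow)."""
    g = lambda h: 0.5 * (1.0 + math.tanh(b * h))              # eq. 2.10
    Jr = lambda mu: float('inf') if mu < tau_abs else c / mu  # eq. 2.8
    gm = [g(J0 * P[0] - Jr(mu) - theta) for mu in range(len(P))]
    new = [0.0] * len(P)
    new[0] = sum(gm[mu] * P[mu] for mu in range(len(P)))  # spikes from every state
    for mu in range(len(P) - 1):
        new[mu + 1] += (1.0 - gm[mu]) * P[mu]             # age by one time step
    new[-1] += (1.0 - gm[-1]) * P[-1]                     # truncation: stay put
    return new

P = [1.0 / 200] * 200  # uniform initial distribution over 200 refractory states
for _ in range(2000):
    P = meanfield_step(P, J0=1.0, b=5.0, c=1.0, theta=0.24)
assert abs(sum(P) - 1.0) < 1e-9   # probability is conserved by T
assert 0.0 < P[0] < 1.0           # activity settles at an intermediate value
```
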
The fixed point \mathbf{P}^* = T \cdot \mathbf{P}^* is stable if all eigenvalues \lambda_k of the matrix

S := \begin{pmatrix} \sum_{\mu=0}^{\infty} \hat{g}_\mu P_\mu + g_0 & g_1 & g_2 & \cdots \\ -\hat{g}_0 P_0 + 1 - g_0 & 0 & 0 & \cdots \\ -\hat{g}_1 P_1 & 1 - g_1 & 0 & \cdots \\ -\hat{g}_2 P_2 & 0 & 1 - g_2 & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}    (4.5)

satisfy |\lambda_k| < 1, where

\hat{g}_\mu := J_0 \left. \frac{\partial g(h)}{\partial h} \right|_{h = J_0 P_0 - J_r(\mu) - \vartheta}.    (4.6)
Note that with g(h) from equation 2.10, \hat{g}_\mu contains an explicit factor b from the derivative \partial g(h)/\partial h. The dynamics, equation 4.2, has been studied intensively. Here, we restrict ourselves to summarizing some basic effects of the refractory function J_r on the existence and basic properties of stable, stationary states for dynamics in discrete time. For neurons without refractoriness (J_r(\mu) = 0 for all \mu) or with absolute refractoriness of one time step only (J_r(0) = \infty, J_r(\mu) = 0 for \mu \geq 1), the fixed-point structure is well known. There is either one stable fixed point P_0 or three fixed points P_0^l < P_0^m < P_0^h, where P_0^l and P_0^h are stable and P_0^m is unstable; stable oscillations do not exist. The dynamics can become more interesting for neurons with relative as well as absolute refractory properties. For suitable parameter values, even if there is only one fixed point, unstable or stable oscillations around this fixed point may show up. An example is shown in Figure 1, where the fixed point P_0 = 0.05 for c = 1.0 is unstable (see Figure 4). We now address the existence of stable stationary states in dependence on the noise parameter b. First, we consider the network behavior if the noise level b is varied, while the neuron parameters c and \vartheta and the synaptic efficacy J_0 are kept constant. Figure 3 shows the temporal evolution of the network activity P_0(t) in dependence on the noise parameter b for the same initial distribution. For low noise levels (b = 13.0), the network activity evolves into a stable, stationary state with "low" activity P_0. Since P_0 < \vartheta, the network activity is completely noise driven. Increasing the noise level (decreasing b) drives the network from a stationary state into stable oscillations (see Figure 3 for b = 9.8). This is due to the fact that increasing the amount of noise leads to an increase of the mean network activity P_0, overcompensating for the stabilizing effect of noise.
Even larger noise levels (e.g., b = 5.0) stabilize the network activity again, but at a much larger activity level. As in the case of low noise (b = 13.0), the network activity is highly stochastic, with the noise stabilizing the network activity, which is slightly above threshold. It can be concluded that the network is susceptible to stable oscillations especially in parameter regions where the average network activity is of the order of the threshold. In discrete dynamics, the network activity in a stable stationary state is highly stochastic: either the activity is completely noise driven ("low" activity states), or a high degree of noise stabilizes the network activity ("high" activity states). A second way to vary the noise level b is to keep the average network activity, P_0, constant by adjusting the threshold \vartheta of the neurons. This allows studying the dependence of the cross-correlations on the noise level for constant activity (see section 5). An example of the network behavior for
Figure 3: Activity P_0(t) as a function of time t in a homogeneous, infinitely large network of neurons with absolute and relative refractoriness, for the same initial distribution but different noise parameters b. Parallel, stochastic dynamics in discrete time; J_r from equation 2.8, with c = 1.0, J_0 = 1.0, \vartheta \approx 0.1032. The initial distribution is the same as that of Figure 4A. Compare with Usher, Schuster, and Niebur (1993).
different noise levels but constant activity P_0 is presented in Figure 4. In Figure 4A, the network asymptotically approaches the fixed point P_0 = 0.05. In Figure 4B, the noise level is reduced (b increased), with a simultaneous decrease of the threshold \vartheta in order to keep the fixed-point value at P_0 = 0.05. P_0 then becomes unstable, and the network is driven into stable oscillations. Thus, noise-maintained stationary states in homogeneous networks of neurons with sufficiently strong relative refractoriness become unstable against oscillations if the noise level and the threshold \vartheta are decreased, keeping the network activity P_0 constant. The critical point b_c depends on the strength c of the refractoriness. For weak refractoriness, the network becomes unstable against a fixed point with larger activity. To summarize, two important points have to be stressed. First, due to relative refractoriness, the fixed-point activity in an infinitely large, fully connected, homogeneous, excitatory network of renewal neurons may assume rather low and medium values even without noise (Gerstner & van Hemmen, 1992; Usher et al., 1993). Second, due to the existence of stable oscillations in discrete-time dynamics, the activity in stable fixed points is
Figure 4: Activity P_0(t) as a function of time t in a homogeneous, infinitely large network of renewal neurons with absolute and relative refractoriness. Parallel, stochastic dynamics in discrete time; J_r from equation 2.8, with c = 1.0. The noise parameter b and threshold \vartheta are chosen so that P_0 = 0.05 is a stationary solution; J_0 = 1.0. The initial distribution is the stationary distribution for the parameter values of A, slightly perturbed by setting P_\mu = 0 for \mu > 50 and normalizing P_{50}. Note the different scaling of the ordinate.
highly stochastic: either purely noise driven or stabilized by a large amount of noise.

5 Cross-Correlations

5.1 General Analysis. In mean-field theory, the dynamics of a homogeneous network of renewal neurons can be solved exactly, since in this case the synaptic input can be replaced by its average value. For a finite network, the synaptic input of a given neuron in general deviates from its mean, leading to correlations between the neurons. Ginzburg and Sompolinsky (1994) have shown that for a large network (N \gg 1) in an asynchronous state,^1 these correlations are dominated by pairwise correlations, for which closed equations can be derived. In the following, the analysis of Ginzburg and Sompolinsky (1994) is extended to a large but finite (N \gg 1) homogeneous network of renewal neurons in a stable, stationary state (the mathematical theory is sketched in the appendix). The cross-correlation

C_{ij}^{\nu v}(t, t+\tau) := P(\mu_i(t) = \nu,\ \mu_j(t+\tau) = v) - P(\mu_i(t) = \nu)\, P(\mu_j(t+\tau) = v)    (5.1)
is the joint probability for neuron i to be in refractory state \nu at time t and neuron j to be in refractory state v at time t + \tau, minus the uncorrelated expectation value for this configuration. The stationary value is defined by

C_{ij}^{\nu v}(\tau) := \lim_{t \to \infty} C_{ij}^{\nu v}(t, t+\tau)    (5.2)
and satisfies C_{ij}^{\nu v}(\tau) = C_{ji}^{v \nu}(-\tau). For a homogeneous network with identical couplings,

J_{ij} = \frac{J_0}{N-1}\left(1 - \delta_{i,j}\right),    (5.3)

the cross-correlations are identical for all neuron pairs i and j, C_{ij}^{\nu v} = C^{\nu v} for i \neq j, and the "spike-spike cross-correlation" C(\tau) := C^{00}(\tau) is symmetric with respect to \tau:

C(\tau) = C(-\tau).    (5.4)
The autocorrelation is defined accordingly (compare with equation 3.5):

A^{\nu v}(t, t+\tau) = A_i^{\nu v}(t, t+\tau) := C_{ii}^{\nu v}(t, t+\tau)    (5.5)

^1 An asynchronous state is defined by the condition that the cross-correlations vanish for N \to \infty.
and in a stationary state

A(\tau) := A^{00}(\tau) = A(-\tau).    (5.6)
In the appendix, it is shown, by expanding the postsynaptic potential h_i^{\text{in}}(t), equation 4.1, around its average value \langle h_i(t) \rangle = J_0 P_0(t) for large networks N \gg 1, that the correlations in a stable, stationary state are determined in leading order by the two-point correlations, equation 5.1, for which closed equations can be derived. They can be written in matrix form:

C(\tau + 1) = C(\tau) \cdot S^T + \frac{1}{N}\, A(\tau) \cdot \left( S^T - T^T \right),    (5.7)
where

C(\tau) := \left( C^{\nu v}(\tau) \right)_{\nu, v \geq 0} \quad \text{and} \quad A(\tau) := \left( A^{\nu v}(\tau) \right)_{\nu, v \geq 0}.    (5.8)
S and T are given in equations 4.5 and 4.3, respectively, and the superscript T denotes the transpose of a matrix. The initial value C(0) of the recursion, equation 5.7, is given by the solution of

C(0) = S \cdot \left[ C(0) + \frac{1}{N}\, A(0) \right] \cdot S^T - \frac{1}{N}\, T \cdot A(0) \cdot T^T.    (5.9)

Similar to Ginzburg and Sompolinsky (1994), the dynamics of the correlations is governed by fluctuations of the synaptic input of the "target neuron," described by the derivative \hat{g}_\mu, which is proportional to the synaptic efficacy J_0 (see equations 4.6 and 4.5). In addition, due to the dynamics of the refractory states \mu, the correlations depend directly on the transition probabilities g_\mu and 1 - g_\mu, as can be seen in the definition of the matrix S in equation 4.5. Since the "reference neuron" i also contributes to the synaptic input of the target neuron j, there is also a contribution of the autocorrelation A(\tau) to the dynamics of the cross-correlations, which is proportional to the derivatives of the transition probabilities (the term S^T - T^T in equation 5.7) and can be viewed as the "source" of the cross-correlations (Ginzburg & Sompolinsky, 1994). If there is no direct synaptic connection from neuron i to neuron j, the contribution of the autocorrelation to the cross-correlation C_{ij} vanishes. Because at the fixed point the input into a neuron is constant, up to fluctuations of order 1/N, the autocorrelations of the neurons are, to leading order, the same as those we calculated in section 3 for the single neuron, provided that the appropriate level of input is taken. The fluctuations add a term of order 1/N to the autocorrelation, but because these only contribute a term of order 1/N^2 to the cross-correlations, according to equation 5.7, we will not calculate this correction here.
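The recursion 5.7 is straightforward to iterate once S, T, and A(\tau) are known. The toy example below does not use the network's actual matrices; the small matrices and the constant stand-in for A(\tau) are purely illustrative. It shows one structural point made above: the autocorrelation drives C(\tau) solely through the difference S^T - T^T, so when that difference vanishes, the cross-correlation stays zero.

```python
def cc_step(C, A, S, T, N):
    """One step of eq. 5.7: C(tau+1) = C(tau).S^T + (1/N) A(tau).(S^T - T^T)."""
    n = len(C)
    matmul = lambda X, Y: [[sum(X[i][k] * Y[k][j] for k in range(n))
                            for j in range(n)] for i in range(n)]
    St = [[S[j][i] for j in range(n)] for i in range(n)]  # transpose of S
    Tt = [[T[j][i] for j in range(n)] for i in range(n)]  # transpose of T
    D = [[(St[i][j] - Tt[i][j]) / N for j in range(n)] for i in range(n)]
    t1, t2 = matmul(C, St), matmul(A, D)
    return [[t1[i][j] + t2[i][j] for j in range(n)] for i in range(n)]

# Toy 2x2 matrices, purely illustrative (not the model's S, T, A).
T = [[0.3, 0.5], [0.7, 0.5]]
S = [[0.35, 0.5], [0.6, 0.5]]   # differs from T in its first column
A = [[0.01, 0.0], [0.0, 0.01]]  # stand-in autocorrelation (constant in tau)
C = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):
    C = cc_step(C, A, S, T, N=1000)

# Without the source term (S = T), C remains exactly zero:
C0 = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):
    C0 = cc_step(C0, A, T, T, N=1000)
assert all(abs(x) < 1e-15 for row in C0 for x in row)
assert any(abs(x) > 0 for row in C for x in row)
```
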
Expanding the spike-spike cross-correlation C(\tau) = C^{00}(\tau) with respect to the normal modes C_0^k(\tau) := \sum_\nu C^{0\nu}(\tau) L_\nu^k (with L_\nu^k as the \nuth component of the kth left eigenvector of the matrix S), the asymptotic behavior of C(\tau) can be derived. Some general conclusions can be drawn from this analysis. First, in a stable fixed point, the eigenvalues \lambda_k of S satisfy |\lambda_k| < 1, and the cross-correlations vanish asymptotically: \lim_{\tau \to \infty} C(\tau) = 0. If the fixed point is unstable, the amplitude of C(\tau) diverges in the order 1/N, and the cross-correlations are of higher order (Ginzburg & Sompolinsky, 1994); thus, the analysis presented in this work is no longer applicable. Second, it can be seen that the cross-correlations in a stable fixed point are composed as a sum of monotonically decaying or damped oscillating normal modes, the source of which is given by the autocorrelation, convoluted and weighted by the eigenvalues of the stability matrix S. Note that, in contrast to a corresponding network of neurons without refractory properties, oscillating cross-correlations are possible even in a completely homogeneous network, due to the dynamics of the refractory states \mu. Third, near a bifurcation point, the normal mode belonging to the largest eigenvalue |\lambda_k| of S, which satisfies |\lambda_k| \to 1, dominates the long-term behavior of the cross-correlations. If \lambda_k = 1 - \epsilon is real, C(\tau) decays monotonically; if \lambda_k = 1 - \epsilon + i\omega is complex, C(\tau) oscillates with frequency \omega (compare Figure 6). Fourth, the longer the range of the refractory function J_r, the more complex the behavior of C(\tau). In the following, some examples of cross-correlations are discussed in more detail.

5.2 Cross-Correlations: Discrete Time. For neurons without any refractoriness, the cross-correlation C(\tau) in a homogeneous network with synaptic input, equation 4.1, decays monotonically after a peak at \tau = 1.
This peak is due to the direct synaptic connection between the two neurons and vanishes if this direct connection is removed. If an absolute refractory period of one time step is included, J_r(0) = \infty, J_r(\mu) = 0 for \mu \geq 1, C(\tau) may show damped oscillations of period 2 due to the absolute refractoriness of the presynaptic and the postsynaptic neuron. In addition, C(0) and C(\tau) may become negative. Close to the critical point, however, C(\tau) decays asymptotically monotonically. For renewal neurons with relative refractoriness, some examples of the spike-spike cross-correlation C(\tau) = C^{00}(\tau), multiplied by the number N of network neurons, are shown in Figure 5 for the refractory function J_r, equation 2.8, determined by numerical solution of equations 5.9 and 5.7. For reasons of comparison, we keep the network activity P_0 fixed; each time the parameter values c or b are changed, the threshold \vartheta is adjusted accordingly to keep P_0 constant. For numerical reasons (matrix inversion), we focus on network states of "medium" activity P_0 = 0.05, corresponding to 50 Hz (see also the appendix). Figure 5 shows that even for the simplest network architecture (completely homogeneous and fully connected) in a stable, stationary state,
Figure 5: Cross-correlation N \cdot C(\tau) in a homogeneous network of N neurons (N \gg 1) with absolute and relative refractoriness. The network is in a stable stationary state with activity P_0 = 0.05, corresponding to 50 Hz; the noise parameter b is varied (and the threshold \vartheta adjusted to keep P_0 constant). Dynamics in discrete time; J_r from equation 2.8, J_0 = 1.0. (A) "Medium" refractoriness c = 0.2; the threshold drops from \vartheta \approx 0.31 for b = 5.0 to \vartheta \approx 0.16 for b = 11.0. (B) "Strong" refractoriness c = 1.0; the threshold drops from \vartheta \approx 0.24 for b = 5.0 to \vartheta \approx 0.11 for b = 9.8.
the cross-correlations exhibit a large variety of qualitative and quantitative properties, depending on the strength c of the refractory function, the noise level b, and the activity P_0. For small c (Figure 5A), the cross-correlation has an asymmetric peak at \tau = 1, which is the result of the direct synaptic connection between the neuron pair; this peak vanishes if the direct connection is removed. The peak value C(1) increases with b. The angular shape of the peak C(1) is a consequence of the \delta-form of the postsynaptic response function, as will be shown in the next subsection. The equal-time cross-correlation C(0) first decreases and then increases with b. For low noise (large b), some minima and maxima with a period of about 1/P_0 and decreasing amplitude show up. The damping constant of these oscillations increases with b. Close to the critical point, C(\tau) decays exponentially ("weak refractoriness"; see also Figure 6). For large c (Figure 5B), C(\tau) in general exhibits damped oscillations, with the damping constant increasing with b. The first maximum may be located at \tau > 1, since the equal-time cross-correlations may be strongly negative. The peak value itself increases with b, whereas C(0) first decreases and then increases with b. Far below the critical point, the frequency \nu of the oscillations of C(\tau) is nearly constant and corresponds to the frequency of a single neuron, whereas near the critical point, the frequency \nu depends strongly on b. Approaching the critical point, the asymmetric peak for \tau > 0 becomes a symmetric peak at \tau = 0. The damping constant diverges, and C(\tau) oscillates with a frequency that strongly depends on the strength c of the refractory function. This is shown in Figure 6, where the frequency of C(\tau) at the critical point is plotted in dependence on the strength c of the refractory function; the frequency of a single neuron is indicated by the solid line.
Note that near the critical point, the cross-correlation C(τ) may oscillate even if—beyond the critical point—the network dynamics evolves not into stable oscillations but into a fixed point of larger activity. The dependence of C(τ) on the strength c of the refractory function for fixed noise level b is shown in Figure 7. Figures 7B and 7C show the corresponding autocorrelation A(τ) (without central peak) and interspike interval distribution D(τ), respectively, to first order. A(τ) and D(τ) correspond to the single-neuron results, as mentioned before. Near the critical point, the 1/N corrections from the cross-correlations become relevant and cannot be neglected anymore (Ginzburg & Sompolinsky, 1994). We argued in section 4 that the activity in a stable, stationary state is highly stochastic—either purely maintained by noise or stabilized by a high level of noise. In the examples presented in this section, since P_0 < ϑ, the activity is completely noise generated. This is reflected in the monotonic shape of A(τ) and the asymmetric form of D(τ). For a smaller activity P_0, the frequency and the amplitude of the cross-correlation become smaller, and the equal-time cross-correlation in general
Figure 6: Dependence of the frequency ν_b(c) of the time-delayed cross-correlation C(τ) at the critical point b_cr on the strength c of the refractory function J_r, equation 2.8. The network is in a stationary state with activity P_0 = 0.05 ≙ 50 Hz (threshold ϑ adjusted to keep P_0 constant). Dynamics in discrete time, J_0 = 1.0. The frequency of a single neuron is indicated by the solid line.
turns out to be positive. The qualitative dependence of the cross-correlation on c and b, however, is similar.

5.3 Cross-Correlations: Continuous Time. Within the discrete model analyzed in section 4 and the previous subsection, the absolute refractory period as well as the postsynaptic response function have been restricted to exactly one time step. This considerably simplifies the equations for the cross-correlations, enabling an effective calculation and analysis. The square form of the synaptic transmission we have assumed is, however, a poor description of real synaptic transmission, and it leads to the angular shape of the maximum of C(τ) (see Figures 5 and 7A). Obviously, within the discrete model, it is not possible to study the influence of the synaptic transmission on the short-time and long-time behavior of C(τ). For this purpose, we extend our analysis to the more general continuous model introduced in section 2. The dynamics of the continuous model is defined by equations 2.3 and 2.4 with h_i^in(t) from equation 2.7. For the postsynaptic response function ε(μ),
Figure 7: Cross-correlation N·C(τ) (A); autocorrelation A(τ), to leading order O(1), without central peak (B); and interspike interval distribution D(τ) (C) for various values of the refractory strength c (threshold ϑ adjusted to keep P_0 constant), in a homogeneous network of N ≫ 1 renewal neurons. The network is in a stable stationary state with P_0 = 0.05. Dynamics in discrete time, b = 10.0, J_0 = 1.0. The threshold drops from ϑ ≈ 0.19 for c = 0.0 to ϑ ≈ 0.14 for c = 0.6.
we choose the “alpha function,” equation 2.12. Due to the renewal assumption, only the most recent spike of each presynaptic neuron is taken into account; the contribution of spikes that have been emitted before is neglected. It is easily seen that this assumption is justified in asynchronous states with low activity. Introducing densities 𝒞(τ) and 𝒜(τ) for the (stationary) cross- and autocorrelations, respectively, the theory of the continuous model is developed in appendix A.3. It is shown that the cross-correlation in a homogeneous network is determined by a linear, inhomogeneous system of coupled integro-differential equations; see equations A.22 and A.25. A few remarks are in order. First, the postsynaptic response function ε(μ) directly enters into the equation for C(τ), even though the mean network activity P_0 (for low activity) does not depend on ε(μ). If the synaptic transmission is modeled to be instantaneous—similar to the discrete model, using a δ distribution—the cross-correlation C(τ) also shows a δ peak. Second, comparing the equations for C(0) and C(τ), it can be seen that the cross-correlation may be discontinuous at τ = 0 (see appendix A.3). From Figure 8, it may be inferred, however, that a continuous shape of J_r and ε(μ) smoothes this discontinuity. Figures 8 and 9A show some examples of N·C(τ) in a homogeneous network of N ≫ 1 renewal neurons with refractory function J_r from equation 2.9, calculated by numerical solution of equations A.22 and A.25, far below the critical point. The network activity is fixed to be P_0 = 0.02 ≙ 20 Hz. Figures 8 and 9A show that the short-time behavior of the cross-correlations—far below the critical point—is characterized by an asymmetric peak, the shape of which is strongly influenced by the postsynaptic response function ε(μ).
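Equation 2.12 itself is not reproduced in this excerpt, so the normalization below is an assumption; a common form of the alpha function is ε(μ) = (μ/a²)e^{−μ/a}, which integrates to 1 and peaks at the synaptic time constant a. A quick numerical check:

```python
import numpy as np

def alpha_kernel(mu, a):
    # Assumed alpha-function normalization: unit area, maximum at mu = a.
    return (mu / a**2) * np.exp(-mu / a)

a = 2.0
mu = np.linspace(0.0, 50.0 * a, 200001)
eps = alpha_kernel(mu, a)

step = mu[1] - mu[0]
area = step * (eps.sum() - 0.5 * (eps[0] + eps[-1]))  # trapezoid rule, ~1
peak = mu[np.argmax(eps)]                             # ~a
```

For a → 0 the kernel approaches a δ distribution, which recovers the angular peak of the discrete model; a finite a is what rounds the maximum of C(τ).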
For oscillating C(τ), even its long-time behavior—its frequency and damping constant—may depend on a (see Figure 8 for c = 0.4), if the strength c of the refractory function is not too large. Qualitatively, the dependence of the cross-correlation C(τ) on the strength c of the refractoriness and the noise level b is similar to the discrete model (see also section 5.4). Figure 9 presents an example where the network activity P_0 exceeds the threshold ϑ. Thus, the network activity is not maintained purely by noise, and a relatively low amount of noise is required to stabilize the fixed point. This is reflected in the oscillatory shape of the autocorrelation and the more or less symmetric interspike interval distribution (see Figures 9B and 9C), respectively. The fact that, in contrast to similar P_0 in the discrete model, the asynchronous state is nevertheless stable indicates that the network in continuous time is less susceptible to stable oscillations than the discrete model.

5.4 Comparison Between Discrete- and Continuous-Time Models. The examples in sections 5.2 and 5.3 show that the basic properties of the time-dependent cross-correlation are very similar in the discrete and the continuous model. This is true for the existence of an asymmetric peak, the
Figure 8: Cross-correlation N·C(τ) in a homogeneous network of N ≫ 1 renewal neurons in a stationary state with P_0 = 0.02 (threshold ϑ adjusted accordingly). The figure shows the dependence on the time constant a of the postsynaptic response function ε(μ), for two sets of parameters for the noise level b and refractory strength c. The equal-time cross-correlation C(0) is marked (see the legend). Dynamics in continuous time, τ_abs = 1.0, J_0 = 1.0.
monotonic or oscillating form of the time-delayed cross-correlation, and its dependence on the strength c of the refractory function and the noise level b. The most striking difference, the different form of the asymmetric peak, is a consequence of the different description of the synaptic transmission in the two models. Quantitatively, however, large differences can be observed between the models, as illustrated in Figure 10. First, in the continuous model, the cross-correlation is much more damped than in the model with discrete time. This may indicate that in continuous time, there is less of a tendency for stable oscillations, which may be the result of the larger time constant of the postsynaptic response function (see also Abbott & van Vreeswijk, 1993). The effect of the discretization of time on the stability of the stationary solutions has not been investigated yet. Second, in the continuous model, we generally observed the equal-time cross-correlations to be positive, whereas they may be strongly negative in the discrete model, influencing the location of the asymmetric peak (see Figure 10). Third, the amplitude of the extrema of
Figure 9: (A) Cross-correlation N·C(τ); (B) autocorrelation A(τ), to leading order O(1), without central peak; and (C) interspike interval distribution D(τ) in a homogeneous network of N ≫ 1 renewal neurons in a stationary state with P_0 = 0.02 (threshold ϑ adjusted accordingly). The figure shows the dependence on the time constant a of the postsynaptic response function ε(μ) and on the noise level b. (For low activity, ε(μ) does not enter into the first order of D(τ) and A(τ).) Dynamics in continuous time, J_r from equation 2.9, with c = 5.0, τ_abs = 1.0, J_0 = 1.0. The equal-time cross-correlation C(0) is marked (see the legend).
Figure 10: Cross-correlation N·C(τ) in a homogeneous network of N ≫ 1 renewal neurons in a stationary state with activity P_0 = 0.05 (threshold ϑ adjusted to keep P_0 constant). Comparison of results for dynamics in discrete time (labeled “discr.”) with refractory function J_r from equation 2.8 and dynamics in continuous time (labeled “cont.”), J_r from equation 2.9, with c = 1.0, b = 8.0, τ_abs = 1.0, J_0 = 1.0. The equal-time cross-correlations are marked with a symbol (see the legend).
the time-delayed cross-correlation is much larger in the discrete than in the continuous model. Fourth, from the continuous model with varying time constant of the postsynaptic response function, it is seen that even the long-time behavior of the time-delayed cross-correlation may be influenced by the time constant of the synaptic transmission (see Figure 8). Note also that for large activity, the autocorrelation is much more damped in the continuous model than in the discrete model (not shown). It is not yet clear to what extent these differences are due to the time constant of the postsynaptic response function ε(μ) only or are also an effect of the discretization of time. To summarize, the simple discrete model allows an efficient calculation and analysis of the qualitative properties of the cross-correlations for medium and large frequencies (about 50 Hz), including an analysis close to the critical point. The quantitative properties, especially of the short-time
behavior of the time-delayed cross-correlations, however, can be addressed only within the more complicated framework of a generalized model with a postsynaptic response function, such as the continuous model.

6 Summary and Discussion
We have analyzed the cross-correlations in the stable stationary state of a large but finite, homogeneous, fully connected, excitatory network of renewal neurons with stochastic dynamics, in both discrete and continuous time. Our work extends the analysis of Ginzburg and Sompolinsky (1994) to neurons with an absolute and relative refractory period. This allows for the interpretation of the cross-correlations on a timescale of single spikes. The model neuron used is based on a model proposed by Gerstner and van Hemmen (1992). We have shown that even for the simplest network architecture—homogeneous and fully connected—the cross-correlation of a pair of neurons can have a wide variety of qualitatively different features, even if the network is in a stable asynchronous state with low activity. Below the critical point, the cross-correlation exhibits an asymmetric peak resulting from the direct connection of the neuron pair. The detailed shape of the peak is strongly influenced by the synaptic transmission. Beyond the peak, the cross-correlation decays monotonically (for weak refractory properties) or shows damped oscillations (for stronger refractory properties), even if to leading order the autocorrelation does not oscillate. As the critical point is approached, the amplitude of the cross-correlations increases rapidly, and the damping time constant diverges. For strong refractory properties, the cross-correlations start to oscillate without damping, with a frequency that strongly depends on the strength of the refractory function. In this regime, the behavior and the magnitude of the cross-correlation are dominated by the network dynamics. In general, the details of the cross-correlation—its qualitative shape and its magnitude—depend crucially on the strength of the refractory properties, the noise level, the details of the synaptic transmission, the level of activity, and the closeness of the network dynamics to a bifurcation point.
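The asymmetric peak produced by a direct connection can be illustrated with a deliberately minimal toy model (two Bernoulli units, not the full network analyzed in the text): neuron 2's firing probability is transiently raised one time step after neuron 1 fires, and the resulting cross-correlogram peaks at τ = +1 rather than τ = 0. All parameter values below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)

T, p, boost = 200_000, 0.05, 0.03
s1 = rng.random(T) < p                       # presynaptic "neuron 1"
drive = np.zeros(T, dtype=bool)
drive[1:] = s1[:-1]                          # direct connection, delay 1 step
s2 = rng.random(T) < (p + boost * drive)     # postsynaptic "neuron 2"

def ccf(a, b, tau_max=5):
    # C(tau) = <a(t) b(t+tau)> - <a><b> for tau = -tau_max .. tau_max
    taus = np.arange(-tau_max, tau_max + 1)
    vals = []
    for tau in taus:
        if tau >= 0:
            vals.append((a[:T - tau] & b[tau:]).mean())
        else:
            vals.append((a[-tau:] & b[:T + tau]).mean())
    return taus, np.array(vals) - a.mean() * b.mean()

taus, C = ccf(s1, s2)
peak_lag = taus[np.argmax(C)]                # peak at the transmission delay
```

The sign of the peak lag identifies which neuron leads; removing the `drive` term flattens the correlogram, mirroring the observation that the peak vanishes when the direct connection is removed.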
In addition, the analysis shows that the refractory properties of all neurons—the target neuron, the reference neuron, and the other neurons in the network—influence the behavior of the cross-correlation. Because of this dependence on a multitude of factors, there is no simple relation between the cross-correlations and the neurons' refractory function or interspike interval distribution. Instead, any interpretation of the cross-correlations and their magnitude has to take all of the above-mentioned factors into account. The analysis was performed in a simple network with homogeneous all-to-all coupling, thus highlighting the effect of the single-neuron and network dynamics, as opposed to the network architecture, on the cross-correlations. The analytical techniques used here can be readily adapted to
describe correlations in networks with several populations, with different intrinsic dynamics or coupling strengths (Ginzburg & Sompolinsky, 1994). In addition, random dilution of the coupling matrix can also be dealt with in a straightforward manner. We have described a network in which the strength of the individual synapses scales as 1/N. This has the effect that the fluctuations in the inputs scale as 1/√N. Nevertheless, to obtain a variability in the spiking of the neurons that is biologically plausible, the spiking of the neurons is governed by a stochastic process, described by g(h) in the discrete and w(h) in the continuous model. This process can be viewed as describing either an intrinsic stochasticity of the spike-generating mechanism of the neuron or the effect of fluctuations in the input into the cells from outside the network. In the latter view, it should, however, be emphasized that these fluctuations are uncorrelated, both in time and between neurons. The cross-correlations studied here arise purely as a result of the network dynamics. Our analysis can be extended to include correlations in the input, provided that these are sufficiently weak, that is, the cumulants of the input statistics scale as N^{−k/2}. They will contribute an extra “source” term in the matrix equations 5.7 and 5.9. Still, the theory can deal only with cross-correlations that are of order 1/N. Only close to the bifurcation point, where the asynchronous state becomes unstable, do they grow to an appreciable value. It is still a matter of debate whether it is biologically plausible for the cross-correlations to be this weak. In experiments, one regularly obtains cross-correlations with peak values that are several percent of the products of their rates (Vaadia et al., 1995).
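The 1/√N scaling of the input fluctuations follows from elementary statistics and is easy to verify numerically. In the sketch below (an illustration, with independent Bernoulli spiking assumed), each of N presynaptic neurons fires with probability P_0 per time step through a synapse of strength J_0/N; quadrupling N should roughly halve the standard deviation of the summed input.

```python
import numpy as np

rng = np.random.default_rng(1)
P0, J0 = 0.05, 1.0

def input_std(N, trials=40_000):
    # Summed input per step: (J0 / N) * Binomial(N, P0); its standard
    # deviation is J0 * sqrt(P0 * (1 - P0) / N), i.e. of order 1/sqrt(N).
    counts = rng.binomial(N, P0, size=trials)
    return (J0 / N * counts).std()

ratio = input_std(100) / input_std(400)      # expected to be close to 2
```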
But cortical neurons receive input from thousands of other cells (Stevens, 1989), implying, in our theoretical model, that the cross-correlations should be of order 1/1000, unless the network operates very close to the bifurcation point. For the network to be close to the bifurcation point, the parameters have to be fine-tuned. It may be the case that being close to the bifurcation point has functional advantages, so that an argument could be made for a learning mechanism that puts the network near this point (Ginzburg & Sompolinsky, 1994). An alternative scenario that has recently received considerable attention is the possibility that the individual synapses are much stronger than of order 1/N, and yet the mean input is small because the total excitatory input is nearly balanced by the total inhibitory input. This scenario has recently been investigated by Amit and Brunel (1997a, 1997b) and van Vreeswijk and Sompolinsky (1996, 1998) in extremely diluted networks. In such networks, the cross-correlations can be reasonably large, even far from the bifurcation point, if there is a direct connection between the cells, due to the relatively strong synapses. The mean correlations are still of order 1/N, due to the sparsity of the connections. Experimental data suggest that the extremely small fraction of neuron pairs with appreciable cross-correlation is an unrealistic feature of these networks. Models with a connectivity that is less diluted but still with relatively strong synapses are needed to account
for this finding. Unfortunately, the analytical approach presented here cannot be used in such networks. This is because in networks with a coupling that does not scale as 1/N, the higher cumulants of the spike distribution are not negligible compared to the two-point cross-correlation. A theoretical framework to study this scenario is not currently available, so a comparison of this scenario with the model presented in our article close to the bifurcation point cannot be undertaken yet. Several other authors have recently considered the dynamics of networks of spiking neurons with a considerable degree of stochasticity (Gerstner, 2000; Spiridon & Gerstner, 1999; Brunel & Hakim, 1999). Gerstner (2000) and Spiridon and Gerstner (1999) have investigated the response of such networks to small variations in external input and characterized this response analytically. This was done by studying the mean network response to these input changes. Whereas the analysis of such an input change within mean-field theory is straightforward within our model, its effect on the cross-correlations cannot be calculated within our framework, since the assumption of an asynchronous state and “weak” cross-correlations is no longer justified. The effect of an input change on the cross-correlations is still an open issue. Nevertheless, one can relate the cross-correlations in a network that receives constant input to the mean network response to weakly varying input; the time course of both depends on the perturbation matrix S (see equation 4.5). Brunel and Hakim (1999) consider fast oscillations in a network of integrate-and-fire neurons receiving noisy external input. Their study is mainly concerned with sparsely connected networks in which the asynchronous state is unstable, whereas our focus is on fully connected networks in stable asynchronous states. Still, it is interesting to ask how their study relates to the model presented here.
However, the different network architecture and the different dynamical state, as well as the considerable difficulty of the calculations in both models, make such a comparison extremely involved. There are also indications that networks of deterministic integrate-and-fire neurons with stochastic inputs and models involving stochastically spiking neurons behave qualitatively differently. The rapid response to external input found for stochastic neurons (Gerstner, 2000) is absent for integrate-and-fire neurons with input noise (Brunel, Chance, Fourcaud, & Abbott, 2001). Hence, a detailed comparison between the results of Brunel and Hakim (1999) and the work presented here is beyond the scope of this article. The main goal of this work was the theoretical calculation and analysis of the cross-correlations in large networks. Thus, we imposed rather strong constraints on our model, such as the homogeneity assumption and full connectivity. It is an interesting question how our results apply to real networks when some of these constraints are lifted. Such an investigation, and a quantitative comparison to experimental results carried out by numerical simulations of realistic networks, is an interesting topic for future research, but beyond the scope of this article.
Appendix
In this appendix, we present the general theory for calculating the temporal cross-correlations in a large, fully connected, excitatory network (N ≫ 1) in a stable, stationary state. For simplicity, it is assumed that the network is homogeneous, ϑ_i = ϑ, and that there is no external input, I_i^ext(t) = 0.

A.1 Expansion Around the Mean-Field Results. In a large but finite network (N ≫ 1) in a stable, stationary state, it is expected that the network dynamics deviates only slightly from the mean-field results. For synaptic couplings, equation 2.13, and postsynaptic response function ε(μ) = δ_{μ,0}, the postsynaptic potential, equation 4.1, can be written as

$$ h_i^{\mathrm{in}}(t) = \frac{J_0}{N-1} \sum_{j \neq i} \delta_{\mu_j(t),0} = J_0 P_0(t) + \delta h_i^{\mathrm{in}}(t), \tag{A.1} $$

where

$$ \delta h_i^{\mathrm{in}}(t) = \frac{J_0}{N-1} \sum_{j \neq i} \bigl( \delta_{\mu_j(t),0} - P_0(t) \bigr). \tag{A.2} $$

In an asynchronous state, δ_{μ_j(t),0} is to leading order independent for different j, the mean of δ_{μ_j(t),0} − P_0(t) is 0, and its variance is of order 1. This implies δh_i^in(t) = O(1/√N), and we can expand the spike emission probability

$$ g\bigl[ h_i^{\mathrm{in}}(t) - J_r(\mu_i(t)) - \vartheta \bigr] = g_{\mu_i(t)}^t + \hat g_{\mu_i(t)}^t \, \frac{1}{N-1} \sum_{k \neq i} \bigl( \delta_{\mu_k(t),0} - P_0(t) \bigr) + \hat{\hat g}_{\mu_i(t)}^t \, \frac{1}{(N-1)^2} \sum_{k \neq i} \sum_{l \neq i} \bigl( \delta_{\mu_k(t),0} - P_0(t) \bigr) \bigl( \delta_{\mu_l(t),0} - P_0(t) \bigr), \tag{A.3} $$

with g_μ^t from equation 4.22,

$$ \hat g_\mu^t := J_0 \, \frac{\partial g(h)}{\partial h} \Big|_{h_\mu} \quad \text{and} \quad \hat{\hat g}_\mu^t := \frac{J_0^2}{2} \, \frac{\partial^2 g(h)}{\partial h^2} \Big|_{h_\mu}, \tag{A.4} $$

where h_μ = J_0 P_0(t) − J_r(μ) − ϑ. In addition, the joint probability distribution of the network states can be expanded, setting δ_{μ_i(t),μ} = P_i^μ(t) + (δ_{μ_i(t),μ} − P_i^μ(t)), and so on (cumulant expansion):

$$ P\bigl( \mu_i(t) = \nu, \ \mu_j(t') = \mu \bigr) = \bigl\langle \delta_{\mu_i(t),\nu} \, \delta_{\mu_j(t'),\mu} \bigr\rangle = P_i^\nu(t) \, P_j^\mu(t') + C_{ij}^{\nu\mu}(t,t') \tag{A.5} $$

$$ P\bigl( \mu_i(t) = \nu, \ \mu_j(t') = \mu, \ \mu_k(t'') = v \bigr) = \bigl\langle \delta_{\mu_i(t),\nu} \, \delta_{\mu_j(t'),\mu} \, \delta_{\mu_k(t''),v} \bigr\rangle = P_i^\nu(t) \, P\bigl( \mu_j(t') = \mu, \ \mu_k(t'') = v \bigr) + P_j^\mu(t') \, C_{ik}^{\nu v}(t,t'') + P_k^v(t'') \, C_{ij}^{\nu\mu}(t,t') + \Bigl\langle \bigl( \delta_{\mu_i(t),\nu} - P_i^\nu(t) \bigr) \bigl( \delta_{\mu_j(t'),\mu} - P_j^\mu(t') \bigr) \bigl( \delta_{\mu_k(t''),v} - P_k^v(t'') \bigr) \Bigr\rangle, \tag{A.6} $$

and so on for higher-order correlations. Here we have used P_i^μ(t) to denote P(μ_i(t) = μ). Using equation A.3, the cumulants can be evaluated. As mentioned above, if the asynchronous state is stable, one expects that the moments (1/(N−1)) Σ_{j≠i} (δ_{μ_j(t),μ} − P_j^μ(t)) become small in the large-N limit. Since these moments are of order 1/√N, the kth-order moment should be of order 1/N^{k/2} if all indices i, j, k, … are different, of order 1/N^{(k−1)/2} if one pair of indices is the same, and so on. If this ansatz is made, the joint distribution can be calculated to leading order in 1/N, since for the kth moment, only cumulants of equal or lower order contribute. Using this ansatz in equation A.3, it can then be shown that the kth cumulant is indeed of order 1/N^{k/2}, verifying the self-consistency of the ansatz. Thus, within this ansatz, higher-order terms in equation A.3, as well as the three-point cross-correlations C_{ijk}^{νμv} (last summand in equation A.6), can be neglected.
A.2 Discrete Model. We are particularly interested in the “spike-spike cross-correlation,” C(t,t′) = C^{00}(t,t′). However, because the probability of firing an action potential depends on the time since the last spike, μ, this depends on C^{μν} for all μ and ν. We therefore have to evaluate all of these. Because μ is reset to 0 after a spike is emitted, C^{μ0} and C^{μν} have to be treated separately. According to definition equation 5.1, the time-delayed cross-correlations C_{ij}^{ν0}(t, t′+1) (i ≠ j) can, for t′ − t > 0, be written as

$$ C_{ij}^{\nu 0}(t, t'+1) = \sum_{\{\mu(t')\}} P\bigl( \mu_j(t'+1) = 0 \mid \{\mu(t')\}, \mu_i(t) = \nu \bigr) \, P\bigl( \mu_i(t) = \nu, \{\mu(t')\} \bigr) - \sum_{\{\mu(t')\}} P\bigl( \mu_i(t) = \nu \bigr) \, P\bigl( \mu_j(t'+1) = 0 \mid \{\mu(t')\} \bigr) \, P\bigl( \{\mu(t')\} \bigr) = \sum_{\{\mu(t')\}} g_j\bigl[ \{\mu(t')\} \bigr] \Bigl[ P\bigl( \mu_i(t) = \nu, \{\mu(t')\} \bigr) - P\bigl( \mu_i(t) = \nu \bigr) \, P\bigl( \{\mu(t')\} \bigr) \Bigr], \tag{A.7} $$

where, in the last equality, equation 2.1 has been used. Note that g_j depends on all network states at time t′, but not on μ_i(t) = ν, since t < t′. For t′ = t, a similar equation can be written, where the sum runs over all network states (at time t′ = t) except μ_i(t), which is fixed to be ν. This eventually leads to the same result. In asynchronous states, the expansion A.3 can be used. Considering all neuron states (at time t′) that contribute to the right-hand side of equation A.3 separately (i.e., summing over μ = μ_j(t′), v = μ_k(t′), and so on) and integrating out the other neuron states (at time t′ > t), which are not explicitly involved in expansion A.3, one arrives at

$$ C_{ij}^{\nu 0}(t, t'+1) = \sum_{\mu=0}^{\infty} g_\mu^{t'} \Bigl[ P\bigl( \mu_i(t) = \nu, \ \mu_j(t') = \mu \bigr) - P\bigl( \mu_i(t) = \nu \bigr) P\bigl( \mu_j(t') = \mu \bigr) \Bigr] + \sum_{\mu,v=0}^{\infty} \hat g_\mu^{t'} \sum_{k \neq j} \frac{\delta_{v,0} - P_0(t')}{N-1} \Bigl[ P\bigl( \mu_i(t) = \nu, \ \mu_j(t') = \mu, \ \mu_k(t') = v \bigr) - P\bigl( \mu_i(t) = \nu \bigr) P\bigl( \mu_j(t') = \mu, \ \mu_k(t') = v \bigr) \Bigr]. \tag{A.8} $$

Here, the neuron state μ_i(t) = ν remains in any case since only neuron states at time t′ > t are involved in the summation (or, for t′ = t, the summation runs only over neuron states j ≠ i). Note that the term containing the second derivative ĝ̂^{t′}_{μ_j(t′)} does not contribute in leading order 1/N since it already describes pair correlations of order 1/N in the synaptic input of neuron j and leads to a term of order N^{−3/2} in equation A.8.

The first term in equation A.8 gives C_{ij}^{νμ}(t,t′). The second term is expanded using equation A.6. As explained in section A.1, in asynchronous states, the three-point correlation C_{ijk}^{νμv} (the last term in equation A.6) can be neglected because it contributes only a term of order N^{−3/2}. Since

$$ \sum_{v=0}^{\infty} \bigl( \delta_{v,0} - P_0(t) \bigr) P^v(t) = P_0(t) \Bigl( 1 - \sum_{v=0}^{\infty} P^v(t) \Bigr) = 0, \tag{A.9} $$

the term proportional to (1/(N−1)) Σ_{k≠j} P_k^v(t″) = P^v(t″) does not contribute. The only remaining term proportional to P_j^μ(t′) results for k ≠ i in the cross-correlation C_{ik}^{νv}(t,t′) and for k = i in the autocorrelation A_i^{νv}(t,t′). The result can be further simplified with the following identity. Since

$$ P\bigl( \mu_i(t) = \mu \bigr) = \sum_{\nu=0}^{\infty} P\bigl( \mu_i(t) = \mu, \ \mu_j(t') = \nu \bigr) = \sum_{\nu=0}^{\infty} \Bigl\{ P\bigl( \mu_i(t) = \mu \bigr) P\bigl( \mu_j(t') = \nu \bigr) + C_{ij}^{\mu\nu}(t,t') \Bigr\} = P\bigl( \mu_i(t) = \mu \bigr) + \sum_{\nu=0}^{\infty} C_{ij}^{\mu\nu}(t,t'), \tag{A.10} $$

and likewise P(μ_j(t′) = ν) = Σ_{μ=0}^∞ P(μ_i(t) = μ, μ_j(t′) = ν), the correlations C_{ij}^{μν} satisfy the equality

$$ \sum_{\nu=0}^{\infty} C_{ij}^{\mu\nu}(t,t') = \sum_{\mu=0}^{\infty} C_{ij}^{\mu\nu}(t,t') = 0. \tag{A.11} $$

Using this relation, it can be seen that the term proportional to P_0(t) does not contribute. Thus, one obtains in a homogeneous network in a stable, stationary state, with τ = t′ − t > 0,

$$ C^{\nu 0}(\tau + 1) = \sum_{\mu=0}^{\infty} g_\mu \, C^{\nu\mu}(\tau) + \Bigl( \sum_{\mu=0}^{\infty} \hat g_\mu P^\mu \Bigr) \Bigl[ C^{\nu 0}(\tau) + \frac{1}{N} A^{\nu 0}(\tau) \Bigr]. \tag{A.12} $$

Here, we have used that for t → ∞, C^{νμ}(t, t+τ) → C^{νμ}(τ), g_μ^t → g_μ, and ĝ_μ^t → ĝ_μ. For τ = 0, a similar derivation leads to the same result. For ν, v ≥ 0, C^{ν(v+1)}(τ+1) can be calculated analogously and obeys

$$ C^{\nu(v+1)}(\tau + 1) = (1 - g_v) \, C^{\nu v}(\tau) - \hat g_v P^v \Bigl[ C^{\nu 0}(\tau) + \frac{1}{N} A^{\nu 0}(\tau) \Bigr]. \tag{A.13} $$

Equations A.12 and A.13 can be written in matrix form, equation 5.7. This is an inhomogeneous matrix equation from which C^{νv}(τ) can be inferred for all ν, v, and τ > 0, if A^{νv}(τ) and C^{νv}(0) are given. The equal-time cross-correlations, C^{νv}(0), can be calculated in a fashion similar to the time-delayed cross-correlations (involving a product of equation A.3), and using

$$ C^{\nu v}(0) = C^{v\nu}(0), \tag{A.14} $$

one arrives at equation 5.9. Since the autocorrelations are of order 1, it is seen from equations 5.7 and 5.9 that C(τ) ∼ O(1/N). Finally, A^{v0}(τ) is derived as follows: A^{v0}(τ) = lim_{t→∞} A^{v0}(t, t+τ) is the excess probability that the neuron fires at times t − v and t + τ and does not fire between times t − v and t. Thus, A^{v0}(τ) is given by

$$ A^{v0}(\tau) = \frac{P^v}{P_0} \, A^{00}(v + \tau). \tag{A.15} $$

For a numerical analysis, it is convenient to introduce a cutoff L in the refractory function by J_r(μ) = 0 for all μ ≥ L, that is, g_μ = g := g[J_0 P_0 − ϑ] for all μ ≥ L. Then the fixed-point activity P_0 can be expressed in terms of the components P^μ for the first L components, 0 ≤ μ ≤ L − 1. It is straightforward to replace the infinite-dimensional equations 4.2, 5.7, and 5.9 by corresponding equations of finite dimension L, if T and S are
modified accordingly. Note that in order to keep the matrix dimension L small (for numerical reasons), the network activity has to be chosen large enough to avoid artifacts arising from the cutoff L; thus, in section 5.2, the cross-correlations are evaluated for an activity of P_0 = 0.05 ≙ 50 Hz and a cutoff L = 50.

A.3 Continuous Model. The equations for the model in continuous time are derived from a generalized discrete model with time step Δt and transition rate w, equation 2.5. Including the postsynaptic response function ε(μ), the postsynaptic potential is, according to equation 2.7, in a homogeneous network (equation 2.13) given by

$$ h_i^{\mathrm{syn}}(t) = \frac{J_0}{N-1} \sum_{j \neq i} \int_0^\infty d\mu \, \varepsilon(\mu) \, \delta\bigl( \mu_j(t) - \mu \bigr), \tag{A.16} $$

yielding for N → ∞

$$ \bigl\langle h_i^{\mathrm{syn}}(t) \bigr\rangle = J_0 \int_0^\infty d\mu \, \varepsilon(\mu) \, P_\mu(t) = J_0 \bar P(t) \tag{A.17} $$

with

$$ P_\mu(t) = \bigl\langle \delta\bigl( \mu_j(t) - \mu \bigr) \bigr\rangle \quad \text{and} \quad \bar P(t) = \int_0^\infty d\mu \, \varepsilon(\mu) \, P_\mu(t). \tag{A.18} $$

Expanding the transition rate w(h) in a stationary state around the average synaptic input J_0 P̄, similar to equation A.3, the following equation can be derived for the time-delayed cross-correlation (compare equation A.12):

$$ C_{ij}^{\nu 0}(t, t' + \Delta t) = \sum_{\mu=0}^{\infty} \Delta t \, w_\mu^{t'} \, C_{ij}^{\nu\mu}(t,t') + \Bigl( \sum_{\mu=0}^{\infty} \hat w_\mu^{t'} P^\mu(t') \Bigr) \Delta t \, \frac{1}{N} \sum_{\kappa=0}^{\infty} \Delta\kappa \, \varepsilon(\kappa) \Bigl[ \sum_{k \neq i} C_{ik}^{\nu\kappa}(t,t') + A_i^{\nu\kappa}(t,t') \Bigr], \tag{A.19} $$

with

$$ w_\mu^t := w\bigl[ J_0 \bar P(t) - J_r(\mu) - \vartheta \bigr] \quad \text{and} \quad \hat w_\mu^t := J_0 \, \frac{\partial w(h)}{\partial h} \Big|_{h = J_0 \bar P(t) - J_r(\mu) - \vartheta}. \tag{A.20} $$

Introducing densities 𝒞_{ij}^{νv}(t,t′) and 𝒜_i^{νv}(t,t′) for the cross-correlations and autocorrelations, respectively, by

$$ C_{ij}^{\nu v}(t,t') = \Delta\nu \, \Delta v \, \mathcal{C}_{ij}^{\nu v}(t,t'), \qquad A_i^{\nu v}(t,t') = \Delta\nu \, \Delta v \, \mathcal{A}_i^{\nu v}(t,t'), \tag{A.21} $$
and a probability density 𝒫_μ by P^μ = Δμ 𝒫_μ, we get for the time-delayed spike-spike cross-correlation 𝒞(τ) := 𝒞^{00}(τ) := lim_{t→∞} 𝒞^{00}(t, t+τ) in a stable stationary state,

$$ \mathcal{C}(\tau) = \int_0^\infty d\mu \, w_\mu \, \mathcal{C}^{0\mu}(\tau) + \Bigl( \int_0^\infty d\mu \, \hat w_\mu \, \mathcal{P}_\mu \Bigr) \int_0^\infty d\kappa \, \varepsilon(\kappa) \Bigl[ \mathcal{C}^{0\kappa}(\tau) + \frac{1}{N} \, \mathcal{A}^{0\kappa}(\tau) \Bigr], \tag{A.22} $$

where w_μ and ŵ_μ are the time-independent versions of equation A.20 (stationary P̄). 𝒜^{0κ}(τ) is calculated from

$$ \mathcal{A}^{0\kappa}(\tau) = \begin{cases} \dfrac{\mathcal{P}_\kappa}{P_0} \, \mathcal{A}^{00}(\tau - \kappa) & \text{for } \kappa < \tau \\[1ex] \mathcal{P}_\tau \, \delta(\kappa - \tau) - P_0 \, \mathcal{P}_\kappa & \text{for } \kappa \geq \tau. \end{cases} \tag{A.23} $$

To derive an equation for 𝒞_{ij}^{ν(μ+Δt)}(τ + Δt), we expand the function 𝒞_{ij}^{νμ}(τ),

$$ \mathcal{C}_{ij}^{\nu(\mu+\Delta t)}(\tau + \Delta t) = \mathcal{C}_{ij}^{\nu\mu}(\tau) + \Delta t \, \frac{\partial}{\partial\mu} \, \mathcal{C}_{ij}^{\nu\mu}(\tau) + \Delta t \, \frac{\partial}{\partial\tau} \, \mathcal{C}_{ij}^{\nu\mu}(\tau) + o(\Delta t), \tag{A.24} $$

which is continuous for μ > 0, τ > 0, and ν ≥ 0, and arrive at the integro-differential equation (τ, μ > 0):

$$ \frac{\partial}{\partial\tau} \, \mathcal{C}^{0\mu}(\tau) + \frac{\partial}{\partial\mu} \, \mathcal{C}^{0\mu}(\tau) = -w_\mu \, \mathcal{C}^{0\mu}(\tau) - \hat w_\mu \, \mathcal{P}_\mu \int_0^\infty d\kappa \, \varepsilon(\kappa) \Bigl[ \mathcal{C}^{0\kappa}(\tau) + \frac{1}{N} \, \mathcal{A}^{0\kappa}(\tau) \Bigr]. \tag{A.25} $$
The equal-time cross-correlations are given by

$$ \mathcal{C}^{00}(0) = \int_0^\infty d\nu \, w_\nu \int_0^\infty d\mu \, w_\mu \, \mathcal{C}^{\nu\mu}(0) + 2\tilde\sigma \int_0^\infty d\mu \, w_\mu \int_0^\infty dv \, \varepsilon(v) \Bigl[ \mathcal{C}^{v\mu}(0) + \frac{1}{N} \, \mathcal{A}^{v\mu}(0) \Bigr] + \tilde\sigma^2 \int_0^\infty dv \, \varepsilon(v) \int_0^\infty d\kappa \, \varepsilon(\kappa) \Bigl[ \mathcal{C}^{v\kappa}(0) + \frac{1}{N} \, \mathcal{A}^{v\kappa}(0) \Bigr] \tag{A.26} $$

$$ \mathcal{C}^{0\mu}(0) = \int_0^\infty d\nu \, w_\nu \, \mathcal{C}^{\nu\mu}(0) + \tilde\sigma \int_0^\infty dv \, \varepsilon(v) \Bigl[ \mathcal{C}^{v\mu}(0) + \frac{1}{N} \, \mathcal{A}^{v\mu}(0) \Bigr] \qquad (\mu > 0) \tag{A.27} $$

$$ \frac{\partial}{\partial\nu} \, \mathcal{C}^{\nu\mu}(0) + \frac{\partial}{\partial\mu} \, \mathcal{C}^{\nu\mu}(0) = -w_\nu \, \mathcal{C}^{\nu\mu}(0) - w_\mu \, \mathcal{C}^{\nu\mu}(0) - \hat w_\nu \, \mathcal{P}_\nu \int_0^\infty dv \, \varepsilon(v) \Bigl[ \mathcal{C}^{v\mu}(0) + \frac{1}{N} \, \mathcal{A}^{v\mu}(0) \Bigr] - \hat w_\mu \, \mathcal{P}_\mu \int_0^\infty dv \, \varepsilon(v) \Bigl[ \mathcal{C}^{\nu v}(0) + \frac{1}{N} \, \mathcal{A}^{\nu v}(0) \Bigr] \qquad (\nu, \mu > 0), \tag{A.28} $$

where

$$ \tilde\sigma := \int_0^\infty d\mu \, \hat w_\mu \, \mathcal{P}_\mu, \tag{A.29} $$

and 𝒜^{vν}(0), the excess probability density of μ_i(t) = v and μ_i(t) = ν, is given by

$$ \mathcal{A}^{v\nu}(0) = \mathcal{P}_v \, \delta(v - \nu) - \mathcal{P}_v \, \mathcal{P}_\nu. \tag{A.30} $$

Note that with 𝒞^{νμ}(0) = 𝒞^{μν}(0), it follows from equations A.22, A.27, and A.26, using equations A.23 and A.30, that

$$ \lim_{\mu \to 0} \mathcal{C}^{0\mu}(0) = \lim_{\tau \to 0} \mathcal{C}^{00}(\tau) = \mathcal{C}^{00}(0) + \frac{\tilde\sigma}{N} \int d\nu \, \bigl[ \varepsilon(0) - \varepsilon(\nu) \bigr] \, w_\nu \, \mathcal{P}_\nu \ \neq \ \mathcal{C}^{00}(0) \quad \text{in general.} \tag{A.31} $$

This is because 𝒞^{00}(0) is obtained from the probability that in a small time window dt, two neurons fire. It is due to the fact that if neuron i fires at time t, its input into other neurons changes instantaneously from J_0 ε(μ_i(t))/N to J_0 ε(0)/N. As a result, 𝒞^{00}(τ) is different from 𝒞^{00}(0) even in the limit τ → 0. Only when ε is continuous at t = 0, ε(0) = 0, and ε is always 0 when the spike occurs, ε(t) w_t = 0 for all t, do instantaneous jumps in the input not occur. In this case lim_{τ→0} 𝒞^{00}(τ) is indeed equal to 𝒞^{00}(0). In the limit dt → 0, the probability of two spikes occurring at the same time becomes vanishingly small, and this double process should not contribute to the cross-correlations at τ ≠ 0. Indeed, its contribution vanishes in the limit dt → 0 since it affects only a single point on the continuous-time axis.

Acknowledgments
We thank Haim Sompolinsky for many helpful discussions and ideas. Further thanks to David Hansel, Hagai Bergman, and Eilon Vaadia for stimulating discussions. The work of C. M. was supported by the Minerva Foundation and the graduate students’ foundation “Organisation and Dynamics of Neural Networks” at the University of Göttingen, Germany.
Received September 8, 2000; accepted May 22, 2001.
LETTER
Communicated by Misha Tsodyks
Measuring Information Spatial Densities

Michele Bezzi
[email protected].it Cognitive Neuroscience Sector, S.I.S.S.A, Trieste, Italy, and INFM sez. di Firenze, 2 I-50125 Firenze, Italy In e s Samengo
[email protected] Cognitive Neuroscience Sector, S.I.S.S.A, Trieste, Italy Stefan Leutgeb
[email protected] Program in Neuroscience and Department of Psychology, University of Utah, Salt Lake City, UT 84112, U.S.A. Sheri J. Mizumori
[email protected] Psychology Department, University of Washington, Seattle, WA 98195, U.S.A. A novel denition of the stimulus-specic information is presented, which is particularly useful when the stimuli constitute a continuous and metric set, as, for example, position in space. The approach allows one to build the spatial information distribution of a given neural response. The method is applied to the investigation of putative differences in the coding of position in hippocampus and lateral septum. 1 Introduction
Neural Computation 14, 405–420 (2001)
© 2001 Massachusetts Institute of Technology

Figure 1: Firing-rate distribution of four different neurons when a rat is exploring an eight-arm maze. In each case, the density of the color is proportional to the number of spikes per unit time, as a function of space. Since cells of different types have very dissimilar mean firing rates, each plot has been normalized. The absolute rates are shown in Table 1. Each cell provides an amount of information that is at least as large as ⟨I⟩ + s_I, where ⟨I⟩ is the mean information provided by the whole set of cells of that particular type, and s_I is its standard deviation (see Table 2 for the quantitative details).

It has long been known that many of the pyramidal cells in the rodent hippocampus selectively fire when the animal is in a particular location of its environment (O’Keefe & Dostrovsky, 1971). This phenomenon gave rise to the concept of place fields and place cells, that is, the association between a given cell and the particular region of space where it fires. Many computational models of hippocampal coding for space (Tsodyks, 1999) are based on the idea that the information provided by each place cell concerns whether the animal is or is not at a particular location—the place field. It is clear, however, that at least in principle, a given cell could provide an appreciable amount of information about the position of the animal without having a place field in the rigorous sense. For example, it could happen that a cell indiscriminately fired all over the environment except in one specific location,
where it remained silent. Such a coding would be particularly informative on those occasions when the neuron did not fire. More generally, a cell could fire throughout the whole environment but with a stable and reproducible distribution of firing rates strongly selective to the position. Place field coding would mean that such a firing distribution is a very specific one—one where the cell remains silent everywhere except inside the place field. In this sense, the idea of a place cell is to the coding of position what in other contexts has been referred to as a grandmother cell. Whether a given place cell behaves strictly as a grandmother cell depends on the size of the spatial bins. However, broadly speaking, place cells use a sparse code and respond to only a very small fraction of all possible locations. Figure 1 shows the firing-rate distributions of four different neurons recorded while a rat explores an eight-arm maze. Similar to previous reports (Zhou, Tamura, Kuriwaki, & Ono, 1999), all of these neurons provide an appreciable amount of information about the location of the animal. Table 1 summarizes the data corresponding to the figure.

Table 1: Data Corresponding to the Cells Whose Firing Density Is Shown in Figure 1.

Figure  Type of Neuron              ⟨r⟩ (spikes per second)  I/t (bits per second)
1a      Hippocampal pyramidal cell  0.1981                   0.7431
1b      Hippocampal interneuron     15.8995                  2.1941
1c      Lateral septum              2.2688                   1.2921
1d      Medial septum               10.4878                  0.4411

Note: ⟨r⟩ = mean firing rate of the cell, averaged throughout the maze. I/t = corrected information rate (see equations 2.7 and 3.1).

Figure 1a shows a typical place field; the cell fires only when the animal reaches the end point of the right arm. Figures 1b and 1c show two different distributed codes; the first corresponds to a hippocampal interneuron and the latter to a neuron in the lateral septum that selectively fires when the rat occupies the end points of the maze. Finally, Figure 1d shows a cell in the medial septum whose discharge corresponds to all locations within the environment, with a somewhat lower rate in the end points of the maze, especially in the upper arm. These examples show that there might be different coding schemes for position other than localized place fields. Here, we explore such coding strategies in both hippocampal pyramidal neurons and lateral septal cells of behaving rats. The lateral septum receives a massive projection from the hippocampus (Swanson, Sawchenko, & Cowan, 1981; Jakab & Leranth, 1995), which presumably provides information about spatial location. Our aim is to see whether different types of neurons use different codes to represent position. In the following section, the local information is defined, and its relation to previous similar quantities is established. By taking the limit of a very fine binning, this local information gives rise to a spatial information density that can be used to explore the coding strategy of a cell. In section 3, the spatial information distribution is calculated for actual recordings in rat hippocampal and lateral septal cells. In section 4 we make some concluding remarks.

2 Stimulus-Specific Informations
Our analysis is based on the calculation of the mutual information between the neural response of each single cell and the location of the animal,

$$I = \sum_j \sum_n P(n, x_j) \log_2 \left[ \frac{P(n, x_j)}{P(n)\, P(x_j)} \right], \eqno(2.1)$$

where $x_j$ is a small region of space and $n$ is the number of spikes fired in a given time window. $P(n, x_j)$ is the joint probability of finding the rat at $x_j$ while measuring response $n$ and can always be written as $P(n, x_j) = P(n \mid x_j)\, P(x_j)$. The a priori probability $P(x_j)$ is estimated from the ratio between the time spent at position $x_j$ and the total time. The probability of response $n$ reads

$$P(n) = \sum_j P(n, x_j). \eqno(2.2)$$
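Once the joint response-position histogram has been estimated, equation 2.1 is a direct sum over its entries. A minimal sketch in Python with NumPy (the 2-by-2 count table is hypothetical, not taken from the recordings):

```python
import numpy as np

def mutual_information(joint_counts):
    """Mutual information (bits) between response n (rows) and position x_j
    (columns), from a table of joint counts. Implements equation 2.1;
    cells with zero counts contribute nothing to the sum."""
    joint = joint_counts / joint_counts.sum()   # P(n, x_j)
    p_n = joint.sum(axis=1, keepdims=True)      # P(n), marginal over positions
    p_x = joint.sum(axis=0, keepdims=True)      # P(x_j), marginal over responses
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (p_n * p_x)[mask])).sum())

# Hypothetical table: rows are responses (silence / one spike), columns positions.
counts = np.array([[40.0, 10.0],
                   [10.0, 40.0]])
I = mutual_information(counts)   # about 0.278 bits
```

For a table whose rows and columns are independent, the same function returns 0, as it should.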
The mutual information $I$ measures the selectivity of the firing of the cell to the location of the animal. It quantifies how much can be learned about the position of the rat by looking at the response of the neuron. In contrast to other correlation measures, its numerical value does not depend on whether the cell fires only in a particular location or whether it remains silent there. It may happen, however, that the neural responses are highly selective to some very specific locations and not to others. It is clear that the quantity defined in equation 2.1 provides the total amount of information, averaged over all positions. The scope of this section is to characterize the detailed structure of the spatial locations where the cell is most informative. To do so, we build a spatial information map, that is, a way to quantify the amount of information provided by the cell about every single location $x_j$. This issue has been discussed in De Weese and Meister (1999), although in the context of more general stimuli rather than specifically position in space. Two definitions (among infinitely many) have been pointed out: the stimulus-specific surprise (which will be addressed here as the position-specific surprise),

$$I_1(x_j) = \sum_n P(n \mid x_j) \log_2 \left[ \frac{P(n \mid x_j)}{P(n)} \right], \eqno(2.3)$$

and the stimulus-specific information (here, position-specific information),

$$I_2(x_j) = -\sum_n P(n) \log_2 \left[ P(n) \right] + \sum_n P(n \mid x_j) \log_2 \left[ P(n \mid x_j) \right]. \eqno(2.4)$$

Both of these quantities, when averaged in $x_j$, give the total information (see equation 2.1),

$$\sum_j P(x_j)\, I_{1,2}(x_j) = I. \eqno(2.5)$$

However, neither of the two is, by itself, a proper mutual information. The stimulus-specific surprise, equation 2.3, is guaranteed to be positive but may not be additive, while the stimulus-specific information, equation 2.4, is additive but not always positive. Moreover, any weighted sum of $I_1$ and $I_2$ is also a valid estimator of the information to be associated to each location
(De Weese & Meister, 1999). However, in specific situations, these two local information estimators can be very different, which means that their weighted sum can lead to any possible result. Let us examine the behavior of $I$ and $I_{1,2}$ in the short time limit. We consider a time interval $t$ and a cell whose mean firing rate at position $x_j$ is $r(x_j)$. Therefore, if $t \ll 1/r(x_j)$, the cell will probably remain silent at $x_j$, seldom firing a spike. The short time approximation involves discarding any response consisting of two or more spikes. It does not mean that such events will not occur, but rather that the set of symbols considered as informative responses are the firing of a single spike, with probability $P(1 \mid x_j) \approx r(x_j)\, t$, and whatever other event, which we call response 0, with probability $P(0 \mid x_j) = 1 - P(1 \mid x_j)$. Therefore, as derived in Skaggs, McNaughton, Gothard, and Markus (1993) and Panzeri, Biella, Rolls, Skaggs, and Treves (1996),

$$I = t \sum_j P(x_j) \left\{ r(x_j) \log_2 \left[ \frac{r(x_j)}{\langle r \rangle} \right] + \frac{\langle r \rangle - r(x_j)}{\ln 2} \right\} + O(t^2), \eqno(2.6)$$

where

$$\langle r \rangle = \sum_j P(x_j)\, r(x_j). \eqno(2.7)$$
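Equations 2.6 and 2.7 can be evaluated from nothing more than a rate map and the occupancy probabilities. A sketch with an invented, place-field-like rate map over four equally visited bins:

```python
import numpy as np

def info_rate_bits_per_s(p_occ, rates):
    """Short-time information rate I/t of equation 2.6 (bits per second):
    sum_j P(x_j) { r(x_j) log2[r(x_j)/<r>] + (<r> - r(x_j)) / ln 2 },
    with <r> = sum_j P(x_j) r(x_j) as in equation 2.7."""
    p_occ = np.asarray(p_occ, dtype=float)
    rates = np.asarray(rates, dtype=float)
    mean_r = float((p_occ * rates).sum())              # <r>, equation 2.7
    with np.errstate(divide="ignore", invalid="ignore"):
        # bins where the cell is silent contribute only the silence term
        spike = np.where(rates > 0.0, rates * np.log2(rates / mean_r), 0.0)
    silence = (mean_r - rates) / np.log(2.0)
    return float((p_occ * (spike + silence)).sum())

p = np.full(4, 0.25)                     # equal occupancy of 4 bins
r = np.array([10.0, 0.0, 0.0, 0.0])      # fires only in its "place field"
rate_info = info_rate_bits_per_s(p, r)   # 5.0 bits/s, i.e., 2 bits per spike
```

The silence terms sum to zero by construction of ⟨r⟩, so this cell conveys ⟨r⟩ log2 4 = 5 bits per second, that is, log2 4 = 2 bits per spike.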
The short time limit of $I$ is much more easily evaluated from recorded data than the full equation 2.1, since it does not need the estimation of the conditional probabilities. Only the firing rates at each location are needed. The first term in the curly brackets comes from the firing of a spike in $x_j$, and the second describes the silent response. Similarly, $I_{1,2}$ tend to

$$I_1(x_j) = t \left\{ r(x_j) \log_2 \left[ \frac{r(x_j)}{\langle r \rangle} \right] + \frac{\langle r \rangle - r(x_j)}{\ln 2} \right\} + O(t^2), \eqno(2.8)$$

$$I_2(x_j) = t \left\{ \frac{\langle r \rangle - r(x_j)}{\ln 2} + r(x_j) \log_2 \left[ r(x_j)\, t \right] - \langle r \rangle \log_2 (\langle r \rangle\, t) \right\} + O(t^2). \eqno(2.9)$$

Equation 2.8 states that the stimulus-specific surprise also rises linearly as a function of $t$. Its first term comes when the cell fires a spike, and the second corresponds to the silent response. The stimulus-specific information, on the other hand, diverges. However, since for some stimuli $I_2(x)$ is negative and for some others it is positive, the average of them all is finite, as stated in equation 2.6. The infinitely large discrepancy between equations 2.8 and 2.9 shows that for small $t$, the choice of any one of these estimators is particularly meaningless. Although this procedure is usually referred to as a short time limit, the crucial step in deriving equations 2.6 through 2.9 is to partition the set of all
possible responses into two subsets: one containing a single response (the case $n = 1$) and the complementary one. The conditional probabilities for the occurrence of the distinguished response ($n = 1$) are taken proportional to a parameter $t$, which is supposed to be small. Such a procedure with the response variable inspires the exploration of an analogous partition in the set of locations.

2.1 A Spatial Information Density. To find a well-behaved measure of a location-specific information, we now introduce the local information $I^\ell(x_j)$, which quantifies how much can be learned from the responses about whether the animal is or is not in $x_j$. In other words, we partition the set of possible locations into two subsets: one containing position $x_j$ and the complementary set $\bar x_j = \{x \mid x \text{ does not belong to region } j\}$. Mathematically,

$$I^\ell(x_j) = \sum_{n=0}^{+\infty} P(n) \left\{ P(x_j \mid n) \log_2 \left[ \frac{P(x_j \mid n)}{P(x_j)} \right] + P(\bar x_j \mid n) \log_2 \left[ \frac{P(\bar x_j \mid n)}{P(\bar x_j)} \right] \right\}, \eqno(2.10)$$
where $P(n)$ is the probability of the cell firing $n$ spikes no matter where, $P(x_j)$ is the probability of visiting location $x_j$, $P(\bar x_j) = 1 - P(x_j)$ is the probability of not being in $x_j$, $P(x_j \mid n)$ is the conditional probability of being in $x_j$ when the cell fires $n$ spikes, and $P(\bar x_j \mid n)$ is the conditional probability of not being in $x_j$ while the cell fires $n$ spikes, which can be obtained from

$$P(x_j \mid n) + P(\bar x_j \mid n) = 1. \eqno(2.11)$$
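The two-subset partition behind equation 2.10 can be computed directly from the conditional response probabilities. The sketch below assumes a Bernoulli spike/no-spike response in a short window (the rates, window, and occupancies are invented); with many small bins, the result stays close to the occupancy-weighted position-specific surprise $P(x_j)\,I_1(x_j)$:

```python
import numpy as np

def local_information(p_cond, p_occ, j):
    """Local information I^l(x_j) of equation 2.10: the mutual information
    between the response n and the binary variable 'animal in x_j or not'.
    p_cond[n, k] = P(n | x_k); p_occ[k] = P(x_k)."""
    p_n = p_cond @ p_occ                       # P(n) = sum_k P(n|x_k) P(x_k)
    p_xj = p_occ[j]
    p_xj_given_n = p_cond[:, j] * p_xj / p_n   # Bayes: P(x_j | n)
    info = 0.0
    for p_in, p_loc in ((p_xj_given_n, p_xj), (1.0 - p_xj_given_n, 1.0 - p_xj)):
        # guard against log(0); zero-probability branches contribute nothing
        term = np.where(p_in > 0.0,
                        p_in * np.log2(np.maximum(p_in, 1e-300) / p_loc), 0.0)
        info += float((p_n * term).sum())
    return info

t = 0.001                      # hypothetical response window (s)
rates = np.full(100, 2.0)      # 100 bins at 2 spikes/s ...
rates[0] = 10.0                # ... except a more active bin 0
p_cond = np.vstack([1.0 - rates * t, rates * t])   # rows: n = 0, n = 1
p_occ = np.full(100, 0.01)     # uniform occupancy
I_loc = local_information(p_cond, p_occ, j=0)
```

With this fine binning, the agreement between $I^\ell(x_0)$ and $P(x_0)\,I_1(x_0)$ is at the percent level, consistent with the short time limit.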
Equation 2.10 defines a proper mutual information, in the sense that it is positive and additive. In contrast to the short time limit in the response set, now there is no preferred location to be separated out. That is why we calculate as many $I^\ell(x_j)$ as there are positions $x_j$. As $j$ changes, however, $I^\ell(x_j)$ refers to a different partition of the environment. This means that one should not average the various $I^\ell(x_j)$. In parallel to the short time limit, we now make the area $\Delta$ of region $x_j$ tend to zero. To do so, we assume that both $P(x_j)$ and $P(x_j \mid n)$ arise from a continuous spatial density $\rho$,

$$P(x_j \mid n) = \int_{\text{region } j} \rho(x \mid n)\, dx, \eqno(2.12)$$

$$P(x_j) = \int_{\text{region } j} \rho(x)\, dx, \eqno(2.13)$$
where

$$\rho(x) = \sum_{n=0}^{+\infty} P(n)\, \rho(x \mid n). \eqno(2.14)$$
For $\Delta$ sufficiently small,

$$P(x_j \mid n) \approx \Delta\, \rho(x \mid n), \qquad P(x_j) \approx \Delta\, \rho(x). \eqno(2.15)$$
The continuity of $\rho$ allows us to drop the subindex $j$. Expanding $I^\ell(x)$ in powers of $\Delta$, it may be seen that the first term is

$$I^\ell(x) = \Delta \sum_{n=0}^{+\infty} P(n) \left\{ \rho(x \mid n) \log_2 \left[ \frac{\rho(x \mid n)}{\rho(x)} \right] + \frac{\rho(x) - \rho(x \mid n)}{\ln 2} \right\} + O(\Delta^2). \eqno(2.16)$$
The local information is therefore proportional to $\Delta$, which means that in the limit $\Delta \to 0$, it becomes a differential. Equation 2.16 is completely analogous to equation 2.6. This behavior indicates that the density

$$i(x) = \frac{I^\ell(x)}{\Delta} \eqno(2.17)$$

approaches a well-defined limit when $\Delta \to 0$. As pointed out earlier, $I^\ell(x)$ is conceptually different from the full information $I$ defined in equation 2.1, and for finite $\Delta$, there is no meaning in summing together the $I^\ell(x_j)$ corresponding to different $j$. However, it is easy to verify that when $\Delta \to 0$, the integral of $i(x)$ throughout the whole space coincides with the full information $I$, when the latter is calculated in the limit of very fine binning. Therefore, $i(x)$ behaves as an information spatial density. Moreover, in contrast to $I_1(x)$ and $I_2(x)$, it derives from a properly defined mutual information. Equation 2.16 is the continuous version of 2.10. It should be noticed, however, that in practice, one cannot calculate $\rho(x \mid n)$ from experimental data for finite time bins. If $\Delta$ is sufficiently small and the animal moves around with a given velocity, it never remains within $x_j$ during the chosen time window. Nevertheless, there is no inconvenience in giving a theoretical definition of $\rho(x \mid n) = \lim_{\Delta \to 0} P(x_j \mid n)/\Delta$, imagining one could perform the experiment placing the animal in $x_j$ and confining it there throughout the whole time interval. In order to bridge the gap between the theoretical definition of $\rho(x \mid n)$ and the actual possibility of measuring the information spatial density with freely moving animals, we now take equation 2.16 and
calculate its short time limit. The result is

$$I^\ell(x) = t\, \Delta\, \rho(x) \left\{ r(x) \log_2 \left[ \frac{r(x)}{\langle r \rangle} \right] + \frac{\langle r \rangle - r(x)}{\ln 2} \right\} + O(\Delta^2) + O(t^2). \eqno(2.18)$$

This same expression is obtained if one starts with the full definition, equation 2.10, and first calculates the limit of $t \to 0$ and subsequently makes $\Delta \to 0$. Comparing equation 2.18 with 2.8, it is clear that

$$I^\ell(x) = P(x)\, I_1(x) + O(\Delta^2) + O(t^2). \eqno(2.19)$$
We therefore conclude that in the short time limit, the position-specific surprise coincides with the local information (multiplied by the probability of occupancy). This gives the position-specific surprise $I_1$ a different status from $I_2$ or any combination of the two; although in a general situation $I_1$ is not additive, when the number of stimuli is very large (or the binning very fine, if the stimuli are continuous), it coincides with a well-defined quantity—the local information. However, such a correspondence between $I^\ell(x)$ and $I_1(x)$ is valid only in the short time limit—or, more precisely, when computing the information provided by one spike.

3 Data Analysis
In this section, we evaluate $I^\ell(x_j)$ as a function of $x_j$ using electrophysiological data recorded from rodents performing a spatial task. First, the experimental procedure is explained, and then we show that different brain regions use different coding strategies in the representation of space.

3.1 Experiment Design. Nine young adult Long-Evans rats were tested during a forced-choice and a spatial working memory task. Both tasks were performed in an eight-arm radial maze, each arm containing a small amount of chocolate milk in its distal part. In addition, the proximal part of each arm could be selectively lowered, thereby forbidding the entrance to that particular arm (for details, see Leutgeb & Mizumori, 2000). In the forced-choice task, the animal was placed in the center of the maze, while the entrance to a single arm was available. The other seven arms were kept at a lower level. After the animal had entered the available arm and taken its food reward, a second arm was raised and the previous one lowered. The procedure was continued by allowing the animal to enter just one arm at a time, with no repetitions. The session ended when the rat had taken the milk of all eight arms. A pseudo-random sequence of arms was chosen for each trial. The beginning of the working memory task was identical to the forced-choice one, until the animal had entered the fourth arm in the sequence. At this point, all arms of the maze were raised, and the rat
could move freely. However, only the four new arms still contained the food reward. The session continued until the animal had taken all the available chocolate milk, or for a maximum of 16 choices. Reentries into previously visited arms of the maze were counted as working memory errors, since in principle, the animal should have kept in mind that the food had already been eaten in that arm. Septal and hippocampal cells were simultaneously recorded during both of the tasks (for recording details, see Leutgeb & Mizumori, 2000). A head stage held the field-effect transistor amplifiers for the recording electrodes as well as a diode system for keeping track of the animals’ positions. Single units were separated using on-line and off-line separation software. Units were then identified according to their anatomical location and the characteristics of the spikes. Hippocampal pyramidal cells and interneurons, as well as lateral and medial septal cells, were identified. Animals were tested in either standard illumination conditions or in darkness.

3.2 Results and Discussion. In order to compute $I^\ell(x_j)$ from the experimental data, both $r(x_j)$ and $P(x_j)$ are needed for each position. To compute $r(x_j)$, the total number of spikes fired in location $x_j$ is divided by the total time spent there. The a priori probability $P(x_j)$ of visiting the spatial bin $j$ was obtained as the ratio of the time spent in $x_j$ to the total duration
of the trial. The computation of mutual information typically introduces an upward bias due to limited sampling. Therefore, a correction has been applied in order to reduce this overestimation, as suggested in Panzeri and Treves (1996). In our case, the first-order correction for equations 2.6 and 2.10 can be derived analytically,

$$I_{\text{corr}} = I - \frac{t\,(N - 1)}{2 T \ln 2}, \eqno(3.1)$$

where $N$ is the number of positions $x_j$ in which the environment has been binned, $T$ is the total duration of the recording, and $t$ the time window used to measure the response. Throughout the article, when specifying experimental data, we always give the corrected value of $I$, although for simplicity of notation, we drop the subindex “corr.” Table 2 summarizes the overall statistics of our experimental data. The values of $I/t$ have been calculated in the short time limit, equation 2.6, and further subtracting the correction, equation 3.1. The proportionality between $I^\ell(x)$ and $\Delta$ (see equation 2.18) was based on the assumption that the conditional probabilities $P(x_j \mid n)$ emerged from a continuous density $\rho(x \mid n)$. In order to verify whether such a supposition actually holds, we evaluated $I^\ell(x_j)$ from our experimental data for different values of the area $\Delta$. In Figure 2 we show a spatial average of our results:

$$\langle I^\ell(x) \rangle = \frac{1}{N} \sum_j I^\ell(x_j), \eqno(3.2)$$
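The first-order bias correction of equation 3.1 is a one-line computation; the session parameters below are invented for illustration:

```python
import math

def corrected_information(i_raw, n_bins, total_time, window):
    """First-order limited-sampling correction of equation 3.1:
    I_corr = I - t (N - 1) / (2 T ln 2), with N spatial bins,
    recording length T (s), and response window t (s)."""
    return i_raw - window * (n_bins - 1) / (2.0 * total_time * math.log(2.0))

# Hypothetical session: 0.020 bits measured in a 50 ms window,
# 64 spatial bins, 1200 s of recording.
i_corr = corrected_information(0.020, n_bins=64, total_time=1200.0, window=0.050)
```

The subtracted bias grows linearly with the number of bins, which is why too fine a binning is avoided in the analysis.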
Table 2: Statistics Corresponding to the Whole Population of Recorded Neurons.

Type of Neuron  Number of Units  ⟨⟨r⟩⟩ (spikes per second)  s⟨⟨r⟩⟩ (spikes per second)  ⟨I⟩/t (bits per second)  s⟨I⟩/t (bits per second)
HP              114              0.99665                    1.2353                      0.43851                  0.39535
HI              21               15.228                     6.9618                      0.81994                  0.85995
LS              327              5.0732                     7.7119                      0.25274                  0.33575
MS              34               10.971                     12.177                      0.27802                  0.46498

Notes: ⟨⟨r⟩⟩ = mean firing rate averaged throughout the maze and over all cells. s⟨⟨r⟩⟩ = average quadratic deviation from the mean. ⟨I⟩/t = mean information rate averaged over cells. s⟨I⟩/t = mean quadratic deviation. HP = pyramidal cells in the hippocampus. HI = interneurons in the hippocampus. LS = units in the lateral septum. MS = units in the medial septum.
where $N$ is the total number of positions $j$ in which the maze has been binned. Clearly, $N = \text{total area}/\Delta$. The local information $I^\ell(x_j)$ has been evaluated in the short time limit. In all cases, the local information grows linearly with $\Delta$, as predicted by equation 2.16. We therefore build the local information maps for all the cells recorded. In other words, we calculate $i(x_j)$ for all the positions $x_j$. We have refrained from going into too fine a binning, however, in order to avoid limited-sampling problems. In Figure 3 we show the information maps corresponding to the firing distributions of Figure 1. The hippocampal pyramidal cell in Figure 3a is informative only at the same location where the cell fires. In this case, the intuition seems to be justified: the cell codes for a single position and does so by increasing its firing rate at that location. However, the other three cases show that the neuron may well provide information not only where it fires most but also where it fires least. In particular, the hippocampal interneuron in Figure 3b, which tends to fire all over the maze, is particularly informative in the upper-left and upper-right end points, where it remains almost silent. In Figures 3c and 3d, the cells provide information where there is both a high and a low firing rate. As a consequence, we conclude that if a cell has a distributed coding scheme (as opposed to a very local one, more typical of hippocampal place cells), the information map may well not coincide with the firing-rate one. In this sense, one should be wary of judging cells with a distributed firing pattern as noninformative. If such a pattern is stable and reproducible and covers a wide range of firing rates, the neuron may well be providing spatial information. Could a quantitative analysis of the coding strategies of hippocampal pyramidal cells and neurons in the lateral septum be given?
We have not considered hippocampal interneurons or medial septum cells since in these two cases, we do not have enough statistics to draw conclusions. In addition, on average, they are less informative (see Table 2).
Figure 2: Mean local information rate, defined in equation 3.2, as a function of the area $\Delta$ of each bin. (a) Three pyramidal cells in the hippocampus. (b) Three cells in the lateral septum. All cells carry an appreciable amount of information when compared to other cells of the same type. Both a and b contain one unit with a high firing rate, an intermediate one, and a low-frequency cell.
One possible measure of the degree of localization of the coding scheme is given by the correlation between the information maps and the firing-rate distributions (that is, between the graphs of Figures 1 and 3). We evaluate the Pearson correlation coefficient between the two maps,

$$C = \frac{\sum_j \left[ I^\ell(x_j) - \langle I^\ell(x_j) \rangle_j \right] \left[ r(x_j) - \langle r \rangle \right]}{\sqrt{\sum_j \left[ I^\ell(x_j) - \langle I^\ell(x_j) \rangle_j \right]^2}\; \sqrt{\sum_j \left[ r(x_j) - \langle r \rangle \right]^2}}, \eqno(3.3)$$
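Equation 3.3 is the Pearson coefficient between the two maps; a sketch with hypothetical maps illustrating the two extreme cases:

```python
import numpy as np

def map_correlation(info_map, rate_map):
    """Pearson correlation C of equation 3.3 between a local-information map
    and a firing-rate map, flattened over the N spatial bins."""
    di = info_map - info_map.mean()
    dr = rate_map - rate_map.mean()
    return float((di * dr).sum() / np.sqrt((di ** 2).sum() * (dr ** 2).sum()))

rate = np.array([8.0, 0.5, 0.5, 0.5, 0.5])   # invented place-field-like map
info_localized = 0.01 * rate                 # informative where it fires: C = +1
info_inverted = rate.max() - rate            # informative where silent:  C = -1
c_loc = map_correlation(info_localized, rate)
c_inv = map_correlation(info_inverted, rate)
```

A proportional deviation pattern gives C = +1 (localized, place-cell-like coding), while an anti-proportional one gives C = -1 (the cell is informative where it does not fire).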
Figure 3: Local information distributions corresponding to the firing densities of Figure 1. In each case, the density of the color is proportional to the number of bits per unit time, as a function of space.
where $\langle I^\ell(x_j) \rangle_j$ is the spatial average of the local information. Thus, $C$ is equal to 1 if $I^\ell(x_j) - \langle I^\ell(x_j) \rangle_j$ is proportional to $r(x_j) - \langle r \rangle$, and takes the value $-1$ if the proportionality factor is negative. Notice that there is one such $C$ for every single cell. In Figure 4a we show the frequency of the correlation $C$ for the 114 pyramidal cells recorded in the hippocampus, and in Figure 4b, the 297 units recorded in the lateral septum. Pyramidal cells tend to have larger values of the correlation $C$, indicating that they tend to provide information in the same locations where they fire most. In other words, the code in the hippocampus can be characterized as localized, as is well known, giving rise to place cells and place fields. In contrast, septal cells have a somewhat more symmetrical distribution around zero. If $C \approx 0$, then the cell gives as much information in those locations where it fires most as where it remains silent (or, more precisely, where it fires less than its average spontaneous rate). A negative value of $C$ indicates that the cell is most informative in the locations where it does not fire (Figures 1b and 3b provide an example from a hippocampal interneuron). As stated in Table 2, hippocampal pyramidal cells are more informative than lateral septal cells. The point we want to stress is that the lateral septum follows a different coding strategy: cells do not specialize in a particular region of space but rather respond with a complex, distributed firing pattern all over the place.
Figure 4: Frequency distribution of the correlation $C$ between the information and firing-rate spatial distributions for (a) 114 pyramidal cells in the hippocampus and (b) 327 neurons in the lateral septum.
4 Conclusion
The aim of this work was to characterize the way the information provided by neural activity distributes among the elements of a given set of stimuli. In our case, the stimuli were the different positions in which an animal can be located within its environment. We defined the local information I_ℓ(s): the information provided by the responses about whether the stimulus is or is not s. In other words, it is the mutual information between the responses and a reduced set of stimuli consisting of only two elements: stimulus s and its complement. In contrast to other quantities introduced previously, this is a well-defined mutual information, which can be employed in the short-time limit. In fact, other possible definitions have some drawbacks; for example, the position-specific surprise has the disadvantage of not being additive. From the theoretical point of view, it is therefore not a very sound candidate for quantifying the information to be associated with each stimulus. The position-specific information, on the other hand, may not be positive and diverges for t → 0, thus making its application to actual data quite cumbersome. In this article, we have studied the properties of I_ℓ in the particular situation where the stimuli arise from a continuous variable (such as position in space), which has an underlying metric. In this case, the binning that transforms the continuous variable (in our case, x) into a discrete set (x_j) may be chosen, in principle, at will. When working with real data, however, the size of the bins is determined by the amount of data, since the mean error in the calculation of the mutual information is linear in the number of bins (see equation 3.1). We have shown analytically that when the size of the bin goes to zero, the local information is proportional to the bin size. This means that I_ℓ(x)/Δ behaves as an information spatial density, in that it tends to a constant value when Δ → 0, and its integral over all of space coincides with the full information I. These two properties hold only in the limit Δ → 0, whereas the position-specific surprise and the position-specific information fulfill equation 2.5 for any size of the bins. We have also shown that in the short-time limit and for Δ → 0, the local information coincides with the position-specific surprise multiplied by the probability of occupancy.
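The defining construction, the mutual information between the responses and the binary partition {s, not-s}, can be sketched for discrete distributions as follows (a minimal illustration with hypothetical probability tables, not the authors' analysis code):

```python
import numpy as np

def local_information(p_r_given_s, p_s):
    """Local information I_l(s) for each stimulus s: the mutual information
    (in bits) between the responses and the binary partition {s, not-s}.
    p_r_given_s[s, r] = P(r|s); p_s[s] = P(s)."""
    p_r_given_s = np.asarray(p_r_given_s, dtype=float)
    p_s = np.asarray(p_s, dtype=float)
    p_r = p_s @ p_r_given_s                    # marginal response distribution
    I = []
    for s in range(len(p_s)):
        ps = p_s[s]
        pr_s = p_r_given_s[s]                  # P(r | s)
        pr_ns = (p_r - ps * pr_s) / (1.0 - ps) # P(r | not-s)
        total = 0.0
        for p_cond, p_stim in ((pr_s, ps), (pr_ns, 1.0 - ps)):
            nz = p_cond > 0
            total += p_stim * np.sum(p_cond[nz] * np.log2(p_cond[nz] / p_r[nz]))
        I.append(total)
    return np.array(I)
```

Because each I_ℓ(s) is a genuine mutual information, it is nonnegative, and it vanishes exactly when the responses are independent of the partition.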
This result may seem puzzling, since I_1 is known not to be additive, while additivity is guaranteed for I_ℓ. However, the additivity of I_ℓ is more restricted than the one desired for I_1. If the position-specific surprise were additive, I_1 would obey the relation

\[ I_1(x_{j_1}, x_{j_2}) = I_1(x_{j_1}) + I_1(x_{j_2} \mid x_{j_1}), \tag{4.1} \]
where I_1(x_{j_1}, x_{j_2}) is the information provided by the responses about two particular results of the measurement of the stimulus. As shown by De Weese and Meister (1999), I_1 does not follow equation 4.1. The local information, on the other hand, does fulfill the condition

\[ I_\ell(x_a, x_b) = I_\ell(x_a) + I_1(x_b \mid x_a), \tag{4.2} \]
where x_a and x_b may only be x_j or x̄_j, and I_ℓ(x_a, x_b) represents a true mutual information, between the set of responses and the two sets {x_j, x̄_j} (one set for each measurement). Additivity for I_1 requires additivity for any choice of x_{j_1} and x_{j_2} in equation 4.1, while the possible values of x_a and x_b are much more restricted in equation 4.2. One should therefore not mistrust the fact that I_1 does not obey a very demanding additivity condition, while I_ℓ fulfills a quite relaxed one. The local information, as defined here, allows the characterization of the spatial properties of the information conveyed by a given cell. By measuring the mutual information between a given neuronal response and the set of possible locations, one sees that there are cells (in both the hippocampus and the lateral septum) that provide an appreciable amount of information without actually having a place field. By means of the local information, it is possible to draw a spatial information density, which may, in these nontypical cases, differ significantly from the rate distribution. The local information can be calculated easily from experimental data, and it can be used to characterize the coding strategy of different cell types. In particular, we saw that hippocampal place cells tend to provide spatial information in the same places where they fire, whereas lateral septal neurons use a more distributed coding scheme. This result is in agreement with the different goals in the encoding of movement-related quantities in the two regions, as described recently in Bezzi, Leutgeb, Treves, and Mizumori (2000).

Acknowledgments
We thank Bill Bialek and Alessandro Treves for very useful discussions. This work was supported by the Human Frontier Science Program, Grant No. RG 01101998B.

References

Bezzi, M., Leutgeb, S., Treves, A., & Mizumori, S. J. Y. (2000). Information analysis of location-selective cells in hippocampus and lateral septum. Society for Neuroscience Abstracts, 26, 173.11.

De Weese, M. R., & Meister, M. (1999). How to measure the information gained from one symbol. Network, 10, 325–340.

Jakab, R., & Leranth, C. (1995). In G. Paxinos (Ed.), The rat nervous system (2nd ed., pp. 405–442). New York: Academic Press.

Leutgeb, S., & Mizumori, S. J. (2000). Temporal correlations between hippocampus and septum are controlled by environmental cues: Evidence from parallel recordings. Manuscript submitted for publication.

O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely moving rat. Brain Res., 34, 171–175.

Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network, 7, 365–370.

Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Skaggs, W. E., McNaughton, B. L., Gothard, K., & Markus, E. (1993). An information theoretic approach to deciphering the hippocampal code. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 4 (pp. 1030–1037). San Mateo, CA: Morgan Kaufmann.

Swanson, L. W., Sawchenko, P. E., & Cowan, W. M. (1981). Evidence for collateral projections by neurons in Ammon's horn, the dentate gyrus, and the subiculum: A multiple retrograde labeling study in the rat. J. Neurosci., 1, 548–559.

Tsodyks, M. (1999). Attractor neural network models of spatial maps in hippocampus. Hippocampus, 9, 481–489.

Zhou, T., Tamura, R., Kuriwaki, J., & Ono, T. (1999). Comparison of medial and lateral septum neuron activity during performance of spatial tasks in rat. Hippocampus, 9, 220–234.

Received January 31, 2001; accepted May 1, 2001.
LETTER

Communicated by Erkki Oja

Stochastic Trapping in a Solvable Model of On-Line Independent Component Analysis

Magnus Rattray
[email protected]
Computer Science Department, University of Manchester, Manchester M13 9PL, U.K.

Previous analytical studies of on-line independent component analysis (ICA) learning rules have focused on asymptotic stability and efficiency. In practice, the transient stages of learning are often more significant in determining the success of an algorithm. This is demonstrated here with an analysis of a Hebbian ICA algorithm, which can find a small number of nongaussian components given data composed of a linear mixture of independent source signals. An idealized data model is considered in which the sources comprise a number of nongaussian and gaussian sources, and a solution to the dynamics is obtained in the limit where the number of gaussian sources is infinite. Previous stability results are confirmed by expanding around optimal fixed points, where a closed-form solution to the learning dynamics is obtained. However, stochastic effects are shown to stabilize otherwise unstable suboptimal fixed points. Conditions required to destabilize one such fixed point are obtained for the case of a single nongaussian component, indicating that the initial learning rate η required to escape successfully is very low (η = O(N⁻²), where N is the data dimension), resulting in very slow learning typically requiring O(N³) iterations. Simulations confirm that this picture holds for a finite system.

1 Introduction
Independent component analysis (ICA) is a statistical modeling technique that has attracted a significant amount of research interest in recent years (for a review, see Hyvärinen, 1999). In ICA, the goal is to find a representation of data in terms of a combination of statistically independent variables. This technique has a number of useful applications, most notably blind source separation, feature extraction, and blind deconvolution. A large number of neural learning algorithms have been applied to this problem, as detailed in Hyvärinen (1999). Theoretical studies of on-line ICA algorithms have mainly focused on asymptotic stability and efficiency, using the established results of stochastic approximation theory. However, in practice, the transient stages of learning will often be more significant in determining the success of an algorithm. In this article, a Hebbian ICA algorithm is analyzed, and a solution to the learning dynamics is obtained in the limit of large data dimension.

Neural Computation 14, 421–435 (2001) © 2001 Massachusetts Institute of Technology

The analysis highlights the critical importance of the transient dynamics; in particular, an extremely low initial learning rate is found to be essential in order to avoid trapping in a suboptimal fixed point close to the initial conditions of the learning dynamics. This work focuses on the bigradient learning algorithm introduced by Wang and Karhunen (1996) and studied in the context of ICA by Hyvärinen and Oja (1998), where it was shown to have nice stability conditions. This algorithm can be used to extract a small number of independent components from high-dimensional data and is closely related to projection pursuit algorithms that detect interesting projections in high-dimensional data. The algorithm can be defined in on-line mode or can form the basis of a fixed-point batch algorithm, which has been found to improve computational efficiency (Hyvärinen & Oja, 1997). In this work, the dynamics of the on-line algorithm is studied. This may be the preferred mode when the model parameters are nonstationary or the data set is very large. Although the analysis is restricted to a stationary data model, the results are relevant to the nonstationary case, in which learning strategies are often designed to increase the learning rate when far from the optimum (Müller, Ziehe, Murata, & Amari, 1998). The results obtained here suggest that this strategy can lead to very poor performance. In order to gain insight into the learning dynamics, an idealized model is considered in which data are composed of a small number of nongaussian source signals linearly mixed in with a large number of gaussian signals. A solution to the dynamics is obtained in the limiting case where the number of gaussian signals is infinite.
In this limit, one can use techniques from statistical mechanics similar to those that have previously been applied to other on-line learning algorithms, including other unsupervised Hebbian learning algorithms such as Sanger's principal components analysis algorithm (Biehl & Schlösser, 1998). For the asymptotic dynamics close to an optimal solution, the stability conditions due to Hyvärinen and Oja (1998) are confirmed, and the eigensystem is obtained, which determines the asymptotic dynamics and the optimal learning rate decay. However, the dynamical equations also have suboptimal fixed points that are stable for any O(N⁻¹) learning rate, where N is the data dimension. Conditions required to destabilize one such fixed point are obtained in the case of a single nongaussian source, indicating that learning must be very slow initially in order to learn successfully. The analysis requires a careful treatment of fluctuations, which prove to be important even in the limit of large input dimension. Finally, the simulation results presented indicate that this phenomenon also persists in finite-sized systems.

2 Data Model
The linear data model is shown in Figure 1. In order to apply the Hebbian ICA algorithm, one should first sphere the data, that is, linearly transform
Figure 1: Linear mixing model generating the data. There are M nongaussian independent sources s, while the remaining N − M sources n are uncorrelated gaussian variables. There are N outputs x formed by multiplying the sources by the square nonsingular mixing matrix A ≡ [A_s A_n]. The outputs are linearly projected onto the K-dimensional vector y = Wᵀx. It is assumed that M ≪ N and K ≪ N.
the data so that they have an identity covariance matrix. This can be achieved by standard transformations in a batch setting, but for on-line learning, an adaptive sphering algorithm, such as the one introduced by Cardoso and Laheld (1996), could be used. To simplify matters, it is assumed here that the data have already been sphered. Without loss of generality, it can also be assumed that the sources each have unit variance. The sources are decomposed into M nongaussian and N − M gaussian components, respectively:

\[ p(s) = \prod_{i=1}^{M} p_i(s_i), \qquad p(n) = \prod_{i=M+1}^{N} \frac{e^{-n_i^2/2}}{\sqrt{2\pi}}. \tag{2.1} \]
To conform with the above model assumptions, the mixing matrix A must be unitary. The mixing matrix is decomposed into two rectangular matrices A_s and A_n associated with the nongaussian and gaussian components, respectively:

\[ x = A \begin{bmatrix} s \\ n \end{bmatrix} = [A_s \; A_n] \begin{bmatrix} s \\ n \end{bmatrix} = A_s s + A_n n. \tag{2.2} \]

The unitary nature of A results in the following constraints:

\[ [A_s \; A_n] \begin{bmatrix} A_s^T \\ A_n^T \end{bmatrix} = A_s A_s^T + A_n A_n^T = I, \tag{2.3} \]

\[ \begin{bmatrix} A_s^T \\ A_n^T \end{bmatrix} [A_s \; A_n] = \begin{bmatrix} A_s^T A_s & A_s^T A_n \\ A_n^T A_s & A_n^T A_n \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}. \tag{2.4} \]
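The data model above can be sketched numerically (a hypothetical setup with a random orthogonal A and unit-variance uniform sources for the nongaussian part; the helper names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mixing(N, M):
    """Random orthogonal mixing matrix, split into A_s (N x M, nongaussian
    channels) and A_n (N x (N - M), gaussian channels)."""
    A = np.linalg.qr(rng.standard_normal((N, N)))[0]
    return A[:, :M], A[:, M:]

def sample_data(A_s, A_n, T):
    """Draw T samples of x = A_s s + A_n n (equation 2.2) with
    unit-variance uniform (subgaussian) and gaussian sources."""
    M, G = A_s.shape[1], A_n.shape[1]
    s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(M, T))  # variance 1
    n = rng.standard_normal((G, T))
    return A_s @ s + A_n @ n, s

A_s, A_n = make_mixing(100, 2)
x, s = sample_data(A_s, A_n, 5000)
```

Because A is orthogonal and the sources have unit variance, x is automatically sphered: its covariance is the identity, consistent with equations 2.3 and 2.4.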
3 Algorithm

The following on-line Hebbian learning rule was introduced by Wang and Karhunen (1996) and analyzed in the context of ICA by Hyvärinen and Oja (1998):

\[ W^{t+1} - W^t = \eta \Sigma x^t \phi(y^t)^T + \alpha W^t \left( I - (W^t)^T W^t \right), \tag{3.1} \]

where φ(yᵗ)ᵢ = φ(yᵢᵗ) is an odd nonlinear function applied to every component of the K-dimensional vector y ≡ Wᵀx. The first term on the right is derived by maximizing the nongaussianity of the projections in each component of the vector y. The second term ensures that the columns of W are close to orthogonal, so that the projections are uncorrelated and of unit variance. The learning rate η is a positive scalar parameter, which must be chosen with care and may depend on time. The parameter α is less critical; setting α = 0.5 seems to provide reasonable performance in general. The diagonal matrix Σ has elements σ_ii ∈ {−1, 1}, which ensure the stability of the desired fixed point. The elements of this matrix can be chosen either adaptively or according to a priori knowledge about the source statistics. Stability of the optimal fixed point is ensured by the condition (Hyvärinen & Oja, 1998)

\[ \sigma_{ii} = \mathrm{Sign}\left( \langle s_i \phi(s_i) - \phi'(s_i) \rangle \right), \tag{3.2} \]
assuming we order indices such that yᵢ → ±sᵢ for i ≤ min(K, M) asymptotically. The angled brackets denote an average over the source distribution. A remarkable feature of equation 3.2 is that the same nonlinearity can be used for source signals with very different characteristics. For example, both subgaussian and supergaussian signals can be separated using either φ(y) = y³ or φ(y) = tanh(y), two common choices of nonlinearity.

4 Dynamics for Large Input Dimension
Define the following two matrices:

\[ R \equiv W^T A_s, \qquad Q \equiv W^T W. \tag{4.1} \]

Using the constraint in equation 2.3, one can show that

\[ y = W^T (A_s s + A_n n) = R s + z, \qquad \text{where } z \sim \mathcal{N}(0, \, Q - R R^T). \tag{4.2} \]
Knowledge of the matrices R and Q is therefore sufficient to describe the relationships between the projections y and the sources s in full. Although the dimension of the data is N, the dimensions of these matrices are K × M and K × K, respectively. The system can therefore be described by a small number of macroscopic quantities in the limit of large N, as long as K and M remain manageable. In appendix A, it is shown that in the limit N → ∞, Q → I, while R evolves deterministically according to the following first-order differential equation,

\[ \frac{dR}{d\tau} = \mu \Sigma \left( \langle \phi(y) s^T \rangle - \tfrac{1}{2} \langle \phi(y) y^T + y \phi(y)^T \rangle R \right) - \tfrac{1}{2} \mu^2 \langle \phi(y) \phi(y)^T \rangle R, \tag{4.3} \]

with rescaled variables τ ≡ t/N and μ ≡ Nη. This deterministic equation is valid only for R = O(1); a different scaling is considered in section 4.2, in which case fluctuations have to be considered even in the limit. The brackets denote expectations with respect to the source distribution and z ∼ N(0, I − RRᵀ). The bracketed terms therefore depend only on R and statistics of the source distributions, so that equation 4.3 forms a closed system.

4.1 Optimal Asymptotics. The desired solution is one where as many as possible of the K projections mirror one of the M sources. If K < M, then not all the sources can be learned, and which ones are learned depends on details of the initial conditions. If K ≥ M, then which projections mirror each source also depends on the initial conditions. For K > M, there will be projections that do not mirror any sources; these will be statistically independent of the sources and have a gaussian distribution with identity covariance matrix. Consider the case where yᵢ → sᵢ for i = 1, …, min(K, M) asymptotically (all other solutions can be obtained by a trivial permutation of indices or changes in sign). The optimal solution is then given by R*ᵢⱼ = δᵢⱼ, which is a fixed point of equation 4.3 as μ → 0. Asymptotically, the learning rate should be annealed in order to approach this fixed point, and the usual inverse-law annealing can be shown to be optimal subject to a good choice of prefactor,

\[ \mu \sim \frac{\mu_0}{\tau} \;\text{ as }\; \tau \to \infty, \qquad \text{or equivalently} \qquad \eta \sim \frac{\mu_0}{t} \;\text{ as }\; t \to \infty. \tag{4.4} \]
The asymptotic solution to equation 4.3 with the above annealing schedule was given by Leen, Schottky, and Saad (1998). Let u_ij = R_ij − R*_ij be the deviation from the fixed point. Expanding equation 4.3 around the fixed point, one obtains

\[ u_{ij}(\tau) \sim \sum_{k,n=1}^{K} \sum_{l,m=1}^{M} V_{ijkl} \left\{ -\tfrac{1}{2} X_{kl} \langle \phi^2(s_n) \rangle \delta_{nm} + \left( \frac{\tau_0}{\tau} \right)^{-\mu_0 \lambda_{kl}} u_{nm}(\tau_0) \right\} V^{-1}_{klnm}, \tag{4.5} \]

where

\[ X_{ij} = \left( \frac{\mu_0^2}{-\mu_0 \lambda_{ij} - 1} \right) \left[ \frac{1}{\tau} - \frac{1}{\tau_0} \left( \frac{\tau_0}{\tau} \right)^{-\mu_0 \lambda_{ij}} \right]. \tag{4.6} \]

Here, τ₀ is the time at which annealing begins, and λ_ij and V_ijkl are the eigenvalues and eigenvectors of the Jacobian of the learning equation to first order in μ. These are written as matrices and tensors, respectively, because the system's variables are in a matrix. One can think of pairs of indices (i, j) and (k, l) as each representing a single index in a vectorized system. The explicit solution to the eigensystem is given in appendix B. From the eigenvalues defined in equation B.4, it is clear that the fixed point is stable if and only if the condition in equation 3.2 is met. There is a critical learning rate, μ₀ᶜʳⁱᵗ = −1/λ_max, where λ_max is the largest eigenvalue (smallest in magnitude, since the eigenvalues are negative), such that if μ₀ < μ₀ᶜʳⁱᵗ, then the approach to the fixed point will be slower than the optimal 1/τ decay. From the eigenvalues given in equation B.4, we find that λ_max = −ξ_min, where ξᵢ = σ_ii⟨sᵢφ(sᵢ) − φ'(sᵢ)⟩, so that μ₀ᶜʳⁱᵗ = 1/ξ_min. As long as μ₀ > μ₀ᶜʳⁱᵗ, the terms involving τ₀ in equations 4.5 and 4.6 will be negligible, and the asymptotic decay will be independent of the initial conditions and transient dynamics. Assuming μ₀ > μ₀ᶜʳⁱᵗ and substituting in the explicit expressions for the eigensystem, we find the following simple form for the asymptotic dynamics to leading order in 1/τ:

\[ u_{ij}(\tau) \sim -\frac{\delta_{ij}}{\tau} \left( \frac{\mu_0^2 \langle \phi^2(s_i) \rangle}{4 \mu_0 \xi_i - 2} \right). \tag{4.7} \]
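For a concrete instance (assuming the cubic nonlinearity and unit-variance uniform sources used later in section 5, so that ⟨s⁴⟩ = 9/5 and ⟨φ'(s)⟩ = 3⟨s²⟩ = 3), ξᵢ and the critical annealing prefactor μ₀ᶜʳⁱᵗ = 1/ξ_min can be estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), 1_000_000)  # unit-variance uniform source

phi = lambda u: u**3            # cubic nonlinearity
dphi = lambda u: 3.0 * u**2     # its derivative

# xi_i = sigma_ii <s phi(s) - phi'(s)>.  For this source the average is
# 9/5 - 3 = -1.2 (subgaussian), so sigma_ii = -1 makes xi_i positive.
avg = float(np.mean(s * phi(s) - dphi(s)))
xi = -1.0 * avg
mu0_crit = 1.0 / xi             # critical prefactor for the 1/tau annealing
```

A positive ξᵢ confirms that σ_ii = −1 is the stabilizing choice of sign here, consistent with equation 3.2.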
4.2 Escape from the Initial Transient. Unfortunately, the optimal fixed points described in the previous section are not the only stable fixed points of equation 4.3. In some cases, the algorithm will converge to a suboptimal solution in which one or more potentially detected signals remain unlearned and the corresponding entries in R decay to zero. The stability of these points is due to the O(μ²) term in equation 4.3, which becomes less significant as the learning rate is reduced, in which case the corresponding negative eigenvalue of the Jacobian eventually vanishes. Higher-order terms then lead to instability and escape from this suboptimal fixed point. One can therefore avoid trapping by selecting a sufficiently low learning rate during the initial transient. Consider the simplest case where K = M = 1, in which case the matrix R reduces to a scalar: R = R₁₁ and σ = σ₁₁. Expanding equation 4.3 around R = 0,

\[ \frac{dR}{d\tau} = -\tfrac{1}{2} \langle \phi^2(z) \rangle \mu^2 R + \tfrac{1}{6} \kappa_4 \langle \phi'''(z) \rangle \sigma \mu R^3 + O(R^5). \tag{4.8} \]
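Equation 4.8 can be integrated directly to illustrate trapping versus escape. This sketch assumes φ(y) = y³ and a unit-variance uniform source, for which ⟨φ²(z)⟩ = ⟨z⁶⟩ = 15, ⟨φ'''(z)⟩ = 6, κ₄ = −1.2, and σ = −1; the escape condition d|R|/dτ > 0 then reads R² > 6.25μ:

```python
def drift(R, mu, c2=15.0, k4=-1.2, c3=6.0, sigma=-1.0):
    """Leading-order drift of eq. 4.8 near R = 0 for the cubic nonlinearity
    and a unit-variance uniform source (constants as stated above)."""
    return -0.5 * c2 * mu**2 * R + (k4 * c3 * sigma / 6.0) * mu * R**3

def integrate(R0, mu, tau_max=2000.0, dt=0.01):
    """Forward-Euler integration; R is capped at 1 since the small-R
    expansion is only a sketch of the full dynamics."""
    R = R0
    for _ in range(int(tau_max / dt)):
        R = min(R + dt * drift(R, mu), 1.0)
    return R

R0 = 0.3                           # initial overlap
escaped = integrate(R0, mu=0.01)   # R0**2 > 6.25 * 0.01: cubic term wins
trapped = integrate(R0, mu=0.2)    # R0**2 < 6.25 * 0.2: decays back to 0
```

The same initial overlap thus either escapes or collapses to R = 0 depending only on the learning rate, which is the trapping mechanism analyzed in this section.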
Here, κ₄ is the fourth cumulant of the source distribution, and the brackets denote averages over z ∼ N(0, 1). Although R = 0 is a stable fixed point, the range of attraction is reduced as μ → 0, until eventually instability occurs. The condition under which one will successfully escape the fixed point is found by setting d|R|/dτ > 0:

\[ \sigma = \mathrm{Sign}\left( \kappa_4 \langle \phi'''(z) \rangle \right), \tag{4.9} \]

μ […] 10⁶) and will result in learning timescales of O(N²) for O(N⁻¹) learning rates. Equation 4.3 does exhibit fixed points of this type for particular initial conditions. Consider the case K = M = 2 as an example. If initially R₁₁ ≃ R₂₁ and R₁₂ ≃ R₂₂, then the dynamics will preserve this symmetry until instabilities due to slight initial differences lead to escape from an unstable fixed point. This symmetry breaking is necessary for good performance, since each projection must specialize to a particular source signal. Sufficiently small differences in the initial values of the entries in R will typically occur only for very large N, much larger than the typical values currently used in ICA. A very small learning rate is then required to avoid trapping in a fixed point near the initial conditions, as discussed in the previous section. This initial trapping is far more serious than the symmetric fixed point, since it requires a learning rate of O(N⁻²) in order to escape, resulting in a far greater loss of efficiency. In practice, symmetric fixed points do not appear to be a serious problem, and we have not observed any such fixed points in simulations of finite systems. This may be due to the highly stochastic nature of the initial dynamics, in which fluctuations are large compared to the average dynamical trajectory. This is in contrast to the picture for backpropagation, for example, where fluctuations result in relatively small corrections to the average trajectory (Barber, Sollich, & Saad, 1996). The strong fluctuations observed here may help break any symmetries that might otherwise lead to trapping in a symmetric fixed point, although a full understanding of this effect requires careful analysis of the multivariate diffusion equation describing the dynamics near the initial conditions.

5 Simulation Results
The theoretical results in the previous section are for the limiting case where N → ∞. In practice, we should verify that the results are relevant in the case of large but finite N. In this section, simulation evidence is presented
Figure 3: 100-dimensional data (N = 100) are produced from a mixture containing a small number of uniformly distributed sources. Figures on the left (a–c) are for a single nongaussian source and a single projection (M = K = 1), while figures on the right (d–f) are for two nongaussian sources and two projections (M = K = 2). Each column shows examples of learning with the same initial conditions and data but with different learning rates. From top to bottom: η = 10⁻⁵ (ν = 0.1), η = 10⁻⁴ (ν = 1), and η = 4 × 10⁻⁴ (ν = 4).
demonstrating that the trapping predicted by the theory occurs in finite systems. Figures 3a through 3c show results produced by an algorithm learning a single projection from 100-dimensional data with a single nongaussian
(uniformly distributed) component (N = 100, M = K = 1). The matrices A and W are randomly initialized with orthogonal, normalized columns. Similar results are obtained for other random initializations. A cubic nonlinearity is used, and σ is set to −1, although the adaptive scheme for setting σ suggested by Hyvärinen and Oja (1998) gives similar results. In each example, dashed lines show the maxima of the potential in Figure 2. Figure 3a shows the learning dynamics from a single run with η = 10⁻⁵ (ν = 0.1). The dynamics follows a relatively smooth trajectory in this case, and much of the learning is determined by the cubic term in equation 4.10. With this choice of learning rate, there is a strong dependence on the initial conditions, with a larger initial magnitude of R often resulting in significantly faster learning. However, recall that a high value for R cannot be chosen without prior knowledge of the mixing matrix. Figure 3b shows the learning dynamics with a larger learning rate, η = 10⁻⁴ (ν = 1), for exactly the same initial conditions and sequence of data. In this case, the learning trajectory is more obviously stochastic and is initially confined within the unstable suboptimal state with R ≃ 0. Eventually the system leaves this unstable state and quickly approaches R ≃ 1. In this case, the dynamics is not particularly sensitive to the initial magnitude of R, although the escape time can vary significantly due to the inherent randomness of the learning process. In Figure 3c, the learning dynamics is shown for a larger learning rate, η = 4 × 10⁻⁴ (ν = 4). In this case, the system remains trapped in the suboptimal state for the entire simulation time. The analysis in section 4.2 is strictly valid only for the case of a single nongaussian source and a single projection. However, similar trapping occurs in general, as demonstrated in Figures 3d through 3f.
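The single-source experiment described here can be reproduced in miniature (a sketch, not the author's code: uniform source, cubic nonlinearity, σ = −1, random orthogonal mixing; `run_ica` and its default parameters are ours):

```python
import numpy as np

def run_ica(N=50, T=5000, eta=1e-4, seed=0):
    """Miniature version of the M = K = 1 experiment: one uniformly
    distributed source mixed into N dimensions by a random orthogonal A,
    learned with the rule of equation 3.1 (cubic phi, sigma = -1,
    alpha = 0.5).  Returns the overlap |R| = |w . a_s| and |w|."""
    rng = np.random.default_rng(seed)
    A = np.linalg.qr(rng.standard_normal((N, N)))[0]
    a_s, A_n = A[:, 0], A[:, 1:]                  # source and noise channels
    w = rng.standard_normal(N)
    w /= np.linalg.norm(w)                        # random normalized start
    for _ in range(T):
        s = rng.uniform(-np.sqrt(3), np.sqrt(3))  # unit-variance source
        x = a_s * s + A_n @ rng.standard_normal(N - 1)
        y = w @ x
        w = w + eta * (-1.0) * x * y**3 + 0.5 * w * (1.0 - w @ w)
    return abs(w @ a_s), float(np.linalg.norm(w))
```

Per the text, runs with small ν = ηN² drift smoothly toward |R| = 1 over long times, while large ν pins |R| near its initial O(1/√N) value; the orthonormalization term keeps |w| close to 1 throughout.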
The components of R are plotted for an algorithm learning two projections from 100-dimensional data with two nongaussian (uniformly distributed) components (N = 100, M = K = 2). The different learning regimes identified in the single-component case are mirrored closely in this two-component model.

6 Conclusion
An on-line Hebbian ICA algorithm was studied for the case in which the data comprise a linear mixture of gaussian and nongaussian sources, and a solution to the dynamics was obtained for the idealized scenario in which the number of nongaussian sources is finite, while the number of gaussian sources is infinite. The analysis confirmed the stability conditions found by Hyvärinen and Oja (1998), and the eigensystem characterizing the asymptotic regime was determined. However, it was also shown that there exist suboptimal fixed points of the learning dynamics, which are stabilized by stochastic effects under certain conditions. The simplest case of a single nongaussian component was studied in detail. The analysis revealed that typically a very low learning rate (η = O(N⁻²), where N is the data dimension) is required to escape this suboptimal fixed point, resulting in a long
learning time of O(N³) iterations. Simulations of a finite system support these theoretical conclusions. The suboptimal fixed point studied here has some interesting features. In the limit η → 0, the dynamics becomes deterministic, and fluctuations due to the stochastic nature of on-line learning vanish. In this case, the suboptimal fixed point is unstable, but the Jacobian is zero at the fixed point (in the one-dimensional case), indicating that one must go to higher order to describe the dynamics. Standard methods for describing the dynamics of on-line algorithms have all been developed in the neighborhood of fixed points with negative eigenvalues and are not applicable in this case (Heskes & Kappen, 1993). Furthermore, stability of the fixed point is induced by fluctuations. This is contrary to the intuition that fluctuations may be beneficial, resulting in quicker escape from suboptimal fixed points. In the present case, one has precisely the opposite effect: stochasticity stabilizes an otherwise unstable fixed point. In similar studies of on-line PCA (Biehl & Schlösser, 1998) and backpropagation algorithms (Biehl & Schwarze, 1995; Saad & Solla, 1995a, 1995b), suboptimal fixed points have been found that are also stabilized when the learning rate exceeds some critical value. However, the scale of the critical learning rate stabilizing these fixed points is typically O(N⁻¹), much larger than in the present case. Also, the resulting timescale for learning is O(N²) with a very small prefactor (in practice, an O(N) term will dominate for realistic N). These fixed points reflect saddle points in the mean flow, while here we have a flat region, and escape is through much weaker higher-order effects. This type of suboptimal fixed point is more reminiscent of those found in studies of small networks, which often have a much more dramatic effect on learning efficiency (Heskes & Wiegerinck, 1998).
It is unclear whether on-line ICA algorithms based on maximum-likelihood and information-theoretic principles (see, for example, Amari, Cichocki, & Yang, 1996; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996) exhibit suboptimal fixed points similar to those studied here. These algorithms estimate a square demixing matrix and require a different theoretical treatment from the projection model considered here, since there may be no simple macroscopic description of the system for large N.

Appendix A: Derivation of the Dynamical Equations
From equation 3.1, one can calculate the change in R and Q (defined in equation 4.1) after a single learning step,

\[ \Delta R = \eta \Sigma \phi(y) s^T + \alpha (I - Q) R, \]
\[ \Delta Q = \eta \Sigma \left( I + \alpha(I - Q) \right) \phi(y) y^T + \eta \Sigma \, y \phi(y)^T \left( I + \alpha(I - Q) \right) + 2\alpha (I - Q) Q + \alpha^2 (I - Q)^2 Q + \eta^2 \phi(y) x^T x \phi(y)^T. \tag{A.1} \]
Here, the definition in equation 2.2 and the constraint in equation 2.4 have been used to set xᵀA_s = sᵀ. One can obtain a set of differential equations
in the limit N → ∞ using a statistical mechanics formulation that has previously been applied to the dynamics of on-line PCA algorithms (Biehl, 1994; Biehl & Schlösser, 1998) as well as other unsupervised and supervised learning algorithms (see, e.g., Biehl & Schwarze, 1995; Saad & Solla, 1995a, 1995b; and contributions in Saad, 1998). To obtain differential equations, one should scale the parameters of the learning algorithm in an appropriate way, in particular, g ≡ μ/N. Typically one chooses a = O(1), but in order to obtain an analytical solution, it is more convenient to choose a ≡ a_0/N before taking N → ∞ and then take the limit a_0 → ∞. The dynamics do not appear to be sensitive to the exact value of a as long as a ≫ g, and it is therefore hoped that the dynamical equations are valid for a = O(1), which is usually the case. The learning rate is taken to be constant here, but the dynamical equations are also valid when the learning rate is changed slowly, as suggested for the annealed learning studied in section 4.1. As N → ∞ one finds,

$$\frac{dR}{d\tau} = \mu\sigma\,\langle \phi(y)s^{T} \rangle + a_{0}(I - Q)R, \qquad \text{(A.2)}$$
$$\frac{dQ}{d\tau} = \mu\sigma\,\langle \phi(y)y^{T} + y\phi(y)^{T} \rangle + \mu^{2}\langle \phi(y)\phi(y)^{T} \rangle + 2a_{0}Q(I - Q), \qquad \text{(A.3)}$$
where τ ≡ t/N is a rescaled time parameter. The angled brackets denote averages over y, as defined in equation 4.2. In deriving equations A.2 and A.3, one should check that fluctuations in R and Q vanish in the limit N → ∞. This relies on an assumption that R = O(1), which may not be appropriate in some cases. For example, in section 4.2, a suboptimal fixed point is analyzed where it is more appropriate to consider R = O(1/√N) and a more careful treatment of fluctuations is required. As a_0 is increased, Q approaches I. If one sets Q − I ≡ q/a_0 and makes the a priori assumption that q = O(1), then,

$$\frac{1}{a_{0}}\frac{dq}{d\tau} = \mu\sigma\,\langle \phi(y)y^{T} + y\phi(y)^{T} \rangle + \mu^{2}\langle \phi(y)\phi(y)^{T} \rangle - 2q + O(1/a_{0}). \qquad \text{(A.4)}$$
As a_0 → ∞, one can solve for q,

$$q = \frac{1}{2}\left( \mu\sigma\,\langle \phi(y)y^{T} + y\phi(y)^{T} \rangle + \mu^{2}\langle \phi(y)\phi(y)^{T} \rangle \right), \qquad \text{(A.5)}$$
which is consistent with the a priori assumption. Substituting this result into equation A.2 leads to equation 4.3 in the main text. This is an example of adiabatic elimination of fast variables (Gardiner, 1985, section 6.6) and greatly simplifies the dynamical equations.
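As an aside, the mechanics of adiabatic elimination can be seen in a minimal invented fast–slow pair (the variables below are illustrative stand-ins, not the order parameters R and Q): integrating the full system reproduces the reduced equation obtained by slaving the fast variable to its quasi-static value.

```python
# Toy illustration of adiabatic elimination (cf. Gardiner, 1985, sec. 6.6).
# Invented system, unrelated to the ICA dynamics:
#   dx/dt = -x * q            (slow variable)
#   dq/dt = a0 * (x**2 - q)   (fast variable for large a0)
# Slaving q to its quasi-static value q = x**2 gives dx/dt = -x**3.
a0, dt, steps = 200.0, 1e-4, 10_000   # integrate up to t = 1

x, q = 1.0, 0.0
for _ in range(steps):                # full fast-slow system (Euler)
    x, q = x - x * q * dt, q + a0 * (x * x - q) * dt

x_red = 1.0
for _ in range(steps):                # reduced slow equation (Euler)
    x_red -= x_red ** 3 * dt

print(abs(x - x_red) < 0.01)  # True: the fast variable is effectively eliminated
```

After a transient of duration of order 1/a_0 the two trajectories agree, which is exactly the simplification exploited in equation A.5.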
Magnus Rattray
Appendix B: Eigensystem of Asymptotic Jacobian
The Jacobian of dR/dτ as μ → 0 is defined (divided by μ),

$$J_{ijkl} = \frac{\partial}{\partial R_{kl}} \left( \frac{1}{\mu} \frac{dR_{ij}}{d\tau} \right) \bigg|_{R = R^{*},\ \mu = 0}. \qquad \text{(B.1)}$$
This is a tensor rather than a matrix because the system's variables are in a matrix. One can think of pairs of indices (i, j) and (k, l) as each representing a single index in a vectorized system. If the dynamics is equivalent to gradient descent on some potential function, then the above quantity is proportional to the (negative) Hessian of this cost function. The Jacobian is not guaranteed to be symmetric in the present case, so this will not be possible in general. From equation 4.3 one obtains,

$$J_{ijkl} = -\delta_{ik}\delta_{jl}\left( \xi_{i} + \tfrac{1}{2}\xi_{j} \right) - \tfrac{1}{2}\delta_{il}\delta_{jk}\xi_{i}, \qquad \text{(B.2)}$$

with,

$$\xi_{i} = \begin{cases} s_{ii}\,\langle s_{i}\phi(s_{i}) - \phi'(s_{i}) \rangle & \text{for } i \le \min(K, M), \\ 0 & \text{otherwise.} \end{cases}$$
One must solve the following eigenvalue problem,

$$\sum_{kl} J_{ijkl} V^{nm}_{kl} = \lambda^{nm} V^{nm}_{ij}, \qquad \text{(B.3)}$$

where λ^{ij} and V^{ij}_{kl} are the eigenvalues and eigenvectors, respectively. A solution is required for all i ≤ K and j ≤ M in order to get a complete set of eigenvalues,

$$\lambda^{ii} = -2\xi_{i}, \qquad V^{ii}_{kl} = \delta_{ik}\delta_{il},$$
$$\lambda^{ij} = -\tfrac{1}{2}(\xi_{i} + \xi_{j}), \qquad V^{ij}_{kl} = \delta_{ik}\delta_{jl} - \delta_{jk}\delta_{il} \qquad \text{for } i < j \le K,$$
$$\lambda^{ij} = -(\xi_{i} + \xi_{j}), \qquad V^{ij}_{kl} = \xi_{i}\delta_{ik}\delta_{jl} + \xi_{j}\delta_{jk}\delta_{il} \qquad \text{for } i > j,$$
$$\lambda^{ij} = -\xi_{i}, \qquad V^{ij}_{kl} = \delta_{ik}\delta_{jl} \qquad \text{for } j > K. \qquad \text{(B.4)}$$
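The eigensystem above can be checked numerically. The sketch below builds the tensor of equation B.2 for arbitrary positive ξ values (stand-ins, not derived from any particular nonlinearity or source statistics) and confirms that each mode listed in equation B.4 satisfies Σ_{kl} J_{ijkl} V_{kl} = λ V_{ij}.

```python
import numpy as np

N, K = 4, 2                          # xi is nonzero only on the first K indices
rng = np.random.default_rng(0)
xi = np.zeros(N)
xi[:K] = rng.uniform(0.5, 2.0, size=K)

# J_ijkl = -delta_ik delta_jl (xi_i + xi_j / 2) - (1/2) delta_il delta_jk xi_i
d = np.eye(N)
J = (-np.einsum('ik,jl,i->ijkl', d, d, xi)
     - 0.5 * np.einsum('ik,jl,j->ijkl', d, d, xi)
     - 0.5 * np.einsum('il,jk,i->ijkl', d, d, xi))

def is_eigen(V, lam):
    return np.allclose(np.einsum('ijkl,kl->ij', J, V), lam * V)

def E(i, j):                         # elementary matrix with a 1 at (i, j)
    M = np.zeros((N, N)); M[i, j] = 1.0; return M

i, j = 0, 1                          # a pair with i < j inside the K block
ok = [
    is_eigen(E(i, i), -2 * xi[i]),                               # diagonal mode
    is_eigen(E(i, j) - E(j, i), -0.5 * (xi[i] + xi[j])),         # antisymmetric mode
    is_eigen(xi[i] * E(i, j) + xi[j] * E(j, i), -(xi[i] + xi[j])),
    is_eigen(E(i, 3), -xi[i]),                                   # mode with xi_j = 0
]
print(all(ok))  # prints True
```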
Acknowledgments
This work was supported by an EPSRC award (GR/M48123).

References

Amari, S.-I., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind source separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Barber, D., Sollich, P., & Saad, D. (1996). Finite size effects in on-line learning of multi-layer neural networks. Europhysics Letters, 34, 151–156.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Biehl, M. (1994). An exactly solvable model of unsupervised learning. Europhysics Letters, 25, 391–396.
Biehl, M., & Schlösser, E. (1998). The dynamics of on-line principal component analysis. Journal of Physics A, 31, L97–L103.
Biehl, M., & Schwarze, H. (1995). Learning by on-line gradient descent. Journal of Physics A, 28, 643–656.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44, 3017–3030.
Gardiner, C. W. (1985). Handbook of stochastic methods. New York: Springer-Verlag.
Heskes, T. M., & Kappen, B. (1993). On-line learning processes in artificial neural networks. In J. Taylor (Ed.), Mathematical foundations of neural networks (pp. 199–233). Amsterdam: Elsevier.
Heskes, T. M., & Wiegerinck, W. (1998). On-line learning with time-correlated examples. In D. Saad (Ed.), On-line learning in neural networks (pp. 251–278). Cambridge: Cambridge University Press.
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general non-linear Hebbian-like learning rules. Signal Processing, 64, 301–313.
Leen, T. K., Schottky, B., & Saad, D. (1998). Two approaches to optimal annealing. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Müller, K.-R., Ziehe, A., Murata, N., & Amari, S.-I. (1998).
On-line learning in switching and drifting environments with application to blind source separation. In D. Saad (Ed.), On-line learning in neural networks (pp. 93–110). Cambridge: Cambridge University Press.
Saad, D. (Ed.). (1998). On-line learning in neural networks. Cambridge: Cambridge University Press.
Saad, D., & Solla, S. A. (1995a). Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74, 4337–4340.
Saad, D., & Solla, S. A. (1995b). On-line learning in soft committee machines. Physical Review E, 52, 4225–4243.
Wang, L.-H., & Karhunen, J. (1996). A unified neural bigradient algorithm for robust PCA and MCA. International Journal of Neural Systems, 7, 53–67.

Received November 8, 2000; accepted May 2, 2001.
LETTER
Communicated by Eric Mjolsness
A Neural-Network-Based Approach to the Double Traveling Salesman Problem

Alessio Plebe
[email protected]
Angelo Marcello Anile
[email protected]
Department of Mathematics and Informatics, University of Catania, I-95125 Catania, Italy

The double traveling salesman problem is a variation of the basic traveling salesman problem where targets can be reached by two salespersons operating in parallel. The real problem addressed by this work concerns the optimization of the harvest sequence for the two independent arms of a fruit-harvesting robot. This application poses further constraints, like a collision-avoidance function. The proposed solution is based on a self-organizing map structure, initialized with as many artificial neurons as the number of targets to be reached. One of the key components of the process is the combination of competitive relaxation with a mechanism for deleting and creating artificial neurons. Moreover, in the competitive relaxation process, information about the trajectory connecting the neurons is combined with the distance of neurons from the target. This strategy prevents tangles in the trajectory and collisions between the two tours. Results of tests indicate that the proposed approach is efficient and reliable for harvest sequence planning. Moreover, the enhancements added to the pure self-organizing map concept are of wider importance, as proved by a traveling salesman problem version of the program, simplified from the double version for comparison.
1 Introduction
The traveling salesman problem (TSP) is probably the most famous topic in the field of combinatorial optimization, and many studies have searched for a reliable solution. In this problem, the objective is to find the shortest route between a set of N randomly located cities, in which each city is visited only once. The number of possible routes grows exponentially with an increase in the number of cities. This makes an exhaustive search for the shortest route unfeasible, even if there is only a small number of cities. The high level of interest in the TSP has arisen from the wide range of related

Neural Computation 14, 437–471 (2001)
© 2001 Massachusetts Institute of Technology
optimization problems in fields ranging from parts placement in electronic circuits, to routing in communication network systems, to resource planning. (For general overviews of the TSP, see Lawler, Lenstra, Rinnooy Kan, & Shmoys, 1985, and Reinelt, 1994.) However, real problems are seldom defined in terms of just the pure TSP; they often involve multiple traveling salesmen planning routes between a single set of cities and include additional constraints. A well-known case is the vehicle routing problem (VRP), involving the design of a set of minimum-cost delivery routes for a fleet of vehicles, originating and terminating at a central depot, serving a set of customers (Golden & Assad, 1988; Laporte, 1992; Fisher, 1995). In the multi-vehicle covering tour problem (m-CTP), only a subset of all vertices has to be collectively covered by the m vehicles (Hodgson, Laporte, & Semet, 1998; Hachicha, Hodgson, Laporte, & Semet, 2000). Examples of the many possible additional constraints are time deadlines for serving customers (Thangiah, Osman, Vinayagamoorthy, & Sun, 1993), time windows (Desrochers, Desrosiers, & Solomon, 1992), and equity in customer satisfaction over periodic services (Dell, Batta, & Karwan, 1996). From a theoretical point of view, all such problems could be mapped into an equivalent TSP formulation, but in practice this is not a viable route to an efficient solution. Therefore, there is a vast literature on heuristics tailored to each abstract class in the family of TSP variants. The work presented here considers the case in which there are two salesmen who each visit an independent subset of the cities. As in the TSP, each city should be visited only once, and each tour should start and end in the same city. In this case, the best solution is the one that minimizes both the difference between the lengths of the two salesmen's routes and the total distance traveled in visiting all of the cities.
This work has been developed to plan the path of the two arms of a robot that was designed to harvest oranges. The two arms are electrically driven and move along the direction of unit vectors in a spherical coordinate system. The 2-TSP algorithm described here has been successfully implemented in the CRAM-OPR harvesting robot prototype. A general description of this robot and results of harvesting in real conditions are available in Recce, Taylor, Plebe, and Tropiano (1996) and Plebe and Grasso (in press). The 2-TSP algorithm incorporates two of the constraints arising from the robotic application. One is the partition of the space allowed for the two salesmen. The total space where the targets lie is divided into an overlap area, where both salesmen can reach targets, and two disjoint areas, where only one of the salesmen can move. The other constraint is that in the overlap area, the paths planned by the two salesmen cannot cross. Despite being conceived for a real application with specific constraints, the solving algorithm also turns out to be a very promising neural solution to the pure TSP.
1.1 Neural Network Approach. The idea of investigating neural networks as an alternative to classical heuristics in the operations research domain in solving the TSP has gained widespread attention since the early work of Hopfield and Tank (1985) and Durbin and Willshaw (1987) (see Potvin, 1993, for a recent survey). Both pioneer methods basically perform a gradient-descent search of an appropriate energy function. The Hopfield-Tank model suffers from an expensive network representation, with a number of nodes proportional to the square of the number of cities. Despite subsequent efforts in the tuning of the involved parameters (Kamgar-Parsi & Kamgar-Parsi, 1992; Lai & Coghill, 1992) and in several enhancements (Lo, 1992; Van den Bout & Miller, 1989), there is an intrinsic difficulty in the quadratic nonconvex formulation of the TSP objective function, with many more local minima than a standard linear formulation (Smith, 1996). Instead, a linear formulation is shown to be the energy function of the so-called elastic net paradigm of Durbin-Willshaw, originating from models of the biological visual cortex, which has been adopted and extended by several researchers (Boeres, Carvalho, & Barbosa, 1992; Vakhutinsky & Golden, 1995). Although quite accurate, this method is still computationally expensive. Most recently, several algorithms have relied on the self-organizing feature map (SOM) (Kohonen, 1984, 1990, 1995), which is also a biologically motivated model and apparently shares some similarity with the elastic net but is not mathematically equivalent to an energy-function formalism (Erwin, Obermayer, & Schulten, 1992). First attempts at using the SOM to solve the TSP can be found in Fort (1988) and Heuter (1988). In this method, the activity and weights of a set of artificial neurons are adapted to match the topological relationship of the targets within the input set by means of iterative competitive learning.
The input sample is presented to the neurons, and the connection weights of the neuron with the best match, together with those of neighboring neurons, are modified in the direction that strengthens the match. This process results in a neural encoding that gradually relaxes toward a valid tour. Although in its original form the SOM does not perform very well for the TSP, it is an attractive paradigm, which has been used by several researchers as a basis for more refined schemes (Angéniol, Vaubois, & Texier, 1988; Fritzke & Wilke, 1991; Burke & Damany, 1992; Burke, 1994, 1996; Aras, Oommen, & Altinel, 1999). In general, the relaxation process of the SOM is not sufficient to find an optimal tour, since in most cases there is a mismatch between the relative spatial distribution of randomly placed neurons and the location of the targets. This drawback has been addressed in the algorithm of Angéniol et al. (1988) by a network that starts with just one neuron; neurons are cloned as long as they win on more than one city during the same iteration step. The same algorithm has been extended by Goldstein (1990) to the m-TSP problem.
440
Alessio Plebe and Angelo M. Anile
In the “guilty net” of Burke, the number of neurons is fixed, but the function used in the competition is modified with a penalty term for neurons that win too often. Aras et al. (1999) included in the SOM adaptation a mechanism for restoring the centroid of the overall network to the centroid of the targets.

2 The New Heuristics in the 2-TSP
The 2-TSP presented here is an algorithm based on the SOM mechanism, with the addition of two main heuristics.

2.1 Regeneration. One strategy is to combine the competitive relaxation with a mechanism for deleting and creating neurons, referred to as regeneration. The question then becomes when to perform the regeneration process and where to perform it along the tours. In the approach here, the latter question is addressed by increasing the amount of information acquired during the map relaxation process. During each competition step, information on the spatial mismatch between individual targets and neurons is accumulated in local indicators called activity. In addition, in this algorithm, explicit information concerning the history of the competition process is associated with each target not yet assigned. The former question—when to perform the regeneration—is addressed through a continuous monitoring of the tour relaxation, synthesized into two global indicators, called frustration and knowledge. The frustration measure increases when there are local failures of individual neurons to map to a single target and decreases when a final match is made between a neuron and a target. The knowledge is a measure of the number of stored dependencies between a target and surrounding neurons. These stored dependencies are used to sort out the mismatch between the distribution of targets and neurons in regeneration mode. The value of both of these parameters decreases with the successful discovery of the optimal tour. The first parameter can be seen as a measure of the need to regenerate the neural network, and the second determines the potential usefulness of the regeneration process. There is no need to perform a regeneration if the frustration is low, and even if it is high, there is no benefit if the knowledge is too small to drive the creation and deletion of neurons.

2.2 Competition Strategy. The other new heuristic in the 2-TSP concerns the competition process.
Unlike other adaptive neural approaches, the competition rule is based not only on the minimization of the distance between neurons and candidate targets; it also includes the distance to the trajectory connecting the neurons. At any point in the competition process when a segment of the trajectory of the tour is closer to a target than any of the neurons, the winning neuron is the nearest along that trajectory rather than simply the nearest neuron. This modification brings two major benefits in the algorithm: it reduces the
tangles that can occur in the trajectory, and it ensures that the two tours do not intersect. As a minor effect, this competition process also acts as a limit on the number of allowed candidates in a competition, with a speed benefit. Prior work with adaptive networks has also introduced criteria to limit the candidates in the competition process. Fritzke and Wilke (1991) use a local selection of the tour, essentially to achieve better speed. The “guilty net” of Burke (1996) inhibits multiple winning neurons to ensure node separation. In this work, the candidates are also limited to a variable-length section of the trajectory between the neurons. In addition, the orientation of the tour trajectory with respect to the city is also used as a preliminary selection of candidates.

3 Description of the Algorithm
First, a formal definition of the problem dealt with by the algorithm will be given. In a two-dimensional (2D) space where the coordinates of the targets are normalized in the interval [0, 1], it is assumed that the common area allowed to both salesmen is defined by two limits on the x dimension. The left salesman extends on the right side up to x_2, and the right salesman extends to the left up to x_1, with x_1 < x_2. Given a set of N targets T, let H(T) be the shortest tour for T and d(H(T)) be its length. H(T) is a set of N segments h_i connecting two targets T_i^-, T_i^+, with T_i^+ = T_{i+1}^-, T_N^+ = T_0^- and T_i^+, T_i^- ∈ T. The double traveling salesman problem addressed here is formally the following:

Problem 1 (2-TSP). From all pairs of target sets {L, R}, L ⊂ T, R ⊂ T, that satisfy:

$$\mathcal{L} \cup \mathcal{R} = \mathcal{T}; \qquad \mathcal{L} \cap \mathcal{R} = \emptyset;$$
$$h_l \cap h_r = \emptyset, \qquad \forall h_l \in H(\mathcal{L}),\ \forall h_r \in H(\mathcal{R});$$
$$0 < x_l < x_2, \qquad \forall \{x_l, y_l\} \in h_l \in H(\mathcal{L});$$
$$x_1 < x_r < 1, \qquad \forall \{x_r, y_r\} \in h_r \in H(\mathcal{R}),$$

find the two sets {L*, R*} ∈ {{L, R}} such that:

$$\max\left( d(H(\mathcal{L}^{*})),\ d(H(\mathcal{R}^{*})) \right) = \min_{\mathcal{L}, \mathcal{R}} \left\{ \max\left( d(H(\mathcal{L})),\ d(H(\mathcal{R})) \right) \right\}.$$
The solution is the shortest path, with the condition that the paths of the two arms do not cross. Since the arms work in parallel, the total time required is the longest time spent by one of the arms. Theoretically, the 2-TSP can be reduced to several TSP problems by assigning each combination of targets to one tour, from a minimum of two targets up to N/2, and the remaining to the other tour. In the worst case,
Table 1: 2-TSP Algorithm.

Input: the set of targets T, the limits of the overlap area x_1, x_2
  T is split into the sets O, D of targets in the overlap and in the nonoverlap area
  initialize the tours X^l, X^r with |X^l| + |X^r| = |T|, U ← T, default to regeneration mode
while U ≠ ∅
  if in adaptation mode:
    choose randomly a target T_j ∈ U
  else (in regeneration mode):
    choose T_j ∈ U with the largest activity E_T
  identify one section of the tour if T_j ∈ D, or two sections if T_j ∈ O, and compute from the candidate set C the winner neuron X_w ∈ X^l ∪ X^r
  if in regeneration mode:
    if E_w ≥ 0 (X_w is not yet assigned):
      x_w ← t_w, U ← U − {T_j}
    else (X_w is already assigned):
      delete the worst neuron X, and add a new neuron near X_w with x ← t_w
  else (in adaptation mode):
    if X_w is a previous winner on T_j for t times:
      x_w ← t_w, U ← U − {T_j}
    else (X_w is a (relatively) new winner on T_j):
      change the weights of the winner x_w ← x_w + (t_w − x_w)g and of its neighbors with a smaller learning rate
  update the process parameters K and F, the activity E
  according to the value of parameters K, F, current g, number of cities with positive activity E_T, choose next process mode: adaptation or regeneration
Output the tours X^l, X^r
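The adaptation-mode update in Table 1 is the usual SOM move: the winner steps toward the target by the learning rate g, and its chain neighbors by a smaller rate. A minimal sketch (the neighbor factor of 0.5 and all data are invented for illustration):

```python
def adapt(tour, w, target, g, neighbor_factor=0.5):
    """Move winner tour[w] toward target at rate g, and its two chain
    neighbors at a smaller rate (Table 1, adaptation mode)."""
    n = len(tour)
    for idx, rate in ((w, g),
                      ((w - 1) % n, g * neighbor_factor),
                      ((w + 1) % n, g * neighbor_factor)):
        x, y = tour[idx]
        tour[idx] = (x + (target[0] - x) * rate, y + (target[1] - y) * rate)
    return tour

tour = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (0.5, 0.5)]
adapt(tour, w=1, target=(0.5, 0.4), g=0.8)
# The winner has moved 80% of the way toward the target; its neighbors 40%.
```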
this requires

$$\sum_{i=2}^{N/2} \frac{N!}{(N-i)!}$$

TSP computations. All of these TSPs should first be checked for collisions, and the shortest tour that meets the conditions in problem 1 is selected. The approach taken here is to find an approximate solution that fully complies with the collision-avoidance condition and is sufficiently close to the optimal solution. The algorithm, summarized in Table 1, includes an initialization phase and an iterative process described in detail in the following sections. The input of the algorithm is a set of targets T, which is divided according to the horizontal limitation of the two salesmen,

$$\mathcal{O} = \{T \in \mathcal{T} : x_1 < x_T < x_2\}; \qquad \mathcal{D} = \{T \in \mathcal{T} : x_T < x_1 \ \vee\ x_T > x_2\},$$

where x_T is the horizontal component of the vector t associated with the target T.
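The split is direct to implement; a minimal sketch with assumed overlap limits and made-up targets:

```python
x1, x2 = 0.4, 0.6                        # assumed overlap limits
targets = [(0.10, 0.5), (0.45, 0.2),     # hypothetical target coordinates
           (0.55, 0.9), (0.80, 0.3)]

# O: targets both arms can reach; D: targets reachable by only one arm.
O = [t for t in targets if x1 < t[0] < x2]
D = [t for t in targets if t[0] < x1 or t[0] > x2]
print(len(O), len(D))  # prints: 2 2
```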
The 2-TSP works with a fixed number of neurons N, equal from the start to the number of targets. In most of the prior work based on SOM neural networks, the network is initialized with a small number of neurons (e.g., one neuron in the work of Angéniol et al., 1988), and the number increases during the search for the best tour. Only Burke (1994, 1996) has started with a fixed number of neurons equal to the number of targets, but with no addition and deletion mechanism. After an initial setup of the two tours, the algorithm is iterated through a process of adaptation and regeneration until all targets are assigned to neurons. Throughout the iteration process, the number of neurons never changes, since the creation and deletion of neurons is balanced.

3.1 The Starting Tours. The initialization process should place in the space all the neurons, organized in two initial left and right tours, X^l, X^r, with X = X^l ∪ X^r, X^l ∩ X^r = ∅, |X| = |T|. The general principle for initializing the tours in the 2-TSP would be to lay two polygonal tours with the vertices approximating the centers of gravity of the targets in some discrete partition of the whole space and with neurons on the edges approximating the target density in the same partition. Calling A_i the areas into which the space related to one tour is partitioned and C the closed polygonal interpolating the centers of gravity of targets in A_i, the linear density of neurons on the initial tour, ν(s), shall approximate the area density ρ(A) of the targets,

$$\int_{C_i} \nu(s)\, ds = \iint_{A_i} \rho(A)\, dA, \qquad \text{(3.1)}$$
where C_i is the section of C inside A_i. The choice of the partition can vary in granularity and geometry. An efficient solution is to divide each of the two subspaces into radial sectors, from the center of gravity of all targets in the subspace, at fixed angular steps. Because of the weak sensitivity of the algorithm to the initial tour shape, a simpler solution is adopted, based on a tessellation of the space into 16 squares of equal area (4 × 4). The neurons initially are placed into these squares, where the number of neurons is set to be equal to the number of targets contained in the square. The center of gravity (c_i) of the targets is computed for each of the squares, and piecewise linear loops are connected between the squares (see Figure 1). More precisely, neurons in a square i are placed over a line segment whose end points are

$$\frac{\beta(i-b)\,c_i + (1-\beta)\,c_b}{1-\beta+\beta(i-b)}\,; \qquad \frac{\beta(f-i)\,c_i + (1-\beta)\,c_f}{1-\beta+\beta(f-i)}, \qquad \text{(3.2)}$$

where b is the index of the first nonempty square backward from i and f the index of the first nonempty square forward. When all squares are populated with
Figure 1: The partition of the problem space into 16 regions and the shape of the two starting tours. The medium-sized dots are the targets (oranges), the larger dots are centroids for each square, and the small dots are the starting locations for the artificial neurons.
neurons, equation 3.2 reduces to

$$\beta c_i + (1-\beta)\,c_{i-1}\,; \qquad \beta c_i + (1-\beta)\,c_{i+1}.$$

If an index is out of the boundary, it will be wrapped around it. These segments are connected, and the neurons are placed equidistant from each other along the segment within each square. The proportion of segments populated with neurons and joint segments between them is ruled by β. For example, with β = 2/3, along the line connecting c_i and c_{i+1}, at distance (1/3)D(c_i, c_{i+1}) from c_i there will be the end point of the segment carrying the neurons of A_i. At distance (2/3)D(c_i, c_{i+1}), there will be the starting point of the segment carrying the neurons of A_{i+1}. In between, there will be a joint segment without neurons. (See Figure 1 for an example with 50 targets.) While for a number of conventional TSP heuristics the choice of a good starting tour has an important impact (Perttunen, 1994), experimentation has revealed that for the 2-TSP, the initialization is not critical. Even a very crude initialization, like two squares with neurons placed without any relationship to the target distribution, leads to a worsening in time and quality of less than 1% on a 1000-target problem with respect to the solution just described. The initial assignment of nodes is independent of the size of the mechanical overlap between the two arms; the partition of targets between the two tours relies on the competition process, and each target initially included in the overlap set O is presented to both tours for the competition, in the way described below.
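When every square is populated, the reduced form of equation 3.2 places the neuron-carrying segment for square i between βc_i + (1 − β)c_{i−1} and βc_i + (1 − β)c_{i+1}. A sketch with invented centroids (wrap-around indexing stands in for the boundary rule):

```python
beta = 2.0 / 3.0
centroids = [(0.2, 0.2), (0.5, 0.3), (0.8, 0.6), (0.4, 0.8)]  # made-up c_i

def segment_endpoints(i, c, beta):
    """End points of the neuron-carrying segment for square i
    (reduced form of equation 3.2, all squares nonempty)."""
    n = len(c)
    cb, ci, cf = c[(i - 1) % n], c[i], c[(i + 1) % n]
    start = tuple(beta * u + (1 - beta) * v for u, v in zip(ci, cb))
    end = tuple(beta * u + (1 - beta) * v for u, v in zip(ci, cf))
    return start, end

start, end = segment_endpoints(1, centroids, beta)
# With beta = 2/3, 'start' lies 1/3 of the way from c_1 back toward c_0,
# and 'end' lies 1/3 of the way from c_1 toward c_2.
```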
3.2 The Iterative Loop. In the 2-TSP algorithm, one iteration step consists of the presentation of a single target T_j ∈ T and the competition for the winner neuron X_w ∈ X, followed by either the adaptation of synaptic weights or the neural regeneration process. In each iteration step, the target is selected randomly, and the overall likelihood of selecting a target is independent of its membership in the sets O or D. However, the assignment of common targets is more critical, since it is essential to avoid a premature assignment of free neurons in the overlapped area prior to an assessment of the remaining tours. For this reason, there is an explicit alternation in the selection process between distinct targets and common targets:

$$T_{j,n} \in \begin{cases} \mathcal{O}, & \text{when } \mathcal{D} = \emptyset \ \vee\ n \bmod \left( 1 + \left\lceil |\mathcal{D}| / |\mathcal{O}| \right\rceil \right) = 0, \\ \mathcal{D}, & \text{otherwise,} \end{cases} \qquad \text{(3.3)}$$
where ⌈·⌉ is the ceiling function. One conventional practice that reduces the time spent in finding the winner neuron for each target is to restrict the search to a local neighborhood of the target (e.g., Fritzke & Wilke, 1991). In our algorithm, there is a modified version of this approach, based on shrinking the neighborhood of a target during the process. The set of neurons within the neighborhood of target T_j on the nth iteration is referred to as S_j^{k,n}, with k ∈ {l, r}. If T_j ∈ D, there is only one S_j^{k,n} selection; otherwise, if T_j ∈ O, the selection is done for the left tour S_j^{l,n} and the right tour S_j^{r,n}.
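The alternation rule of equation 3.3 is easy to state in code; a minimal sketch (function name and set sizes invented for illustration):

```python
import math

def pick_set(n, size_D, size_O):
    """Which set supplies the target at iteration n (equation 3.3).
    Assumes the overlap set O is nonempty."""
    if size_D == 0 or n % (1 + math.ceil(size_D / size_O)) == 0:
        return 'O'
    return 'D'

# With |D| = 6 and |O| = 2, a common target is presented every 4th step.
choices = [pick_set(n, size_D=6, size_O=2) for n in range(8)]
print(choices)  # ['O', 'D', 'D', 'D', 'O', 'D', 'D', 'D']
```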
In order to determine the initial definition of S_j^{k,n}, the central neuron in the selection is found using

$$\bar{X} = \begin{cases} W(T_j, 1) & \text{if } W(T_j, 1) \neq \not b, \\ X_{A_i} & \text{otherwise,} \end{cases} \qquad \text{(3.4)}$$

where $\not b$ is the symbol for an undefined winner. If the target has already been selected, there will be a previous winner, given by the function W defined later in equation 3.9, and used as the current center of S_j^{k,n}. Otherwise, resorting to the information that T_j belongs to area A_i in the partition of the plane (see section 3.1), neuron X_{A_i} is used. X_{A_i} is the most central neuron in the segment whose end points are given by equation 3.2. The neighborhood selection spans L neurons backward and forward from the central neuron along the connected chain of artificial neurons. The neighborhood set of all target neurons is changed in each iteration step. In particular, the neighborhood set of the nonselected targets always decreases, while the neighborhood of the selected target may increase. More precisely, the value of L_p associated with each target T_p is updated at each
iteration (n) as

$$L_{p,0} = dN, \qquad p = 0 \ldots N; \qquad \text{[initial value of } L\text{]}$$
$$L_{p,n} = \left( 1 - \frac{1}{dN} \right) L_{p,n-1}, \qquad p = 0 \ldots N,\ p \neq j; \qquad \text{[decreases } L\text{]}$$
$$L_{j,n} = \frac{L_{j,n-1} + L_{j,0}}{2}, \qquad \text{[increases } L\text{]} \qquad \text{(3.5)}$$

with d < 1/2. At the beginning of the search process, S_j^{k,n} is large and includes many potential neurons within the neighborhood of each target. As the iteration proceeds, the neurons in the tour move closer to the associated targets, and the search gradually becomes more of a local process. Nevertheless, the neighborhood of a target that has been selected but not assigned by the competition is enlarged the next time that it is selected (L_j is increased). Actually, only a subset of the neurons in S_j^{k,n} competes to be closest to T_j, as described in section 3.3.

3.3 The Competition Rule. The orientation of the tour trajectory with respect to the target T_j is used to select possible candidate winner neurons out of the group defined by S_j. There is a set of candidates C_j^p for which the distance D(X_i, T_j) from target j to the candidate neuron i is measured and another set C_j^s for which the segment distance D^s(X_i X_{i+1}, T_j) from the segment of the tour connecting two candidate neurons to the target is used. The two sets are constructed according to the sign changes of the derivative of the distance along the trajectory
$$C_j^{p} \stackrel{\mathrm{def}}{=} \left\{ X_i \in S_j : \frac{dD}{ds}\bigg|_{X_i^+} \ge 0 \ \wedge\ \frac{dD}{ds}\bigg|_{X_i^-} \le 0 \right\}$$
$$C_j^{s} \stackrel{\mathrm{def}}{=} \left\{ X_i \in S_j : \frac{dD}{ds}\bigg|_{X_i^+} \le 0 \ \wedge\ \frac{dD}{ds}\bigg|_{X_{i+1}^-} \ge 0 \right\}. \qquad \text{(3.6)}$$
As long as the tour trajectory is oriented toward the target, the next neuron is closer to the target, and there is no need to compute the distance. It is necessary to compute the distance only when a change in this orientation occurs. A change in orientation can occur either at a neuron or along a part of the trajectory connecting two neurons. In the first case, it is more appropriate to use the point-to-point distance as a score in the competition for the winner neuron. In the second case, the distance is measured to the segment, and if the segment is closer than the nearer of the two neurons connected by the segment, it is selected as the winner. The effect of including the segment distance in the competition is illustrated in an example in Figure 2. In this case, if the segment distance is not included in the algorithm, the tour becomes longer and tangled.
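A minimal sketch of the candidate selection of equation 3.6, evaluating the distance-derivative signs with dot products along a closed toy tour (all data and names are invented; wrap-around indexing assumes a closed tour):

```python
def candidates(neurons, target):
    """Point candidates C_p (minimum of D at a neuron) and segment
    candidates C_s (minimum of D inside segment X_i -> X_{i+1})."""
    def dot(u, v): return u[0] * v[0] + u[1] * v[1]
    def sub(u, v): return (u[0] - v[0], u[1] - v[1])
    n = len(neurons)
    C_p, C_s = [], []
    for i in range(n):
        x_prev, x_i, x_next = neurons[i - 1], neurons[i], neurons[(i + 1) % n]
        d_minus = dot(sub(x_i, x_prev), sub(x_i, target))    # dD/ds at X_i^-
        d_plus = dot(sub(x_next, x_i), sub(x_i, target))     # dD/ds at X_i^+
        d_next = dot(sub(x_next, x_i), sub(x_next, target))  # dD/ds at X_{i+1}^-
        if d_minus <= 0 <= d_plus:
            C_p.append(i)        # orientation flips at the neuron itself
        if d_plus <= 0 <= d_next:
            C_s.append(i)        # orientation flips inside the segment
    return C_p, C_s

# Square tour with a target just below the bottom edge: the minimum of the
# distance lies inside the bottom and top segments, not at any neuron.
tour = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(candidates(tour, (0.5, -0.3)))  # ([], [0, 2])
```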
Figure 2: A typical situation when prior competition rules lead to a tangle in the tour (A). Neuron X5 will win and move toward the current target, together with X4 and X6 (B). With the current algorithm, the winner will be X2 .
In Figure 2A, the closest neuron to the selected target is X5, but a part of the tour, between X2 and X3, is much closer to the target. However, neither X2 nor X3 is close enough to win. Without including the segment distances, X5 would win and be moved, together with its neighbors X4 and X6, toward the selected target (see Figure 2B). Instead, X2 is selected as the winner and is moved toward the target. This safeguard feature inside the competition is one of the reasons that this algorithm can use values of the learning rate much higher than in a conventional SOM, as will be shown in section 4.1, with faster convergence. The speed of the algorithm is further increased by computing the distances only when a neuron or a segment could be a winner in the competition. The orientation of the tour in relation to the target is computed at both end points of each segment within the neighborhood S_j of a target. The segments and neurons that participate in the competition are found by computing the sign of the change in distance to the selected target at the end points of the segments. This is most easily calculated by taking the sign of the dot product of the vector from X_i to X_{i+1} along the segment of the tour and the vector
Alessio Plebe and Angelo M. Anile
from an end point of the segment to the target:

  sign(dD/ds |_{X_i^+}) = sign((x_{i+1} − x_i)^T (x_i − t_j)),
  sign(dD/ds |_{X_{i+1}^−}) = sign((x_{i+1} − x_i)^T (x_{i+1} − t_j)),      (3.7)
since, for example, sign((x_{i+1} − x_i)^T (x_i − t_j)) is the sign of cos ∠T X_i X_{i+1}. Table 2 shows the possible cases that need to be considered in the evaluation of a neuron in the trajectory, in accordance with equation 3.6. In the first case, the tour is progressing toward the target, so the neuron is not shortlisted. The trajectory is directed away from the target in the second case. In the third case, the tour has just passed the selected target, and the distance from the nearest point in the trajectory is used, so X_i ∈ C_j^s. The fourth case is when the trajectory was previously directed toward the target and changes on X_i; the distance from its position is used: X_i ∈ C_j^p. Note that the case dD/ds |_{X_i^+} > 0 and dD/ds |_{X_{i+1}^−} < 0 is physically impossible. All the X_i ∈ C_j^p ∪ C_j^s are
selected for the final competition. In this competition, the neuron with the smallest distance D wins. An example of the role of the segment orientation in the competition process is shown in Figure 3. The signs of the derivatives are shown in parentheses, and thin lines show the distances D taken into account. Note that some of these are point-to-point distances, and others are segment distances.

This competition process is also part of the mechanism for selecting collision-free tours. Let the two tours X^l and X^r be in any state of the competition process but not yet colliding, and let T_j be an arbitrary next-selected target. A tour intersection cannot take place if T_j is already included in one of the tours. Figure 4 shows an example in which the current algorithm avoids a collision between the two trajectories that would have occurred without the competition process already described. There is no absolute way to avoid a collision when the contended target is outside both tours. The rule described in equation 3.6 is applied independently on the trajectory of each tour, and at the end, the two results are compared. For an absolute guarantee of collision-free tours, it would be necessary to extend the rule to take into account simultaneously the orientation of the two trajectories with respect to the current target. In practice, the current algorithm is sufficient.

It is interesting to note that collision avoidance is included in the algorithm as a soft constraint, without imposing any check or penalty term in which crossing of the two tours is mathematically described. This pertains to the self-adaptive nature of the SOM, where an explicit formulation of the desired objective function is not required, and this is another advantage of
Table 2: Rules for Selecting Candidates for the Competition to the Selected Target.

  Number | ∂D/∂s |_{X_i^−} | ∂D/∂s |_{X_i^+} | ∂D/∂s |_{X_{i+1}^−} | Distance Type
  1      | ∀               | ≤ 0             | ≤ 0                 | None
  2      | ≥ 0             | ≥ 0             | ∀                   | None
  3      | ∀               | ≤ 0             | ≥ 0                 | D^s
  4      | ≤ 0             | ≥ 0             | ∀                   | D^p

Notes: Columns 2-4 are the inputs to the table: the two distance derivatives of the competing neuron and the derivative of the next neuron on the incoming trajectory. The output, in the fifth column, is the distance used as score in the competition, if any. A "Pattern" column in the original table shows an example trajectory configuration for each case (graphical, not reproduced). A distance is calculated only in cases 3 and 4.
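The sign tests of equation 3.7 and the shortlist rules of Table 2 can be sketched as follows (an illustrative sketch with hypothetical names; 'p' marks a point-distance candidate, 's' a segment-distance candidate):

```python
def sign(x):
    return (x > 0) - (x < 0)

def deriv_signs(tour, i, target):
    """Signs of dD/ds at the start (X_i^+) and end (X_{i+1}^-) of
    segment i, per equation 3.7: dot products of the segment direction
    with the vectors from each end point to the target."""
    seg = (tour[i + 1][0] - tour[i][0], tour[i + 1][1] - tour[i][1])
    at_start = sign(seg[0] * (tour[i][0] - target[0])
                    + seg[1] * (tour[i][1] - target[1]))
    at_end = sign(seg[0] * (tour[i + 1][0] - target[0])
                  + seg[1] * (tour[i + 1][1] - target[1]))
    return at_start, at_end

def candidates(tour, target):
    """Shortlist per Table 2: case 3 flags segment i (score D^s);
    case 4 flags neuron i (score D^p). Cases 1 and 2 yield none."""
    found = []
    for i in range(len(tour) - 1):
        plus_i, minus_next = deriv_signs(tour, i, target)
        if plus_i <= 0 and minus_next >= 0:      # case 3: minimum inside segment
            found.append(('s', i))
        if i > 0:
            _, minus_i = deriv_signs(tour, i - 1, target)
            if minus_i <= 0 and plus_i >= 0:     # case 4: minimum at neuron X_i
                found.append(('p', i))
    return found
```

A tour that turns past the target inside a segment yields a segment candidate; a tour that approaches and then recedes at a neuron yields a point candidate.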
Figure 3: An example of the competition process. This is a part of an example tour, in which the thick, solid lines correspond to set segments in the current tour, and there is an artificial neuron (not shown) at each end of each segment. The signs of the distance derivatives with respect to the target are shown in parentheses for each of the neurons. Following the rules described in equation 3.6, the distances to only four neurons out of nine are evaluated. Two of the measurements are to segments of the tour, and two of the distances are to artificial neurons. The thin lines show the minimum (perpendicular) distance to the evaluated line segments and the evaluated neurons.
the chosen approach. It is known that the most efficient local heuristics for the TSP, like Lin-Kernighan, are not suitable for accepting additional constraints, but methods based on an explicit list of constraints, like linear and integer programming (Nemhauser & Wolsey, 1988; Jünger, Reinelt, & Rinaldi, 1995), are also limited to modeling linear equalities and inequalities. For collision avoidance, the mathematical representation of the necessary inequality constraints would require a number of binary variables and extra constraints, which result in a significant increase in model size in terms of variables and constraints. There are several alternative methods for including nonlinear constraints directly, like special branch-and-bound formulations (Hajian & Richards, 1996) and translators from constraint logic to mixed integer programming (Rodosek & Wallace, 1996), but these are too expensive in terms of computation for the 2-TSP application.

3.4 Neural Adaptation. In the 2-TSP algorithm, the state of the neurons is not only reflected in the weight values; there is also another measure, the
Figure 4: An example of the mechanism for avoiding a collision between the two tours. (A) A potential initial situation that could lead to a crossing of the two tours. (B) The modification to the tours that would result using conventional competition rules. A neuron of the tour on the right wins and is pushed toward the target, crossing the left tour. Its two neighbors are also moved, as shown by the dashed arrows. (C) The modification of the tours that results using the current 2-TSP competition rule. The winner is part of the left tour, and it is moved toward the target, together with the unassigned neighboring neuron.
activity of the neurons, which is updated during the process. The activity is an accumulation of winning events for the same neuron and is a measure of the local mismatch between neuron and target density, which will be exploited during the regeneration phase. The activity will be high for neurons in an area crowded with targets and remain low where many neurons compete in an area with sparse targets. It is also a kind of energy level of the neuron in the transition from the free state to the state of being assigned to a target. Once the neuron is assigned, it will continue to compete, without the possibility of conquering further targets. Its activity will still be updated, but it will have a negative sign, being actually a negative potential for all the free neurons competing for the same targets. At the end of the competition at the nth iteration, the activity E of the winner is updated by the following rule:

  E_{w,n} ← E_{w,n−1} + 1 + α n/N,   when E_{w,n−1} > 0,
  E_{w,n} ← E_{w,n−1} − 1 − α n/N,   when E_{w,n−1} < 0.      (3.8)
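The update of equation 3.8 is simple to state in code (a sketch; the symbol α and the n/N term follow the reconstruction above):

```python
def update_activity(E_prev, alpha, n, N):
    """Equation 3.8: the winner's activity grows if it is free (E > 0)
    and sinks further if it is already assigned (E < 0)."""
    step = 1.0 + alpha * n / N
    return E_prev + step if E_prev > 0 else E_prev - step
```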
The absolute value given by equation 3.8 will be explained in the next subsection; the only effect for the adaptation process is that in the first case, X_w is a free neuron, and in the second case, it is already assigned. The target activity is updated by replacing the previous value, E^T_{j,n} ← |E_{w,n}|, and the winner is registered in the winner history for T_j, which is a function defined as

  W(T, t) : T × [0, τ] → X ∪ {∅}.      (3.9)

W is a function giving the winner for a target at discrete time lags. t is the delay of drawing on a target, t = 0 being the most recent time that the target has been drawn. Note that t is not an absolute time metric, since drawing events are not synchronous for the different targets. ∅ is the symbol for the indeterminate winner. At iteration n = 0 (at the beginning of the process), W = ∅, ∀T, ∀t. τ is the memory of W, the longest delay for which the winner is still recorded for every target. W is updated as follows:

  W(T_j, t) ← W(T_j, t − 1),   t ∈ [1, τ],
  W(T_j, 0) ← X_w.      (3.10)
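The winner history W of equations 3.9 and 3.10 behaves like a fixed-length shift register per target, which can be sketched as follows (illustrative; None stands for the indeterminate-winner symbol ∅):

```python
from collections import deque

def make_history(tau):
    """W(T, t) for one target, t = 0..tau, initially all indeterminate
    (equation 3.9)."""
    return deque([None] * (tau + 1), maxlen=tau + 1)

def record_winner(history, winner):
    """Equation 3.10: shift the lags and store the newest winner at t = 0."""
    history.appendleft(winner)

def stable_winner(history):
    """Condition 3.11: the same known winner fills the whole history."""
    return history[0] is not None and len(set(history)) == 1
```

Once `stable_winner` holds, the assignment of equation 3.12 can take place.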
Further updating of the activity of a neuron takes place only if the process is not in the regeneration state. If the following condition is met,

  W(T_j, t) = W(T_j, t + 1),   t ∈ [0, τ − 1],      (3.11)
X_w is the same winner for the whole history of T_j and is accepted as the node assigned to the target T_j:

  x_w ← t_j,   E_w ← −E_w.      (3.12)
Otherwise, the winner is updated toward the target, using the current learning rate, in a similar fashion to that used in Kohonen self-organizing maps (Kohonen, 1984):

  x_{w,n} ← x_{w,n−1} + η_n (t_j − x_{w,n−1}).      (3.13)
The learning rate η is normally reduced at each iteration by

  η_n ← (η_N / η_0)^{1/N} η_{n−1}.      (3.14)
η_0 is the value at n = 1. The parameter η_N is the value η would reach after N iteration steps in adaptation mode. η is also reduced to half its value each time the process moves into the regeneration state. The weights of all of the neurons in the neighborhood region S_j are updated by an amount that depends on the distance of the neuron along the connected segments from the winner neuron. The amount of movement of the ith neighboring neuron is given by

  x_{i,n} ← x_{i,n−1} + η_n (U/N)^{|w−i|} (t_j − x_{i,n−1}),      (3.15)
where U is the number of unassigned targets. The changes are symmetric in the neighborhood in each direction along the tour away from X_w. This update process is stopped for a particular neuron as soon as the neuron is assigned to a target. The parameterization of the neural adaptation in equations 3.8 and 3.15 normalizes the behavior with respect to N, so that the values of the parameters are almost independent of the size of the 2-TSP problem, and furthermore this stabilizes the relaxation process throughout the search for a tour. The ratio U/N in equation 3.15, for example, provides a more global change in the weights when the tour is still in an early stage and restricts the changes to nearby neurons in the later stages of the search process.

3.5 Neuron Regeneration. In each iteration of the algorithm, the values of two key parameters are used to determine if the search process is in the state of neural adaptation or neural regeneration. These two parameters are called frustration (F) and knowledge (K).
Let the following define the sets:

  X_n^− ≝ { X_{w,n′} : E_{w,n′−1} < 0,   n′ ∈ [n_o, n] },      (3.16)

  X_n^≠ ≝ { X_{w,n′} : X_{w,n′} ≠ ∅, W(T_{j,n′}, 1) ≠ ∅, X_{w,n′} ≠ W(T_{j,n′}, 1),   n′ ∈ [n_o, n] },      (3.17)

  X_n^= ≝ { X_{w,n′} : X_{w,n′} = W(T_{j,n′}, t), t ∈ [1, τ],   n′ ∈ [n_o, n] },      (3.18)

  T_n^+ ≝ { T_{j,n′} : W(T_{j,n′}, t) = ∅, t ∈ [1, τ],   n′ ∈ [n_o, n] }.      (3.19)
n_o is the last iteration since the process was in adaptation mode: at n = 0 or immediately after the end of a regeneration state. During neural adaptation, F and K are computed as follows:

  F_{n_o} = 0,   F_n = |X_n^−| + |X_n^≠| − |X_n^=| / 2;      (3.20)

  K_{n_o} = 0,   K_n = |T_n^+| − |X_n^=|.      (3.21)
F is increased if the current competition has not improved the tours. This happens in the following cases: every time the winner is already assigned to a different target (equation 3.16) or is different from the previous winner (equation 3.17). On the contrary, it is decreased by successful neural assignments (see equations 3.11 and 3.18). K is simply increased by any competition involving a new target (see equation 3.19) and decreased by an assignment of a neuron to a target. The regeneration state occurs when the following condition is met:

  F K > γ U.      (3.22)
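The bookkeeping of equations 3.20 through 3.22 amounts to a few set sizes and a product test (a sketch based on the reconstruction above; argument names are illustrative):

```python
def frustration_knowledge(X_minus, X_neq, X_eq, T_plus):
    """Equations 3.20 and 3.21: F grows with unproductive competitions,
    K with competitions that involve new targets."""
    F = len(X_minus) + len(X_neq) - len(X_eq) / 2.0
    K = len(T_plus) - len(X_eq)
    return F, K

def in_regeneration(F, K, U, gamma):
    """Condition 3.22: regenerate when F * K exceeds gamma times the
    number of unassigned targets U."""
    return F * K > gamma * U
```

A small gamma makes the product test easy to satisfy, so the process switches into regeneration readily, which is exactly the trade-off explored in section 4.1.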
When the condition described in equation 3.22 switches the process to the regeneration state, the competition process continues, but in these iteration steps, the target T_j is not drawn randomly. Instead, the target with the largest activity E^T_{j,n} is selected. The competition calculation finds the winning neuron X_w, and the action performed by the regeneration process depends on the activity E_w of this neuron. If E_w < 0, a real create-and-delete process occurs:

  X ← X − {X̃} + {X̂}.      (3.23)
X̃ is the worst neuron, using as a metric the activation of unassigned neurons:

  X̃ = arg min_X { E_X : E_X ≥ 0 }.      (3.24)
From equation 3.8, E_X is basically increased by 1 every time neuron X is the winner; therefore rule 3.24 selects the less successful neuron. There is an additional term α n/N in equation 3.8, typically smaller than 1 and increasing as the algorithm runs. Due to this term, neurons that become inactive earlier will be processed first. For example, if two neurons have been winners only once, the first to be removed by the regeneration mechanism will be the oldest winner. If there is more than one neuron with E = 0, X̃ is chosen randomly among such neurons.

X̂ is the newly generated neuron, placed in the same tour X^k as X_w, in a position identified by the index î given by

  î = w,       when D(X_{w−1} X_w, T_j) > D(X_w X_{w+1}, T_j),
  î = w + 1,   when D(X_{w−1} X_w, T_j) < D(X_w X_{w+1}, T_j),      (3.25)

and with weights given by

  x̂ = t_j.      (3.26)
If the activity of the winning neuron E_w is nonnegative, then it has not been assigned to a target, and the regeneration process assigns it to the current target, with x_i ← t_j and E_w ← 0, and no create-and-delete process takes place.

When the process is in the regeneration state, the condition of equation 3.22 is not checked, and the next iteration step remains in the regeneration state, causing the next T_j to be the one with the largest activity E^T_{j,n}. The normal state of neural adaptation is resumed only when max{E^T} ≤ 0, unless the learning rate is below a fixed threshold, η < η_2, with η_2 > η_N. The first of these conditions allows the regeneration process to continue until all of the unassigned targets that have been in a prior competition process are assigned. When the latter learning-rate condition is met, the neural adaptation never resumes, and the regeneration process continues on all of the remaining targets until the end of the search for the tours. This last condition is a very efficient shortcut in the final stage of the search process, when the neural convergence is very slow and the few targets left may take many iterations to attract neurons. The regeneration process leads to assignment of a target to a neuron at every step and therefore rapidly completes the development of the tour. When the neural adaptation process resumes, frustration F and knowledge K are always set to zero, as in equations 3.20 and 3.21.
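The create-and-delete step of equations 3.23 through 3.26 can be sketched as follows (illustrative names; the tour is treated as a closed list of 2-D points, and E is a parallel list of activities):

```python
import math

def seg_dist(a, b, p):
    """Distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    L2 = dx * dx + dy * dy
    t = 0.0 if L2 == 0 else max(0.0, min(1.0,
        ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / L2))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))

def create_and_delete(tour, E, w, target):
    """One regeneration step for a winner w with negative activity:
    delete the least-active free neuron (equations 3.23-3.24) and
    create a new neuron carrying the target's coordinates beside the
    winner (equations 3.25-3.26). Returns the insertion index."""
    # Equation 3.24: the worst neuron among the free (E >= 0) ones.
    worst = min((i for i in range(len(E)) if E[i] >= 0), key=lambda i: E[i])
    del tour[worst]
    del E[worst]
    if worst < w:
        w -= 1  # the deletion shifted the winner's index
    # Equation 3.25: pick the insertion index from the two adjacent segments.
    d_prev = seg_dist(tour[w - 1], tour[w], target)
    d_next = seg_dist(tour[w], tour[(w + 1) % len(tour)], target)
    i_hat = w if d_prev > d_next else w + 1
    tour.insert(i_hat, target)  # equation 3.26: the new weights equal t_j
    E.insert(i_hat, 0.0)
    return i_hat
```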
4 Results

4.1 Parameter Sensitivity. Several of the mathematical expressions in the algorithm include working parameters (equations 3.2, 3.5, 3.8, 3.9, 3.14, and 3.22). However, the algorithm works reasonably well over a broad range of most of the parameters, and in general the effort of fine-tuning will bring a very marginal improvement in terms of tour quality. The same has already been noted for the starting tour (which actually can be interpreted as another parameter choice) in section 3.1. Moreover, the effect of the parameters is almost independent of the problem size, which is automatically accounted for, as in equations 3.5, 3.8, 3.14, and 3.22.

The learning rate is a classic crucial parameter in neural-based TSP algorithms in balancing the speed of convergence on the final tour and the avoidance of local minima. Kohonen (1995) suggested in general a decrease from unity in the initial adaptation phase and a low value, like 0.02, during the final phase. In Angéniol et al. (1988), the learning rate is decreased from unity at every iteration using the rule η_n ← (1 − C_η) η_{n−1}, where a typical value of the constant C_η is 0.02; the same value is used in Goldstein (1990). With this setting on a 1000-target run, the average learning rate will be 0.05. In the 2-TSP, the alternative mechanism of regeneration and the selection of candidates, equation 3.6, allow for a higher learning rate to be applied without errors due to local minima. For example, with a typical setting, the 2-TSP will apply an average learning rate of 0.5 when processing a 100-target problem and an average learning rate of 0.6 for a 1000-target problem. Figure 5 shows the performance of the 2-TSP on a 250-target problem, with 20% overlap, at various settings of the learning rate. From equation 3.14, the learning rate is ruled by its initial value, η_0, and the value decayed after N iterations, η_N. In the figure, η_0 is varied in [0.1, 1.2], and the ratio η_0/η_N spans four decades.
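The quoted average learning rate of 0.05 for the geometric decay η_n ← (1 − C_η) η_{n−1} can be checked numerically (a quick sketch; starting from η = 1 and assuming one iteration per target over a 1000-target run):

```python
# Geometric decay of the learning rate, as in Angeniol et al. (1988).
C = 0.02          # typical value of the decay constant
eta, etas = 1.0, []
for _ in range(1000):
    etas.append(eta)
    eta *= 1.0 - C
avg = sum(etas) / len(etas)
# The average comes out close to 1 / (1000 * C) = 0.05, as stated in the text.
```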
The quality of a computed tour is measured as the gap from the best value, 100 (D/D̂ − 1), where D is the total distance traveled by the tours, and D̂ is the minimum D achieved over all runs, which in this case happens for η_0 = 0.85 and η_N = 0.000114. All data points are averages over 1000 runs, since results from one run are always affected by the random sequence of presentation of the targets during the process. There is a constraint placed on the selection of targets from the set D or O that reduces the likelihood of problematic sequences. It is evident in the figure that for a broad range of values ruling η, the surface of the gap is contained within 1% of the best case.

The most important parameter for the behavior of the 2-TSP is γ, which determines the transition between the regeneration and adaptation states. As is evident in equation 3.22, with lower values of γ, the algorithm easily switches into the regeneration state. This results in a reduction of the time required for the search but may lead to a premature assignment of targets and less optimal final tours.
Figure 5: Sensitivity of the 2-TSP quality to different learning-rate parameterizations. η_0 is the initial learning rate, and η_0/η_N indicates the span of the learning rate during the iteration.
Figure 6: Results from running the 2-TSP algorithm with different values of γ (x-axis) on 250 targets with 20% overlap of the search area of the two tours. (A) Number of iterations performed by the algorithm. (B) Gap of the total length of the two tours from the best value. (C) Gap of the mean difference between the two tours. B and C refer to the left-side y-axis, and A refers to the right-side one. Each data point corresponds to the average from 1000 different random tours.
Figure 6 shows how the performance of the algorithm depends on changes in the value of γ, using the same problem described for Figure 5. Curve A in Figure 6 confirms that when the neural adaptation is the prevailing process (larger values of γ), the convergence of the tours toward the targets requires
many more iterations. Up to γ equal to 10, most of the targets are actually assigned during the regeneration state, and since each iteration step in this state results in one assignment, there are nearly 250 steps of regeneration. For values γ > 2000, the regeneration is almost inhibited; in other words, the algorithm behaves more and more like an ordinary SOM, apart from the competition rule, and the number of iterations grows exponentially. With a total exclusion of the regeneration mechanism (γ → ∞), the final convergence will no longer be ensured, which is known from the mathematics of the SOM, as well as from experimental results (Jeffries & Niznik, 1994; Aras et al., 1999). Convergence of the algorithm with the regeneration is guaranteed, since during this process, one city is assigned at each iteration, and in the adaptation state, from equation 3.14, lim_{i→∞} η_i = 0. Therefore, for a number of iterations i large enough, η_i < η_2, so that the adaptation process is definitely inhibited, and the algorithm will terminate in regeneration mode.

Curve B in Figure 6 shows the overall optimization of the tours, expressed as a percentage gap, as in Figure 5. Curve C shows the effectiveness of the algorithm in balancing the length of the two tours, measured as 100 σ_D / D̂, where σ_D is the deviation between left and right tour lengths, and D̂ is the shortest total length over all trials. Neither parameter varies substantially with γ. Note that the best results lie in an area where the computation time is still very close to the lower limit. Other tests have shown that similar behavior occurs with a larger or smaller number of targets.

Figure 7 shows an example of the progress of the search process. In particular, it shows the times at which the algorithm switched between the regeneration and competition states. In this case, there are 100 targets, and γ is equal to 2. The plot of U shows the rate of assignment of the targets as a function of the number of iteration steps.
Whenever the amount of frustration and knowledge enables the regeneration to take place (e.g., near step 40), the number of unassigned targets decreases almost linearly. In this case, with γ equal to 2, there is no assignment during the neural adaptation state. As a result of applying equation 3.22, in the final part of the search process, the regeneration state is more likely than the adaptation state. In the example (after step 120), regeneration is the sole process, and there is a target assignment in each iteration step.

Figure 8 shows the evolution of the 2-TSP algorithm on a large problem. In this case, there are 2000 targets, and the two tours are completely overlapping. This figure contains the initial configuration, an intermediate step, and the final tours.

4.2 Performances on Double Problems. Since the 2-TSP has been designed for a specific real application, it is not easily comparable with respect to all its features with other algorithms, because the m-TSP variants with additional constraints available in the literature are quite different, and none was exactly
Figure 7: Plot of target assignment during the iteration of the 2-TSP algorithm. The middle panel shows the values of the two controlling parameters, frustration and knowledge, and the upper bar shows the state of the process: black during adaptation, white during regeneration.
formulated as for this robotic application. However, it is possible to compare the two most important figures, computation speed and tour length, with other m-TSP algorithms, neglecting their other features. A comparison has been done with goldstein, another SOM-based algorithm (Goldstein, 1990), which is a direct extension of the algorithm of Angéniol. When a target T_j is selected, the distance from a neuron X_i is weighted by an additional term,

  D*(T_j, X_i) = D(T_j, X_i) (1 + (d(k) − d̄) / d̄),      (4.1)
where d(k) is the total current distance of tour k ∈ {l, r}, and d̄ is the mean of the two total distances. In this way, neurons on an overloaded tour are less likely to win. Table 3 reports comparisons made on the same two problems used in the original work of Goldstein. The tour quality is slightly better for 2-tsp, but the major difference is in time performance, where goldstein is at least an order of magnitude slower. Both algorithms were run on a 266 MHz
Figure 8: An example 2-TSP result with 2000 targets and 100% overlap of the search space of the two arms. The upper two panels contain the starting tour and one of the intermediate steps. The lower panel shows the two final tours.

Table 3: The Algorithm Compared with Another m-TSP Neural Solver in Double Tour Problems.

  Problem |          goldstein          |            2-tsp
  Size    | Length | Deviation | Time   | Length | Deviation | Time
  100     | 10.33  | 0.68      | 127    | 10.13  | 0.08      | 7
  200     | 13.37  | 0.27      | 264    | 12.08  | 0.24      | 34

Note: The length is the sum of the two tours, the deviation is between tours, and the running time is in milliseconds.
Pentium II PC with Linux OS, with C code compiled using the GNU compiler at O1-level optimization, and timing measured with the getrusage system call. These are the conditions used in all computations in this work.

4.3 Collision Avoidance. The current algorithm does not guarantee that the two tours do not cross. Using a simulation, the probability that the rules given in equation 3.6 result in two tours that cross has been evaluated. A large number of different tours have been run, with sizes of the problem in the range of interest for the harvesting application. Results are shown in Table 4.
Table 4: Number of Final Tours with Collision over 100,000 2-TSP Runs, for Various Problem Sizes and Conditions.

  Problem Size | Fraction of Targets in the Overlap Area
               | 0.50 | 0.70 | 0.90
  10           | 2    | 12   | 5
  20           | 7    | 8    | 16
  30           | 12   | 5    | 4
  40           | 6    | 6    | 8
  50           | 6    | 12   | 7
  60           | 6    | 6    | 3
Simulations were run using 50% of the total space as the overlap area and several distributions of targets in the overlap and disjoint areas. The case with 50% of the targets in the overlap area corresponds to the uniform distribution. The number of tours in which the trajectories cross is fewer than 10 cases over 100,000 runs for most of the conditions. The values shown in the table were computed with γ = 0.1; other tests were done with γ = 2000 to check if the regeneration process had an influence on the collision avoidance. For such a value of γ, regeneration is almost inactive, and an inevitable side effect is an increase in computation time by a factor of at least 1000, so only 10,000 runs were performed. For most of the cases, a single collision occurred; two collisions were detected for 60 targets and 0.9 overlap. These numbers are too small for a statistical investigation; however, the avoidance is mainly due to the competition strategy, and the regeneration has no evident influence. The most important evidence is that collision episodes are very marginal.

These trajectory crossings would lead to a potential collision only if the two arms are simultaneously in the overlapping section of the corresponding tours. A real collision would never occur, since there are safety mechanisms in the low-level control system of the arms. If a collision condition occurs, the robot controller is forced into a nonnominal feedback system, which protects the arms but takes up valuable time that could be spent in the harvesting process. The very low collision probability makes the avoidance function of the algorithm sufficient for the application.

4.4 TSP Version of the Algorithm. The 2-TSP algorithm has been designed with the robotic application in mind; however, its key mechanisms, like the adaptation-regeneration process and the competition based on trajectories, seem to be an enhancement of the SOM paradigm of more general significance.
In order to assess the effectiveness of the 2-TSP solutions, the algorithm has been simplified to solve the classical one-TSP problem, where competitors abound. A benchmark set, which is being used more often in
Table 5: Quality of the TSP Version of the Existing Algorithm Compared with Other Neural Algorithms, Run on Benchmark TSP Problems with Known Optimum Solutions.

  Problem Name       | Size (#) |            Gap (%)
                     |          | guilty | aras  | ang   | 2-tsp
  bier127            | 127      | 31.2   | 2.76  | 3.71  | 5.92
  eil51              | 51       | 10.6   | 2.82  | 3.99  | 3.05
  eil76              | 76       | 14.1   | 5.02  | 6.13  | 5.76
  eil101             | 101      | 22.7   | 4.61  | 6.68  | 6.83
  kroA200            | 200      | 37.8   | 5.72  | 8.50  | 8.49
  lin105             | 105      | 7.6    | 1.98  | 6.49  | 4.93
  pcb442             | 442      | -      | 11.07 | 17.47 | 11.08
  pr107              | 107      | 81.7   | 0.73  | 1.79  | 1.93
  pr124              | 124      | 47.1   | 0.08  | 5.12  | 0.33
  pr136              | 136      | 40.4   | 4.53  | 6.89  | 2.54
  pr152              | 152      | 42.8   | 0.97  | 1.30  | 1.17
  rat195             | 195      | 65.6   | 12.23 | 15.41 | 9.25
  rd100              | 100      | 10.4   | 2.10  | 4.50  | 4.07
  st70               | 70       | 12.0   | 1.48  | 2.67  | 1.19
  Average            |          | 32.6   | 4.00  | 6.47  | 4.79
  Standard deviation |          | 23.0   | 3.66  | 4.7   | 3.27

Note: Quality is measured as the percentage gap from the known optimum.
the TSP community, is the TSPLIB (Reinelt, 1991), a growing collection of sample instances. Recent work (Aras et al., 1999) has included, for the first time, a comparison of neural TSP solutions on some TSPLIB benchmarks. Therefore, runs of the 2-TSP have been performed on the same instances.¹ Results are given in Table 5. All 14 problems have a known optimum tour; therefore, the output is given in terms of the gap from the optimum value. The compared algorithms are Burke's "guilty-net" guilty, Angéniol's SOM extended by Aras et al. aras, the original Angéniol SOM ang, and the 2-TSP 2-tsp. All algorithms have been described in the introduction and compared using best results, as in the original work. The guilty algorithm clearly has a lower-quality output, about an order of magnitude worse
¹ For clarity, the algorithm of this work will always be called 2-TSP, even when applied to a "single" problem.
Figure 9: Distribution of TSP tour lengths from 500 runs of the standard problem p654. These results are compared with Reinelt's (1994) measurements of the performance of several methods for solving the TSP: NN: nearest neighbor, IN: insertion, CH: Christofides, 2O: 2-opt, LK: Lin-Kernighan.
than the other neural approaches. It does not converge for problem pcb442, so statistics for this algorithm have been computed on one fewer sample. The aras, ang, and 2-tsp algorithms are of similar quality, with the best average quality for aras (4.00), followed by 2-tsp (4.79) and ang (6.47). For the 2-tsp, the quality is less affected by the specific instance of the problem, as indicated by the standard deviation.

4.5 Comparison with Nonneural Solutions. It is well known that neural network approaches to the TSP, while of theoretical importance in developing and demonstrating capabilities of neural models, are reputed to be far less effective than the state of the art in operational research. Therefore, comparisons with classical operational research methods are necessary in assessing the progress of neural network approaches. Figure 9 shows the results of running the TSP version of the algorithm 500 times on one of the benchmark problems, p654, compared with several other classical heuristics of the operational research domain (Reinelt, 1994).

The nearest-neighbor (NN) and insertion (IN) methods successively build a tour according to some construction rules (Rosenkrantz, Stearns, & Lewis, 1977). In the NN algorithm, a new node is added at every step by simply choosing the nearest target that is not yet assigned, and the tour is closed at the end by connecting the last and first nodes. The results from an implementation of the NN algorithm shown in Figure 9 use an additional heuristic to avoid a characteristic problem with NN. At the end of the search
with the NN algorithm, isolated targets (the so-called forgotten nodes) have to be connected at a high cost. The IN algorithm starts from a small tour, and at every iteration, an edge is broken to insert a new node. In the version of the algorithm presented in the figure, the target whose minimal distance to the current tour is maximal is inserted. Moreover, the candidate set is reduced to the targets connected to the tour by a subgraph edge. If this set is empty, an arbitrary target is inserted.

The Christofides (CH) heuristic (Christofides, 1976) is one of a number of methods that use a minimum spanning tree (the shortest tree connecting all targets, without cycles) as a basis for generating tours. In the CH algorithm, a Eulerian graph (a graph connecting all targets, not necessarily with unique paths) is built from the minimum spanning tree by connecting every odd-degree node using exactly one edge. Then multiple paths are eliminated to get the final tour.

The 2-opt algorithm (2O) is a general and widely used improvement principle that consists of eliminating two edges and reconnecting the two resulting paths in a different way. The shorter of the original and the newly obtained tour is selected at each iteration. In this algorithm, the initial tour must be constructed using a different technique. In its simplest form, the 2O algorithm can be iterated on every node in a certain sequence until the tour stops improving. This requires a very large and unbounded number of iterations. For this reason, the number of 2O iteration steps is often restricted using some criterion, such as limiting moves to the k-nearest-neighbor subgraph (a graph built on the nodes of the tour by connecting to each target its k nearest neighbors), but this can reduce the quality of the final tour. In the case presented in Figure 9, the starting tour is obtained with NN, and the termination of the 2O algorithm is not restricted.
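The NN construction and the 2O improvement just described can be sketched together (a compact illustration for small instances, not the implementations benchmarked in Figure 9; at least one target is assumed):

```python
import math

def tour_length(tour):
    """Total length of a closed tour over 2-D points."""
    return sum(math.dist(tour[i], tour[(i + 1) % len(tour)])
               for i in range(len(tour)))

def nearest_neighbor(targets):
    """NN construction: repeatedly append the closest unvisited target."""
    tour, rest = [targets[0]], list(targets[1:])
    while rest:
        nxt = min(rest, key=lambda p: math.dist(tour[-1], p))
        tour.append(nxt)
        rest.remove(nxt)
    return tour

def two_opt(tour):
    """2-opt improvement: remove two edges and reconnect by reversing
    the path between them; repeat until no move shortens the tour."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n if i > 0 else n - 1):
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                if (math.dist(a, c) + math.dist(b, d)
                        < math.dist(a, b) + math.dist(c, d) - 1e-12):
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

On a four-point square given in a crossing order, a single 2-opt move removes the crossing and recovers the optimal tour.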
A more flexible set of heuristics is found in the LK algorithm, originally developed by Lin and Kernighan (1973). This algorithm accepts 2O moves that temporarily increase the length of the tour. This is clearly a way to avoid local minima, but it may considerably increase the running time. In practical implementations of LK, a move is defined as a sequence of simple submoves, like 2O, and the number of submoves in a move, and the alternatives for each submove, are limited (Mak & Morton, 1993). Despite their age, Lin-Kernighan-based algorithms are still the most successful tour-finding approaches (Codenotti, Manzini, Margara, & Resta, 1996; Johnson & McGeoch, 1995; Helsgaun, 2000). The solution compared here has moves with a maximum of 15 submoves and up to two alternatives for each of the first three submoves. The starting tour was again obtained with the NN algorithm.

4.6 Time Performances. The speed of the algorithm in the simple TSP version has been tested and compared with neural and classical solutions using the benchmarks of the TSPLIB library, now with problem sizes of up to 13,509. The Lin-Kernighan heuristic has been at the heart
An Approach to the Double Traveling Salesman Problem
465
of many solvers, with different designs and performances. The version linkern compared here is a state-of-the-art implementation including several enhancements for speed efficiency (Helsgaun, 2000). For the neural solutions, comparison has been made with an improvement of the Durbin-Willshaw elastic net algorithm, elnet, which runs about 15 times faster than the original one (Vakhutinsky & Golden, 1995), and with ang. No original time performances are available for aras; however, from the description of the algorithm, we believe it cannot be faster than ang, since it is based on ang itself with an added mechanism counteracting the SOM relaxation. The runs of elnet in the original work were computed on a Sparc-10 machine and have been downscaled by a factor of 3.0, a figure derived from several tests of the 2-tsp algorithm on the 266 MHz Pentium II and on a Sparc-10. In the original work of Helsgaun, linkern was run on a 300 MHz G3 Power Macintosh, which is about the same class as the 266 MHz Pentium II, so the data have been left unchanged. Angéniol's SOM algorithm ang has been run on the same 266 MHz Pentium II machine as the 2-tsp. Figure 10 presents the comparison. The top graph spans all the problem sizes, and the bottom graph is a zoom close to the origin. The data points correspond to the 23 benchmarks: pr76, pr107, pr124, pr136, pr152, pr226, pr264, pr299, pr439, pr1002, pr2392, kroA200, pcb442, rl1304, rl1323, rl1889, pcb3038, fl417, fl3795, fnl4461, rl5915, rl5934, usa13509. For elnet, the original data points were used: 19 random problems of sizes from 50 to 500. The 2-tsp is clearly much faster than the compared algorithms in the range of problem sizes used for the evaluation. The time required is almost purely a function of the problem size, while for both linkern and ang, one problem may require more running time than a larger one.
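Scaling comparisons of this kind are commonly summarized by fitting a power law t ≈ c·N^a to the (size, time) pairs, that is, by a least-squares slope in log-log coordinates. A minimal sketch (the timing data below are synthetic placeholders generated from an exact power law, not the measured values):

```python
import math

def fit_exponent(sizes, times):
    """Least-squares slope of log(t) versus log(N), i.e. the exponent a
    in the model t ~ c * N**a."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic timings following t = 1e-4 * N**2.1 exactly (placeholder data):
sizes = [100, 500, 1000, 5000, 13509]
times = [1e-4 * n ** 2.1 for n in sizes]
```

Applied to real measurements, the recovered exponent is the N^a figure reported for each algorithm.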
It has to be noted that linkern has impressive quality performances; even on the largest problem, with 13,509 cities, the gap is 0.008%. For this problem, the gap of the 2-tsp was 15%, solved in 6 minutes against the 13 hours of linkern. Gaps of ang are of the same order of magnitude as those of the 2-tsp, in accordance with the previous analysis, while the computation time is quite a bit longer. For the 2-tsp, the running time is a function of the problem size and is weakly affected by the problem typology. On the contrary, for both linkern and ang, one problem may require much more running time than a larger one, depending on the target typology. The curves are fitted with N^a functions in order to express the asymptotic time complexity of the algorithms. The values of a for elnet, linkern, ang, and 2-tsp are 1.7, 2.4, 2.0, and 2.1, respectively. This complexity would suggest that elnet will perform better on larger problems (as reported by the authors); however, data for this algorithm are available only up to 500 cities, and it is not known whether a similar figure would be maintained over a wider range. Also, the 2-tsp exhibits a lower complexity when fitting
Figure 10: Comparison of time performance on 23 TSP benchmarks for the following algorithms: linkern, curve A and star marks; elnet, curve B and cross marks; ang, curve C and triangle marks; 2-tsp, curve D and diamond marks. The plot on the top spans all the sizes of the TSP problems; the plot on the bottom is a zoom on the smallest problems, up to 500 targets.
data in a limited range of sizes; for example, up to 4000 cities, the fit is N^1.8. In the regeneration process, it is possible to compute the theoretical complexity, since one target is assigned in each iteration, and during each iteration, a number of scans proportional to the total number of targets is performed. Therefore, the total time complexity becomes O(N^2). There are a couple of mechanisms reducing this figure: the rule selecting candidates (see equation 3.6) and the reduction of the scan section during the process
(see equation 3.5), which explain the measured less-than-quadratic complexity in the range up to 4000 targets. During the adaptation process, the complexity cannot be computed theoretically, since it does not depend on the problem size alone, and this portion of the overall process accounts for the higher measured complexity over the full range of problem sizes. For linkern as well, a theoretical assessment of time complexity is difficult; a typical reported value is O(N^2.2) (Helsgaun, 2000). However, taking these complexities for granted over a wide range of sizes, elnet would outperform ang only on problems larger than 50,000 cities, and would outperform the 2-tsp only on problems with more than 3 × 10^7 cities.

5 Conclusion
The 2-TSP algorithm described here, based on neural network concepts, proved to be a reliable yet simple solution to the 2-TSP problem. In the case of the harvesting robot application, the algorithm satisfies the real constraints of the working space and of collision avoidance between the two arms. Thanks to its simplicity and limited memory requirements, it is suitable for implementation on embedded real-time hardware without the need for powerful general-purpose computer hardware. The evaluation of the algorithm revealed no substantial dependence on the particular values of the parameters of the algorithm and no significant trade-off between the speed of the algorithm and the optimality of the resulting tours. The mechanics of the harvesting robot, with two telescopic arms in spherical coordinates, allows a simple projection of the path optimization problem into a 2D Euclidean space, which was used during the algorithm description. Multi-arm robots with different joint geometry in general require a higher-dimensional space for representing their path planning, especially with respect to collision avoidance. Furthermore, time optimization in robotics involves joint dynamics as well, requiring non-Euclidean space representations, typically Riemannian (Shin & McKay, 1985). While in principle the algorithm can be directly extended to any space representation, some of its features will certainly no longer work (like the collision avoidance), and its overall efficiency may not be as good as in Euclidean 2D space. While the 2-TSP algorithm was intended initially for the harvesting application, a TSP variant of the algorithm performed well in comparison with
the prior solutions to the TSP problem, especially as a quick finder of tours where very high quality is not the main requirement. In particular, several concepts of the 2-TSP algorithm, like combining neural adaptation and regeneration and taking into account the trajectory distance in the neural competition, are very general in nature and proved to enhance the SOM paradigm considerably, which appears to be the most promising neural strategy for solving TSP-class problems.

Appendix: List of Symbols
C    set of neuron candidates for a target competition
D    set of targets in disjointed areas not yet assigned
H    shortest tour
L    targets on the left tour
O    set of targets in the overlap area not yet assigned
R    targets on the right tour
S    set of neurons in the neighborhood of a neuron
T    set of all targets
U    set of targets not yet assigned
X    set of neurons
D    Euclidean distance function
Dp   Euclidean distance from a point to the target
Ds   Euclidean distance from a segment to the target
E    activity of a neuron
ET   activity of a target
F    frustration
K    knowledge
L    size of the neighborhood of a neuron
N    total number of targets and number of neurons
O    number of targets in the overlap area
T    target
U    number of unassigned targets
W    winner history function
X    neuron
h    segment of a tour
j    current target index
i    current neuron index (unless otherwise stated)
k    tour index ∈ {l, r}
n    current iteration index
s    length variable along the trajectory
w    index of current winner neuron
c    centroid of targets in a space subregion (vector of coordinates)
t    vector of target coordinates
x    vector of neuron weights
α    decay of neural activity
β    weighting factor for the tour initialization
δ    decay of neuron neighborhood size
χ    threshold for regeneration or adaptation state change
γ    learning rate
γ0   initial learning rate
γN   learning-rate decay
ν    linear density of neurons in the tour area
ρ    density of targets
τ    longest time lag of the winner history
Acknowledgments
This work was supported in part by the European Community ESPRIT Project 6715 CONNY (Robot Control Based on Neural Network Systems).

References

Angéniol, B., Vaubois, G. d. l. C., & Texier, J.-Y. L. (1988). Self-organizing feature maps and the travelling salesman problem. Neural Networks, 1, 289–293.
Aras, N., Oommen, B. J., & Altinel, I. K. (1999). The Kohonen network incorporating explicit statistics and its application to the traveling salesman problem. Neural Networks, 12, 1273–1284.
Boeres, M., Carvalho, L., & Barbosa, V. (1992). A faster elastic-net algorithm for the travelling salesman problem. In Proc. IJCNN-92 (Baltimore).
Burke, L. (1994). Neural methods for the traveling salesman problem: Insights from operations research. Neural Networks, 7(4), 532–541.
Burke, L. (1996). Conscientious neural nets for tour construction in the traveling salesman problem: The vigilant net. Computers and Operations Research, 23(2), 121–129.
Burke, L., & Damany, P. (1992). The guilty net for the traveling salesman problem. Computers and Operations Research, 19, 255–265.
Christofides, N. (1976). Worst case analysis of a new heuristic for the travelling salesman problem (Research Rep.). Pittsburgh: Carnegie-Mellon University.
Codenotti, B., Manzini, G., Margara, L., & Resta, G. (1996). Perturbation: An efficient technique for the solution of very large instances of the Euclidean TSP. INFORMS Journal on Computing, 8(2), 125–133.
Dell, F. R., Batta, R., & Karwan, M. H. (1996). The multiple vehicle TSP with time windows and equity constraints over a multiple day horizon. Transportation Science, 23(2), 121–129.
Desrochers, M. J., Desrosiers, J., & Solomon, M. (1992). A new optimization algorithm for the vehicle routing problem with time windows. Operations Research, 40, 342–354.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67, 47–55.
Fisher, M. L. (1995). Vehicle routing. In M. Ball, T. L. Magnanti, C. L. Monma, & G. L. Nemhauser (Eds.), Handbook in operations research and management science. Amsterdam: North-Holland.
Fort, J. C. (1988). Solving a combinatorial problem via self-organizing process: An application of Kohonen-type neural networks to the travelling salesman problem. Biological Cybernetics, 59, 33–40.
Fritzke, B., & Wilke, P. (1991). A neural network for the travelling salesman problem with linear time and space complexity. In Proc. IJCNN-91 (Singapore).
Golden, B. L., & Assad, A. A. (1988). Vehicle routing: Methods and studies. New York: North-Holland.
Goldstein, M. (1990). Self organizing feature maps for the multiple travelling salesmen problem. In Proc. INNC '90 (Paris).
Hachicha, M., Hodgson, M., Laporte, G., & Semet, F. A. (2000). Heuristics for the multi-vehicle covering tour problem. Computers and Operations Research, 27, 29–42.
Hajian, M. T., & Richards, E. B. (1996). Introduction of a new class of variables to discrete and integer programming problems. In Proc. Intern. Symposium on Combinatorial Optimization (London).
Helsgaun, K. (2000). An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research, 126, 106–130.
Heuter, G. J. (1988). Solution of the travelling salesman problem with an adaptive ring. In Proceedings of the IEEE International Conference on Neural Networks.
Hodgson, M., Laporte, G., & Semet, F. A. (1998). A covering tour model for planning mobile health care facilities in Suhum district, Ghana. Journal of Regional Science, 38, 621–638.
Hopfield, J. J., & Tank, D. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.
Jeffries, C., & Niznik, T. (1994). Easing the conscience of the guilty net. Computers and Operations Research, 21(4), 961–968.
Johnson, D. S., & McGeoch, L. A. (1995). The traveling salesman problem: A case study in local optimization. In E. H. L. Aarts & J. K. Lenstra (Eds.), Local search in combinatorial optimization. New York: Wiley.
Jünger, M., Reinelt, G., & Rinaldi, G. (1995). The traveling salesman problem. In M. Ball, T. L. Magnanti, C. L. Monma, & G. L. Nemhauser (Eds.), Handbook in operations research and management science. Amsterdam: North-Holland.
Kamgar-Parsi, B., & Kamgar-Parsi, B. (1992). Dynamical stability and parameter selection in neural optimization. In Proc. IJCNN-92 (Baltimore).
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78, 1464–1480.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Lai, W. K., & Coghill, G. C. (1992). Genetic breeding of control parameters for the Hopfield-Tank neural net. In Proc. IJCNN-92 (Baltimore).
Laporte, G. (1992). The vehicle routing problem: An overview of exact and approximate algorithms. European Journal of Operational Research, 59, 345–358.
Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., & Shmoys, D. B. (1985). The travelling salesman problem: A guided tour of combinatorial optimization. Chichester: Wiley.
Lin, S., & Kernighan, B. W. (1973). An effective heuristic algorithm for the travelling salesman problem. Operations Research, 21, 498–516.
Lo, J. T. H. (1992). A new approach to global optimization and its application to neural networks. In Proc. IJCNN-92 (Baltimore).
Mak, K. T., & Morton, A. J. (1993). A modified Lin-Kernighan traveling-salesman heuristic. Operations Research Letters, 13, 127–132.
Nemhauser, G. L., & Wolsey, L. A. (1988). Integer and combinatorial optimization. Chichester: Wiley.
Perttunen, J. (1994). On the significance of the initial solution in travelling salesman heuristics. Journal of the Operational Research Society, 45, 1131–1140.
Plebe, A., & Grasso, G. (in press). Localization of spherical fruits for robotic harvesting. Machine Vision and Applications.
Potvin, J. Y. (1993). The travelling salesman problem: A neural network perspective. ORSA Journal on Computing, 5, 328–348.
Recce, M., Taylor, J., Plebe, A., & Tropiano, G. (1996). Vision and neural control for an orange harvesting robot. In International Workshop on Neural Networks for Identification, Control, Robotics and Signal/Image Processing (Venice, Italy). IEEE Computer Society Press.
Reinelt, G. (1991). TSPLIB—a traveling salesman problem library. ORSA Journal on Computing, 3, 376–384.
Reinelt, G. (1994). The traveling salesman. Berlin: Springer-Verlag.
Rodosek, R., & Wallace, M. (1996). Translating CLP(R) programs to mixed integer programs: A first step towards integrating two programming paradigms. In Proc. Intern. Symposium on Combinatorial Optimization (London).
Rosenkrantz, D. J., Stearns, R. E., & Lewis, P. M. (1977). An analysis of several heuristics for the travelling salesman problem. SIAM Journal of Computing, 6, 563–581.
Shin, K. G., & McKay, N. D. (1985). Minimum time control of robotic manipulators with geometric path constraints. IEEE Transactions on Automatic Control, 31(6), 532–541.
Smith, K. (1996). An argument for abandoning the traveling salesman problem as a neural-network benchmark. IEEE Transactions on Neural Networks, 7(6), 1542–1544.
Thangiah, S. R., Osman, I. H., Vinayagamoorthy, R., & Sun, T. (1993). Algorithms for the vehicle routing problems with time deadlines. American Journal of Mathematical and Management Sciences, 13, 323–355.
Vakhutinsky, A. I., & Golden, B. (1995). A hierarchical strategy for solving traveling salesman problems using elastic nets. Journal of Heuristics, 2, 67–76.
Van den Bout, D. E., & Miller, T. K. (1989). Improving the performance of the Hopfield-Tank neural network through normalization and annealing. Biological Cybernetics, 62, 129–139.

Received March 14, 2000; accepted May 14, 2001.
VIEW
Communicated by Bard Ermentrout
What Geometric Visual Hallucinations Tell Us about the Visual Cortex Paul C. Bressloff
[email protected] Department of Mathematics, University of Utah, Salt Lake City, Utah 84112, U.S.A.
Jack D. Cowan
[email protected] Department of Mathematics, University of Chicago, Chicago, IL 60637, U.S.A.
Martin Golubitsky
[email protected] Department of Mathematics, University of Houston, Houston, TX 77204-3476,U.S.A.
Peter J. Thomas
[email protected] Computational Neurobiology Laboratory, Salk Institute for Biological Studies, San Diego, CA 92186-5800, U.S.A.
Matthew C. Wiener
[email protected] Laboratory of Neuropsychology, National Institutes of Health, Bethesda, MD 20892, U.S.A.
Many observers see geometric visual hallucinations after taking hallucinogens such as LSD, cannabis, mescaline, or psilocybin; on viewing bright flickering lights; on waking up or falling asleep; in "near-death" experiences; and in many other syndromes. Klüver organized the images into four groups called form constants: (I) tunnels and funnels, (II) spirals, (III) lattices, including honeycombs and triangles, and (IV) cobwebs. In most cases, the images are seen in both eyes and move with them. We interpret this to mean that they are generated in the brain. Here, we summarize a theory of their origin in visual cortex (area V1), based on the assumption that the form of the retino-cortical map and the architecture of V1 determine their geometry. (A much longer and more detailed mathematical version has been published in Philosophical Transactions of the Royal Society B, 356 [2001].) We model V1 as the continuum limit of a lattice of interconnected hypercolumns, each comprising a number of interconnected iso-orientation columns. Based on anatomical evidence, we assume that the lateral connectivity between hypercolumns exhibits symmetries, rendering it invariant under the action of the Euclidean group E(2), composed of reflections and translations in the plane, and a (novel) shift-twist action. Using this symmetry, we show that the various patterns of activity that spontaneously emerge when V1's spatially uniform resting state becomes unstable correspond to the form constants when transformed to the visual field using the retino-cortical map. The results are sensitive to the detailed specification of the lateral connectivity and suggest that the cortical mechanisms that generate geometric visual hallucinations are closely related to those used to process edges, contours, surfaces, and textures.

Neural Computation 14, 473–491 (2002)
© 2002 Massachusetts Institute of Technology

1 Introduction

Seeing vivid visual hallucinations is an experience described in almost all human cultures. Painted hallucinatory images are found in prehistoric caves (Clottes & Lewis-Williams, 1998) and scratched on petroglyphs (Patterson, 1992). Hallucinatory images are seen both when falling asleep (Dybowski, 1939) and on waking up (Mavromatis, 1987), following sensory deprivation (Zubek, 1969), after taking ketamine and related anesthetics (Collier, 1972), after seeing bright flickering light (Purkinje, 1918; Helmholtz, 1925; Smythies, 1960), on applying deep binocular pressure to the eyeballs (Tyler, 1978), in "near-death" experiences (Blackmore, 1992), and, most strikingly, shortly after taking hallucinogens containing ingredients such as LSD, cannabis, mescaline, or psilocybin (Siegel & Jarvik, 1975). In most cases, the images are seen in both eyes and move with them, but maintain their relative positions in the visual field. We interpret this to mean that they are generated in the brain. One possible location for their origin is provided by fMRI studies of visual imagery suggesting that V1 is activated when human subjects are instructed to inspect the fine details of an imagined visual object (Miyashita, 1995).
In 1928, Klüver (1966) organized such images into four groups called form constants: (I) tunnels and funnels, (II) spirals, (III) lattices, including honeycombs and triangles, and (IV) cobwebs, all of which contain repeated geometric structures. Figure 1 shows their appearance in the visual field. Ermentrout and Cowan (1979) provided a first account of a theory of the generation of such form constants. Here we develop and elaborate this theory in the light of the anatomical and physiological data that have accumulated since then.

1.1 Form Constants and the Retino-Cortical Map. Assuming that form constants are generated in part in V1, the first step in constructing a theory of their origin is to compute their appearance in V1 coordinates. This is done using the topographic map between the visual field and V1. It is well established that the central region of the visual field has a much bigger representation in V1 than does the peripheral field (Drasdo, 1977; Sereno
Figure 1: Hallucinatory form constants. (I) Funnel and (II) spiral images seen following ingestion of LSD (redrawn from Siegel & Jarvik, 1975), (III) honeycomb generated by marijuana (redrawn from Siegel & Jarvik, 1975), and (IV) cobweb petroglyph (redrawn from Patterson, 1992).
et al., 1995). Such a nonuniform retino-cortical magnification is generated by the nonuniform packing density of ganglion cells in the retina, whose axons in the optic nerve target neurons in the lateral geniculate nucleus (LGN), and (subsequently) in V1, that are much more uniformly packed. Let r_R = {r_R, θ_R} be the (polar) coordinates of a point in the visual field and r = {x, y} its corresponding V1 coordinates. Given a retinal ganglion cell
packing density ρ_R (cells per unit retinal area) approximated by the inverse square law,

ρ_R = 1 / (w_0 + ε r_R)²

(Drasdo, 1977), and a uniform V1 packing density ρ, we derive a coordinate transformation (Cowan, 1977) that reduces, sufficiently far away from the fovea (when r_R ≫ w_0/ε), to a scaled version of the complex logarithm (Schwartz, 1977),

x = (α/ε) ln[(ε/w_0) r_R],    y = β θ_R / ε,    (1.1)
where w_0 and ε are constants. Estimates of w_0 = 0.087 and ε = 0.051 in appropriate units can be obtained from Drasdo's data on the human retina, and α and β are constants. These values correspond to a magnification factor of 11.5 mm/degree of visual angle at the fovea and 5.75 mm/degree of visual angle at r_R = w_0/ε = 1.7 degrees of visual angle. We can also compute how the complex logarithm maps local edges or contours in the visual field, that is, its tangent map. Let φ_R be the orientation preference of a neuron in V1 whose receptive field is centered at the point r_R in the visual field, and let {r, φ} be the V1 image of {r_R, φ_R}. It can be shown (Wiener, 1994; Cowan, 1997; Bressloff, Cowan, Golubitsky, Thomas, & Wiener, 2001) that under the tangent map of the complex logarithm, φ = φ_R − θ_R. Given the retino-cortical map {r_R, φ_R} → {r, φ}, we can compute the form of logarithmic spirals, circles, and rays and their local tangents in V1 coordinates. The equation for spirals can be written as θ_R = a ln[r_R exp(−b)] + c, whence y − c = a(x − b) under the action r_R → r. Thus, logarithmic spirals become oblique lines of constant slope a in V1. Circles and rays correspond to vertical (a = ∞) and horizontal (a = 0) lines. Since the slopes of the local tangents to logarithmic spirals are given by the equation tan(φ_R − θ_R) = a, and therefore in V1 coordinates by tan φ = a, the local tangents in V1 lie parallel to the images of logarithmic spirals. It follows that types I and II form constants correspond to stripes of neural activity at various angles in V1 and types III and IV to spatially periodic patterns of local tangents in V1, as shown in Figure 2. On average, about 30 to 36 stripes and about 60 to 72 contours are perceived in the visual field, corresponding to a wavelength of about 2.4 to 3.2 mm in human V1.
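The claim that logarithmic spirals map to straight lines is easy to check numerically. The sketch below assumes the scaled complex-logarithm form of equation 1.1 with the quoted constants w_0 = 0.087 and ε = 0.051, and sets the scale constants α = β = 1 purely for illustration:

```python
import math

W0, EPS = 0.087, 0.051      # constants quoted from Drasdo's data
ALPHA = BETA = 1.0          # scale constants, set to 1 for illustration

def retino_cortical(r_R, theta_R):
    """Scaled complex logarithm of equation 1.1 (valid for r_R >> w0/eps)."""
    x = (ALPHA / EPS) * math.log(EPS * r_R / W0)
    y = BETA * theta_R / EPS
    return x, y

# A logarithmic spiral theta_R = a*ln(r_R) + c, sampled at three radii ...
a, c = 0.5, 0.2
pts = [retino_cortical(r, a * math.log(r) + c) for r in (10.0, 20.0, 40.0)]
# ... maps to a straight line, so successive slopes agree (and equal a
# when ALPHA == BETA):
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
```

Setting a = 0 (a ray) gives a horizontal image line, and letting a grow without bound (a circle) gives a vertical one, matching the text.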
This is comparable to estimates of about twice V1 hypercolumn spacing (Hubel & Wiesel, 1974) derived from the responses of human subjects to perceived grating patterns (Tyler, 1982) and from direct anatomical measurements of human V1 (Horton & Hedley-Whyte, 1984). All of these facts and calculations reinforce the suggestion that the form constants are located in V1. It might be argued that since hallucinatory images are seen as continuous across the midline, they must be located at higher levels in the visual pathway than V1. However,
Figure 2: (a) Retino-cortical transform. Logarithmic spirals in the visual field map to straight lines in V1. (b, c) The action of the map on the outlines of funnel and spiral form constants.
there is evidence that callosal connections along the V1/V2 border can act to maintain continuity of the images across the vertical meridian (Hubel & Wiesel, 1962). Thus, if hallucinatory images are generated in both halves of V1, as long as the callosal connections operate, such images will be continuous across the midline.
Figure 3: Outline of the architecture of V1 represented by equation 1.2. Local connections between iso-orientation patches within a hypercolumn are assumed to be isotropic. Lateral connections between iso-orientation patches in different hypercolumns are assumed to be anisotropic.
1.2 Symmetries of V1 Intrinsic Circuitry. In recent years, data concerning the functional organization and circuitry of V1 have accrued from microelectrode studies (Gilbert & Wiesel, 1983), labeling, and optical imaging (Blasdel & Salama, 1986; Bosking, Zhang, Schofield, & Fitzpatrick, 1997). Perhaps the most striking result is Blasdel and Salama's (1986) demonstration of the large-scale organization of iso-orientation patches in primates. We can conclude that approximately every 0.7 mm or so in V1 (in macaque), there is an iso-orientation patch of a given preference φ and that each hypercolumn has a full complement of such patches, with preferences running from φ to φ + π. The other striking result concerns the intrinsic horizontal connections in layers 2, 3, and (to some extent) 5 of V1. The axons of these connections make terminal arbors only every 0.7 mm or so along their tracks (Gilbert & Wiesel, 1983) and connect mainly to cells with similar orientation preferences (Bosking et al., 1997). In addition, there is a pronounced anisotropy to the pattern of such connections: differing iso-orientation patches preferentially connect to patches in neighboring hypercolumns in such a way as to form continuous contours under the action of the retino-cortical map described above (Gilbert & Wiesel, 1983; Bosking et al., 1997). This contrasts with the pattern of connectivity within any one hypercolumn, which is much more isotropic: any given iso-orientation patch connects locally in all directions to all neighboring patches within a radius of less than 0.7 mm (Blasdel & Salama, 1986). Figure 3 shows a diagram of such connection patterns.
Since each location in the visual field is represented in V1, roughly speaking, by a hypercolumn-sized region containing all orientations, we treat r and φ as independent variables, so that all possible orientation preferences exist at each corresponding position r_R in the visual field. This continuum approximation allows a mathematically tractable treatment of V1 as a lattice of hypercolumns. Because the actual hypercolumn spacing (Hubel & Wiesel, 1974) is about half the effective cortical wavelength of many of the images we wish to study, care must be taken in interpreting cortical activity patterns as visual percepts. In addition, it is possible that lattice effects could introduce deviations from the continuum model. Within the continuum framework, let w(r, φ | r′, φ′) be the strength or weight of connections from the iso-orientation patch at {x′, y′} = r′ in V1 with orientation preference φ′ to the patch at {x, y} = r with preference φ. We decompose w in terms of local connections from elements within the same hypercolumn and patchy lateral connections from elements in other hypercolumns, that is,

w(r, φ | r′, φ′) = w_LOC(φ − φ′) δ(r − r′) + β w_LAT(s) δ(r − r′ − s e_φ) δ(φ − φ′),    (1.2)

where δ(·) is the Dirac delta function, e_φ is a unit vector in the φ-direction, β is a parameter that measures the weight of lateral relative to local connections, and w_LAT(s) is the weight of lateral connections between iso-orientation patches separated by a cortical distance s along a visuotopic axis parallel to their orientation preference. The fact that the lateral weight depends on a rotated vector expresses its anisotropy. Observations by Hirsch and Gilbert (1991) suggest that β is small and therefore that the lateral connections modulate rather than drive V1 activity. It can be shown (Wiener, 1994; Cowan, 1997; Bressloff et al., 2001) that the weighting function defined in equation 1.2 has a well-defined symmetry: it is invariant with respect to certain operations in the plane of V1: translations {r, φ} → {r + u, φ}, reflections {x, y, φ} → {x, −y, −φ}, and a rotation defined as {r, φ} → {R_θ[r], φ + θ}, where R_θ[r] is the vector r rotated by the angle θ. This form of the rotation operation follows from the anisotropy of the lateral weighting function and comprises a translation or shift of the orientation preference label φ to φ + θ, together with a rotation or twist of the position vector r by the angle θ. Such a shift-twist operation (Zweck & Williams, 2001) provides a novel way to generate the Euclidean group E(2) of rigid motions in the plane. The fact that the weighting function is invariant with respect to this form of E(2) has important consequences for any model of the dynamics of V1. In particular, the equivariant branching lemma (Golubitsky & Schaeffer, 1985) guarantees that when the homogeneous state a(r, φ) = 0 becomes unstable, new states with the symmetry of certain subgroups of E(2) can arise. These new states will be linear combinations of patterns we call planforms.
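The shift-twist action described above is straightforward to realize numerically; the helper below (our own sketch, not code from the text) checks the group property that two shift-twists compose into a single shift-twist through the summed angle:

```python
import math

def shift_twist(theta, r, phi):
    """{r, phi} -> {R_theta[r], phi + theta}: rotate the position vector r
    by theta and shift the orientation label phi by the same angle."""
    x, y = r
    rx = x * math.cos(theta) - y * math.sin(theta)
    ry = x * math.sin(theta) + y * math.cos(theta)
    return (rx, ry), (phi + theta) % math.pi   # orientations live in [0, pi)

# Composing a shift-twist by 0.7 with one by 0.4 ...
r0, phi0 = (1.0, 0.0), 0.3
r1, phi1 = shift_twist(0.4, *shift_twist(0.7, r0, phi0))
# ... equals a single shift-twist by 1.1, as required of a group action.
r2, phi2 = shift_twist(1.1, r0, phi0)
```

The same style of check can be extended to the translations and reflections to verify the full E(2) structure.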
480
P. C. Bressloff et al.
2 A Model of the Dynamics of V1

Let a(r, φ, t) be the average membrane potential or activity in an iso-orientation patch at the point r with orientation preference φ. The activity variable a(r, φ, t) evolves according to a generalization of the Wilson-Cowan equations (Wilson & Cowan, 1973) that incorporates an additional degree of freedom to represent orientation preference and the continuum limit of the weighting function defined in equation 1.2:

∂a(r, φ, t)/∂t = −α a(r, φ, t) + (μ/π) ∫_0^π w_LOC(φ − φ′) σ[a(r, φ′, t)] dφ′
        + ν ∫_−∞^∞ w_LAT(s) σ[a(r + s e_φ, φ, t)] ds,   (2.1)
where α, μ, and ν = μβ are, respectively, time and coupling constants, and σ[a] is a smooth sigmoidal function of the activity a. It remains only to specify the form of the functions w_LOC(φ) and w_LAT(s). In the single population model considered here, we do not distinguish between excitatory and inhibitory neurons and therefore assume both w_LOC(φ) and w_LAT(s) to be "Mexican hats" of the generic form,

σ_1⁻¹ exp(−x²/2σ_1²) − σ_2⁻¹ exp(−x²/2σ_2²),   (2.2)

with Fourier transform

W(p) = exp(−σ_1² p²) − exp(−σ_2² p²),   (σ_1 < σ_2),

such that iso-orientation patches at nearby locations mutually excite if they have similar orientation preferences, and inhibit otherwise, and similar iso-orientation patches at locations at least r_0 mm apart mutually inhibit. In models that distinguish between excitatory and inhibitory populations, it is possible to achieve similar results using inverted Mexican hat functions that incorporate short-range inhibition and long-range excitation. Such models may provide a better fit to anatomical data (Levitt, Lund, & Yoshioka, 1996). An equation F[a] = 0 is said to be equivariant with respect to a symmetry operator γ if it commutes with it, that is, if γF[a] = F[γa]. It follows from the Euclidean symmetry of the weighting function w(r, φ | r′, φ′) with respect to the shift-twist action that equation 2.1 is equivariant with respect to it. This has important implications for the dynamics of our model of V1. Let a(r, φ) = 0 be a homogeneous stationary state of equation 2.1 that depends smoothly on the coupling parameter μ. When no iso-orientation patches are activated, it is easy to show that a(r, φ) = 0 is stable for all values of μ less than a critical value μ_0 (Ermentrout, 1998). However, the parameter μ can reach μ_0 if, for example, the excitability of V1 is increased by the action of hallucinogens on brain stem nuclei such as the locus ceruleus or the
raphe nucleus, which secrete the monoamines serotonin and noradrenalin. In such a case, the homogeneous stationary state a(r, φ) = 0 becomes unstable. If μ remains close to μ_0, then new stationary states develop that are approximated by (finite) linear combinations of the eigenfunctions of the linearized version of equation 2.1. The equivariant branching lemma (Golubitsky & Schaeffer, 1985) then guarantees the existence of new states with the symmetry of certain subgroups of the Euclidean group E(2). This mechanism, by which dynamical systems with Mexican hat interactions can generate spatially periodic patterns that displace a homogeneous equilibrium, was first introduced by Turing (1952) in his article on the chemical basis of morphogenesis.

2.1 Eigenfunctions of V1. Computing the eigenvalues and associated eigenfunctions of the linearized version of equation 2.1 is nontrivial, largely because of the presence of the anisotropic lateral connections. However, given that lateral effects are only modulatory, so that β ≪ 1, degenerate Rayleigh-Schrödinger perturbation theory can be used to compute them (Bressloff et al., 2001). The results can be summarized as follows. Let s_1 be the slope of σ[a] at the stationary equilibrium a = 0. Then to lowest order in β, the eigenvalues λ± are

λ±(p, q) = −α + μ s_1 [W_LOC(p) + β{W_LAT(0, q) ± W_LAT(2p, q)}],   (2.3)
where W_LOC(p) is the pth Fourier mode of w_LOC(φ) and

W_LAT(p, q) = (−1)^p ∫_0^∞ w_LAT(s) J_2p(qs) ds

is the Fourier transform of the lateral weight distribution, with J_2p(x) the Bessel function of x of integer order 2p. To these eigenvalues belong the associated eigenfunctions

ν±(r, φ) = c_n u±(φ − φ_n) exp[i k_n · r] + c_n* u±*(φ − φ_n) exp[−i k_n · r],   (2.4)

where k_n = {q cos φ_n, q sin φ_n} and (to lowest order in β) u±(φ) = cos 2pφ or sin 2pφ, and the * operator is complex conjugation. These functions will be recognized as plane waves modulated by even or odd phase-shifted π-periodic functions cos[2p(φ − φ_n)] or sin[2p(φ − φ_n)]. The relation between the eigenvalues λ± and the wave numbers p and q given in equation 2.3 is called a dispersion relation. In case λ± = 0 at μ = μ_0, equation 2.3 can be rewritten as

μ_c = α / (s_1 [W_LOC(p) + β{W_LAT(0, q) + W_LAT(2p, q)}])   (2.5)
Figure 4: Dispersion curves showing the marginal stability of the homogeneous mode of V1. Solid lines correspond to odd eigenfunctions, dashed lines to even ones. If V1 is in the Hubel-Wiesel mode (p_c = 0), eigenfunctions in the form of unmodulated plane waves or roll patterns can appear first at the wave numbers (a) q_c = 0 and (b) q_c = 1. If V1 is in the coupled-ring mode (p_c = 1), odd modulated eigenfunctions can form first at the wave numbers (c) q_c = 0 and (d) q_c = 1.
and gives rise to the marginal stability curves plotted in Figure 4. Examination of these curves indicates that the homogeneous state a(r, φ) = 0 can lose its stability in four differing ways, as shown in Table 1.

Table 1: Instabilities of V1.

Wave Numbers        Local Interactions   Lateral Interactions   Eigenfunction
p_c = 0, q_c = 0    Excitatory           Excitatory             c_n
p_c = 0, q_c ≠ 0    Excitatory           Mexican hat            c_n cos[k_n · r]
p_c ≠ 0, q_c = 0    Mexican hat          Excitatory             c_n u±(φ − φ_n) + c_n* u±*(φ − φ_n)
p_c ≠ 0, q_c ≠ 0    Mexican hat          Mexican hat            c_n u±(φ − φ_n) cos[k_n · r]
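The dispersion relation of equation 2.3 can be evaluated numerically. The sketch below (all parameter values are illustrative, and `W_LOC_p` is simply an assumed constant standing in for the Fourier mode W_LOC(p)) builds the Mexican hat of equation 2.2, computes W_LAT(p, q) by quadrature using Bessel's integral for J_n, and checks that the lateral connections split the even (+) and odd (−) eigenvalue branches:

```python
import numpy as np

def mexican_hat(x, s1=0.5, s2=1.5):
    """Difference-of-gaussians weight of eq 2.2 (requires sigma_1 < sigma_2)."""
    return np.exp(-x**2 / (2*s1**2)) / s1 - np.exp(-x**2 / (2*s2**2)) / s2

def bessel_j(n, x, K=2000):
    """J_n(x) from Bessel's integral, (1/pi) * int_0^pi cos(n t - x sin t) dt."""
    t = (np.arange(K) + 0.5) * np.pi / K          # midpoint rule on [0, pi]
    return np.mean(np.cos(n * t - x * np.sin(t)))

def W_LAT(p, q, S=12.0, K=400):
    """(-1)^p * int_0^inf w_LAT(s) J_{2p}(q s) ds, upper limit truncated at S."""
    s = (np.arange(K) + 0.5) * S / K
    return (-1)**p * sum(mexican_hat(si) * bessel_j(2*p, q*si) for si in s) * (S / K)

def lam(p, q, sign, alpha=1.0, mu=1.5, s1=1.0, beta=0.1, W_LOC_p=0.5):
    """Eigenvalues lambda_pm(p, q) of eq 2.3 (sign = +1 even, -1 odd)."""
    return -alpha + mu * s1 * (W_LOC_p + beta * (W_LAT(0, q) + sign * W_LAT(2*p, q)))

# The hat excites at short range and inhibits at long range.
assert mexican_hat(0.0) > 0 > mexican_hat(4.0)
# Sanity checks on the Bessel quadrature: J_0(0) = 1, J_2(0) = 0.
assert np.isclose(bessel_j(0, 0.0), 1.0, atol=1e-6)
assert abs(bessel_j(2, 0.0)) < 1e-6
# The lateral term splits the even and odd branches, as in Figure 4.
assert lam(1, 1.0, +1) != lam(1, 1.0, -1)
```

Scanning `lam` over q for fixed p (and over p at q = 0) reproduces the kind of marginal stability analysis summarized in Table 1.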
We call responses with p_c = 0 (rows 1 and 2) the Hubel-Wiesel mode of V1 operation. In such a mode, any orientation tuning must be extrinsic to V1, generated, for example, by local anisotropies of the geniculo-cortical map (Hubel & Wiesel, 1962). We call responses with p_c ≠ 0 (rows 3 and 4) the coupled ring mode of V1 operation (Somers, Nelson, & Sur, 1996; Hansel & Sompolinsky, 1997; Mundel, Dimitrov, & Cowan, 1997). In both cases, the model reduces to a system of coupled hypercolumns, each modeled as a ring of interacting iso-orientation patches with local excitation and inhibition. Even if the lateral interactions are weak (β ≪ 1), the orientation preferences become spatially organized in a pattern given by some combination of the eigenfunctions ν±(r, φ).

3 Planforms and V1

It remains to compute the actual patterns of V1 activity that develop when the uniform state loses stability. These patterns will be linear combinations of the eigenfunctions described above, which we call planforms. To compute them, we find those planforms that are left invariant by the axial subgroups of E(2) under the shift-twist action. An axial subgroup is a restricted set of the symmetry operators of a group, whose action leaves invariant a one-dimensional vector subspace containing essentially one planform. We limit these planforms to regular tilings of the plane (Golubitsky, Stewart, & Schaeffer, 1988), by restricting our computation to doubly periodic planforms. Given such a restriction, there are only a finite number of shift-twists to consider (modulo an arbitrary rotation of the whole plane), and the axial subgroups and their planforms have either rhombic, square, or hexagonal symmetry. To determine which planforms are stabilized by the nonlinearities of the system, we use Lyapunov-Schmidt reduction (Golubitsky et al., 1988) and Poincaré-Lindstedt perturbation theory (Walgraef, 1997) to reduce the dynamics to a set of nonlinear equations for the amplitudes c_n in equation 2.4.
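A planform of the kind just described can be evaluated directly. The following sketch (amplitudes and wave vectors chosen by hand for illustration, not derived from the axial-subgroup computation) builds a doubly periodic combination of even p = 1 eigenfunctions on a square lattice and extracts, at each point r, the orientation most strongly signaled there:

```python
import numpy as np

# Two wave vectors at right angles (square lattice) with equal real amplitudes.
q = 2 * np.pi                              # critical wave number (illustrative)
phi_n = [0.0, np.pi / 2]
k = [q * np.array([np.cos(t), np.sin(t)]) for t in phi_n]
c = [1.0, 1.0]

xs = np.linspace(0.0, 1.0, 32)
phis = np.arange(16) * np.pi / 16          # orientation labels in [0, pi)
X, Y = np.meshgrid(xs, xs, indexing='ij')

def a(phi):
    """Planform activity a(r, phi) on the grid for one orientation label."""
    total = np.zeros_like(X)
    for cn, tn, kn in zip(c, phi_n, k):
        # even eigenfunction u+(phi - phi_n) cos(k_n . r) with p = 1
        total += cn * np.cos(2 * (phi - tn)) * np.cos(kn[0] * X + kn[1] * Y)
    return total

stack = np.stack([a(p) for p in phis])      # shape: (orientations, Nx, Ny)
phi_star = phis[np.argmax(stack, axis=0)]   # dominant orientation at each r

# The dominant label varies periodically across space, so the "contour
# elements" trace out a doubly periodic lattice pattern, not a uniform field.
assert stack.shape == (16, 32, 32)
assert phi_star.min() != phi_star.max()
```

Plotting a short line segment at angle `phi_star` at each grid point is exactly the visualization procedure used for the contoured planforms described below.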
Euclidean symmetry restricts the structure of the amplitude equations to the form (on rhombic and square lattices)

dc_n/dt = c_n [(μ − μ_0) − γ_0 |c_n|² − 2 Σ_{m≠n} γ_nm |c_m|²],   (3.1)
where γ_nm depends on Δφ, the angle between the two eigenvectors of the lattice. We find that all rhombic planforms with 30° ≤ Δφ ≤ 60° are stable. Similar equations obtain for hexagonal lattices. In general, all such planforms on rhombic and hexagonal lattices are stable (in appropriate parameter regimes) with respect to perturbations with the same lattice structure. However, square planforms on square lattices are unstable and give way to simple roll planforms comprising one eigenfunction.
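The amplitude equations 3.1 are straightforward to integrate numerically. A minimal sketch for two modes (the γ values are illustrative, not those computed in the reduction): for sufficiently weak cross-coupling, both modes settle onto the symmetric fixed point with |c_n|² = (μ − μ_0)/(γ_0 + 2γ_nm).

```python
import numpy as np

# Euler integration of eq 3.1 for two modes; parameter values illustrative.
mu_minus_mu0, g0, g12 = 0.5, 1.0, 0.3
c = np.array([0.1 + 0.05j, 0.08 - 0.02j])      # small initial amplitudes
dt = 0.01
for _ in range(20000):
    coupling = 2 * g12 * np.abs(c[::-1])**2    # 2 * sum_{m != n} gamma_nm |c_m|^2
    c = c + dt * c * (mu_minus_mu0 - g0 * np.abs(c)**2 - coupling)

# With 2*g12 = 0.6 < g0 = 1 the symmetric state is stable, so both modes
# equilibrate at |c_n|^2 = (mu - mu_0) / (g0 + 2*g12).
target = mu_minus_mu0 / (g0 + 2 * g12)
assert np.allclose(np.abs(c)**2, target, rtol=1e-3)
```

Raising the cross-coupling γ_nm above γ_0 destabilizes the symmetric state in favor of a single surviving mode, which is the mechanism behind squares giving way to rolls.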
Figure 5: V1 planforms corresponding to some axial subgroups. (a, b) Roll and hexagonal patterns generated in V1 when it is in the Hubel-Wiesel operating mode. (c, d) Honeycomb and square patterns generated when V1 is in the coupled-ring operating mode.
3.1 Form Constants as Stable Planforms. Figure 5 shows some of the (mostly) stable planforms defined above in V1 coordinates and Figure 6 in visual field coordinates, generated by plotting at each point r a contour element at the orientation φ most strongly signaled at that point, that is, the one for which a(r, φ) is largest. These planforms are either contoured or noncontoured and correspond closely to Klüver's form constants. In what we call the Hubel-Wiesel mode, interactions between iso-orientation patches in a single hypercolumn are weak and purely excitatory, so individual hypercolumns do not amplify any particular orientation. However, if the long-range interactions are stronger
Figure 6: The same planforms as shown in Figure 5 drawn in visual field coordinates. Note, however, that the feathering around the edges of the bright blobs in b is a fortuitous numerical artifact.
and effectively inhibitory, then plane waves of cortical activity can emerge, with no label for orientation preference. The resulting planforms are called noncontoured and correspond to the types I and II form constants, as originally proposed by Ermentrout and Cowan (1979). On the other hand, if coupled-ring-mode interactions between neighboring iso-orientation patches are strong enough so that even weakly biased activity can trigger a sharply tuned response, under the combined action of many interacting hypercolumns, plane waves labeled for orientation preference can emerge. The resulting planforms correspond to types III and IV form constants. These results suggest that the circuits in V1 that are normally involved in the detection of oriented edges and the formation of contours are also responsible
for the generation of the form constants. Since 20% of the (excitatory) lateral connections in layers 2 and 3 of V1 end on inhibitory interneurons (Hirsch & Gilbert, 1991), the overall action of the lateral connections can become inhibitory, especially at high levels of activity. The mathematical consequence of this inhibition is the selection of odd planforms that do not form continuous contours. We have shown that even planforms that do form continuous contours can be selected when the overall action of the lateral connection is inhibitory, if the model assumes deviation away from the visuotopic axis by at least 45 degrees in the pattern of lateral connections (Bressloff et al., 2001) (as opposed to our first model, in which connections between iso-orientation patches are oriented along the visuotopic axis). This might seem paradoxical given observations suggesting that there are at least two circuits in V1: one dealing with contrast edges, in which the relevant lateral connections have the anisotropy found by Bosking et al. (1997), and others that might be involved with the processing of textures, surfaces, and color contrast, which seem to have a much more isotropic lateral connectivity (Livingstone & Hubel, 1984). However, there are several intriguing possibilities that make the analysis more plausible. The first can occur in case V1 operates in the Hubel-Wiesel mode (p_c = 0). In such an operating mode, if the lateral interactions are not as weak as we have assumed in our analysis, then even contoured planforms can form.
The second related possibility can occur if, at low levels of activity, V1 operates initially in the Hubel-Wiesel mode and the lateral and long-range interactions are all excitatory (p_c = q_c = 0), so that a bulk instability occurs if the homogeneous state becomes unstable. This can be followed by the emergence of patterned noncontoured planforms at the critical wavelength of about 2.67 mm when the level of activity rises and the longer-ranged inhibition is activated (p_c = 0, q_c ≠ 0), until V1 operates in the coupled-ring mode when the short-range inhibition is also activated (p_c ≠ 0, q_c ≠ 0). In such a case, even planforms can be selected by the anisotropic connectivity, since the degeneracy has already been broken by the noncontoured planforms. A third possibility is related to an important gap in our current model: we have not properly incorporated the observation that the distribution of orientation preferences in V1 is not spatially homogeneous. In fact, the well-known orientation "pinwheel" singularities originally found by Blasdel and Salama (1986) turn out to be regions in which the scatter or rate of change |Δφ/Δr| of differing orientation preference labels is high (about 20 degrees), whereas in linear regions it is only about 10 degrees (Maldonado & Gray, 1996). Such a difference leads to a second model circuit (centered on the orientation pinwheels) with a lateral patchy connectivity that is effectively isotropic (Yan, Dimitrov, & Cowan, 2001) and therefore consistent with observations of the connectivity of pinwheel regions (Livingstone & Hubel, 1984). In such a case, even planforms are selected in the coupled-ring mode. Types I and II noncontoured form constants could also arise from a "filling-in" process similar to that embodied in the retinex algorithm of Land and McCann (1971) following the generation of types III or IV form constants in the coupled-ring mode.

Figure 7: (a) Lattice tunnel hallucination seen following the taking of marijuana (redrawn from Siegel & Jarvik, 1975, with the permission of Alan D. Iselin). (b) Simulation of the lattice tunnel generated by an even hexagonal roll pattern on a square lattice.

The model allows for oscillating or propagating waves, as well as stationary patterns, to arise from a bulk instability. Such waves are in fact observed: many subjects who have taken LSD and similar hallucinogens report seeing bright white light at the center of the visual field, which then "explodes" into a hallucinatory image (Siegel, 1977) in at most 3 seconds, corresponding to a propagation velocity in V1 of about 2.5 cm per second, suggestive of slowly moving epileptiform activity (Senseman, 1999). In this respect, it is worth noting that in the continuum approximation we use in this study, both planforms arising when the long-range interactions are purely excitatory correspond to the perception of a uniform bright white light.

Finally, it should be emphasized that many variants of the Klüver form constants have been described, some of which cannot be understood in terms of the model we have introduced. For example, the lattice tunnel shown in Figure 7a is more complicated than any of the simple form constants shown earlier. One intriguing possibility is that this image is generated as a result of a mismatch between the corresponding planform and the underlying structure of V1. We have (implicitly) assumed that V1 has patchy connections that endow it with lattice properties. It is clear from the optical image data (Blasdel & Salama, 1986; Bosking et al., 1997) that the cortical lattice is somewhat disordered. Thus, one might expect some distortions to occur when planforms are spontaneously generated in such
a lattice. Figure 7b shows a computation of the appearance in the visual field of a roll pattern on a hexagonal lattice represented on a square lattice, so that there is a slight incommensurability between the two. The resulting pattern matches the hallucinatory image shown in Figure 7a quite well.

4 Discussion

This work extends previous work on the cortical mechanisms underlying visual hallucinations (Ermentrout & Cowan, 1979) by explicitly considering cells with differing orientation preferences and by considering the action of the retino-cortical map on oriented edges and local contours. This is carried out by the tangent map associated with the complex logarithm, one consequence of which is that φ, the V1 label for orientation preference, is not exactly equal to orientation preference in the visual field, φ_R, but differs from it by the angle θ_R, the polar angle of receptive field position. It follows that elements tuned to the same angle φ should be connected along lines at that angle in V1. This is consistent with the observations of Blasdel and Sincich (personal communication, 1999) and Bosking et al. (1997) and with one of Mitchison and Crick's (1982) hypotheses on the lateral connectivity of V1. Note that near the vertical meridian (where most of the observations have been made), changes in φ approximate closely changes in φ_R. However, such changes should be relatively large and detectable with optical imaging near the horizontal meridian. Another major feature outlined in this article is the presumed Euclidean symmetry of V1. Many systems exhibit Euclidean symmetry. What is novel here is the way in which such a symmetry is generated: by a shift {r, φ} → {r + s, φ} followed by a twist {r, φ} → {R_θ[r], φ + θ}, as well as by the usual translations and reflections. It is the twist φ → φ + θ that is novel and is required to match the observations of Bosking et al. (1997).
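The tangent-map statement above is easy to verify: differentiating log z multiplies a tangent vector by d(log z)/dz = 1/z, which rotates it by minus the polar angle θ_R of the visual-field position, so a line element at visual-field orientation φ_R maps to a cortical line element at angle φ_R − θ_R. A minimal numerical check (all values illustrative):

```python
import numpy as np

def cortical_angle(z, phi_R):
    """Orientation of the image of a line element at angle phi_R under z -> log z.

    The tangent map multiplies the tangent vector by 1/z, rotating it
    by -arg(z), i.e. by minus the polar angle theta_R of the position z.
    """
    v = np.exp(1j * phi_R)           # unit line element at visual-field angle phi_R
    return np.angle(v / z)

theta_R = 0.8                        # polar angle of the receptive-field position
z = 2.0 * np.exp(1j * theta_R)       # visual-field position in complex notation
phi_R = 0.3                          # visual-field orientation preference

phi_V1 = cortical_angle(z, phi_R)
assert np.isclose(phi_V1, phi_R - theta_R)
```

Near the vertical meridian θ_R stays close to ±π/2 along iso-eccentricity lines, which is why changes in φ there closely track changes in φ_R, as noted above.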
In this respect, it is interesting that Zweck and Williams (2001) recently introduced a set of basis functions with the same shift-twist symmetry as part of an algorithm to implement contour completion. Their reason for doing so is to bind sparsely distributed receptive fields together functionally, so as to perform Euclidean invariant computations. It remains to explain the precise relationship between the Euclidean invariant circuits we have introduced here and Euclidean invariant receptive field models. Finally, we note that our analysis indicates the possibility that V1 can operate in two different dynamical modes, the Hubel-Wiesel mode or the coupled-ring mode, depending on the levels of excitability of its various excitatory and inhibitory cell populations. We have shown that the Hubel-Wiesel mode can spontaneously generate noncontoured planforms and the coupled-ring mode contoured ones. Such planforms are seen as hallucinatory images in the visual field, and their geometry is a direct consequence of the architecture of V1. We conjecture that the ability of V1 to process edges,
contours, surfaces, and textures is closely related to the existence of these two modes.

Acknowledgments

We thank A. G. Dimitrov, T. Mundel, and G. Blasdel and the referees for many helpful discussions. The work was supported by the Leverhulme Trust (P.C.B.), the James S. McDonnell Foundation (J.D.C.), and the National Science Foundation (M.G.). P. C. B. thanks the Mathematics Department, University of Chicago, for its hospitality and support. M. G. thanks the Center for Biodynamics, Boston University, for its hospitality and support. J. D. C. and P. C. B. thank G. Hinton and the Gatsby Computational Neurosciences Unit, University College, London, for hospitality and support.

References

Blackmore, S. J. (1992). Beyond the body: An investigation of out-of-the-body experiences. Chicago: Academy of Chicago.
Blasdel, G., & Salama, G. (1986). Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321(6070), 579–585.
Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J. Neurosci., 17, 2112–2127.
Bressloff, P. C., Cowan, J. D., Golubitsky, M., Thomas, P. J., & Wiener, M. C. (2001). Geometric visual hallucinations, Euclidean symmetry, and the functional architecture of striate cortex. Phil. Trans. Roy. Soc. Lond. B, 356, 299–330.
Clottes, J., & Lewis-Williams, D. (1998). The shamans of prehistory: Trance and magic in the painted caves. New York: Abrams.
Collier, B. B. (1972). Ketamine and the conscious mind. Anaesthesia, 27, 120–134.
Cowan, J. D. (1977). Some remarks on channel bandwidths for visual contrast detection. Neurosciences Research Program Bull., 15, 492–517.
Cowan, J. D. (1997). Neurodynamics and brain mechanisms. In M. Ito, Y. Miyashita, & E. T. Rolls (Eds.), Cognition, computation, and consciousness (pp. 205–233). New York: Oxford University Press.
Drasdo, N. (1977). The neural representation of visual space.
Nature, 266, 554–556.
Dybowski, M. (1939). Conditions for the appearance of hypnagogic visions. Kwart. Psychol., 11, 68–94.
Ermentrout, G. B. (1998). Neural networks as spatial pattern forming systems. Rep. Prog. Phys., 61, 353–430.
Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biol. Cybernetics, 34, 137–150.
Gilbert, C. D., & Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3, 1116–1133.
Golubitsky, M., & Schaeffer, D. G. (1985). Singularities and groups in bifurcation theory I. Berlin: Springer-Verlag.
Golubitsky, M., Stewart, I., & Schaeffer, D. G. (1988). Singularities and groups in bifurcation theory II. Berlin: Springer-Verlag.
Hansel, D., & Sompolinsky, H. (1997). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods of neuronal modeling (2nd ed., pp. 499–567). Cambridge, MA: MIT Press.
Helmholtz, H. (1925). Physiological optics (Vol. 2). Rochester, NY: Optical Society of America.
Hirsch, J. D., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci., 11, 1800–1809.
Horton, J. C., & Hedley-Whyte, E. T. (1984). Mapping of cytochrome oxidase patches and ocular dominance columns in human visual cortex. Phil. Trans. Roy. Soc. B, 304, 255–272.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (Lond.), 160, 106–154.
Hubel, D. H., & Wiesel, T. N. (1974). Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol., 158, 267–294.
Klüver, H. (1966). Mescal and mechanisms of hallucinations. Chicago: University of Chicago Press.
Land, E., & McCann, J. (1971). Lightness and retinex theory. J. Opt. Soc. Am., 61(1), 1–11.
Levitt, J. B., Lund, J., & Yoshioka, T. (1996). Anatomical substrates for early stages in cortical processing of visual information in the macaque monkey. Behavioral Brain Res., 76, 5–19.
Livingstone, M. S., & Hubel, D. H. (1984). Specificity of intrinsic connections in primate primary visual cortex. J. Neurosci., 4, 2830–2835.
Maldonado, P., & Gray, C. (1996). Heterogeneity in local distributions of orientation-selective neurons in the cat primary visual cortex. Visual Neuroscience, 13, 509–516.
Mavromatis, A. (1987). Hypnagogia: The unique state of consciousness between wakefulness and sleep. London: Routledge & Kegan Paul.
Mitchison, G., & Crick, F. (1982).
Long axons within the striate cortex: Their distribution, orientation, and patterns of connection. Proc. Nat. Acad. Sci. (USA), 79, 3661–3665.
Miyashita, Y. (1995). How the brain creates imagery: Projection to primary visual cortex. Science, 268, 1719–1720.
Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Visual cortex circuitry and orientation tuning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 887–893). Cambridge, MA: MIT Press.
Patterson, A. (1992). Rock art symbols of the greater southwest. Boulder, CO: Johnson Books.
Purkinje, J. E. (1918). Opera omnia (Vol. 1). Prague: Society of Czech Physicians.
Schwartz, E. (1977). Spatial mapping in the primate sensory projection: Analytic structure and relevance to projection. Biol. Cybernetics, 25, 181–194.
Senseman, D. M. (1999). Spatiotemporal structure of depolarization spread in cortical pyramidal cell populations evoked by diffuse retinal light flashes. Visual Neuroscience, 16, 65–79.
Sereno, M. I., Dale, A. M., Reppas, J. B., Kwong, K. K., Belliveau, J. W., Brady, T. J., Rosen, B. R., & Tootell, R. B. H. (1995). Borders of multiple visual areas in humans revealed by functional magnetic resonance imaging. Science, 268, 889–893.
Siegel, R. K. (1977). Hallucinations. Scientific American, 237(4), 132–140.
Siegel, R. K., & Jarvik, M. E. (1975). Drug-induced hallucinations in animals and man. In R. K. Siegel & L. J. West (Eds.), Hallucinations: Behavior, experience and theory (pp. 81–161). New York: Wiley.
Smythies, J. R. (1960). The stroboscopic patterns III: Further experiments and discussion. Brit. J. Psychol., 51(3), 247–255.
Somers, D., Nelson, S., & Sur, M. (1996). An emergent model of orientation selectivity in cat visual cortex simple cells. J. Neurosci., 15(8), 5448–5465.
Turing, A. M. (1952). The chemical basis of morphogenesis. Phil. Trans. Roy. Soc. Lond. B, 237, 32–72.
Tyler, C. W. (1978). Some new entoptic phenomena. Vision Res., 181, 1633–1639.
Tyler, C. W. (1982). Do grating stimuli interact with the hypercolumn spacing in the cortex? Suppl. Invest. Ophthalmol., 222, 254.
Walgraef, D. (1997). Spatio-temporal pattern formation. New York: Springer-Verlag.
Wiener, M. C. (1994). Hallucinations, symmetry, and the structure of primary visual cortex: A bifurcation theory approach. Unpublished doctoral dissertation, University of Chicago.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Yan, C.-P., Dimitrov, A., & Cowan, J. (2001). Spatial inhomogeneity in the visual cortex and the statistics of natural images. Unpublished manuscript.
Zubek, J. (1969). Sensory deprivation: Fifteen years of research. New York: Appleton-Century-Crofts.
Zweck, J. W., & Williams, L. R. (2001). Euclidean group invariant computation of stochastic completion fields using shiftable-twistable functions. Manuscript submitted for publication.
Received December 20, 2000; accepted April 26, 2001.
LETTER
Communicated by Bard Ermentrout
An Amplitude Equation Approach to Contextual Effects in Visual Cortex

Paul C. Bressloff
[email protected] Department of Mathematics, University of Utah, Salt Lake City, Utah 84112, U.S.A.
Jack D. Cowan
[email protected] Mathematics Department, University of Chicago, Chicago, IL 60637, U.S.A.
A mathematical theory of interacting hypercolumns in primary visual cortex (V1) is presented that incorporates details concerning the anisotropic nature of long-range lateral connections. Each hypercolumn is modeled as a ring of interacting excitatory and inhibitory neural populations with orientation preferences over the range 0 to 180 degrees. Analytical methods from bifurcation theory are used to derive nonlinear equations for the amplitude and phase of the population tuning curves in which the effective lateral interactions are linear in the amplitudes. These amplitude equations describe how mutual interactions between hypercolumns via lateral connections modify the response of each hypercolumn to modulated inputs from the lateral geniculate nucleus; such interactions form the basis of contextual effects. The coupled ring model is shown to reproduce a number of orientation-dependent and contrast-dependent features observed in center-surround experiments. A major prediction of the model is that the anisotropy in lateral connections results in a nonuniform modulatory effect of the surround that is correlated with the orientation of the center.

1 Introduction

The discovery that a majority of neurons in the visual or striate cortex of cats and primates (usually referred to as V1) respond selectively to the local orientation of visual contrast patterns (Hubel & Wiesel, 1962) initiated many studies of the precise circuitry underlying this property. Two cortical circuits have been fairly well characterized. There is a local circuit operating at subhypercolumn dimensions ( 0. In terms of the orientation tuning properties observed in real neurons, the most relevant cases are p = 0 and p = 1. The former corresponds to a bulk instability within a hypercolumn in which the new steady state exhibits no orientation preference.
We call such a response the Hubel-Wiesel mode since any orientation tuning must be extrinsic to V1, generated, for example, by local anisotropies of the geniculo-cortical map (Hubel & Wiesel, 1962). A bulk instability will occur when the local inhibition is sufficiently weak. In the second case, each hypercolumn supports an activity profile consisting of a solitary peak centered about the angle φ(r_j), that is, the population response is characterized by an orientation tuning curve. One mechanism that generates a p > 0 mode is if the local connections comprise short-range excitation and longer-range inhibition, which would be the case if the inhibitory neurons are identified with basket cells (I = Ba) whose lateral axonal spread has a space constant of about 250 μm. However, it is also possible within a two-population model to generate orientation tuning in the presence of short-range inhibition, as would occur if the inhibitory neurons are identified with local interneurons (I = Ma) instead, whose lateral axonal spread is about 20 μm (see Figure 1). Example spectra W_n^+ for both of these cases are plotted in Figure 4 for gaussian coefficients

W_lm(n) = √(2π) ξ_lm a_lm e^(−n² ξ_lm² / 2),   (3.15)
Amplitude Equation Approach
503
Figure 4: (a) Distribution of eigenvalues W_n^+ as a function of n in the case of long-range inhibition. Parameter values are ξ_EE = ξ_IE = 15°, ξ_II = ξ_EI = 60°, a_EE = 1, a_EI a_IE = 0.3, a_II = 0. (b) Distribution of eigenvalues W_n^+ as a function of n in the case of short-range inhibition. Parameter values are ξ_EE = 35°, ξ_IE = 40°, ξ_II = ξ_EI = 5°, a_EE = 0.5, a_EI a_IE = 0.5, a_II = 1. Solid line shows the real part and dashed line the imaginary part.
where ξ_lm determine the range of the axonal fields of the excitatory and inhibitory populations. For both examples, it can be seen that Re W_1^+ > Re W_n^+ for all n ≠ 1, so that the n = 1 mode becomes marginally stable first, leading to the spontaneous formation of sharp orientation tuning curves. (Note that if Im W_1^+ ≠ 0 at the critical point, then the homogeneous resting state bifurcates to an orientation tuning curve whose peak spontaneously either rotates as a traveling wave or pulsates as a standing wave at a frequency determined by Im λ_1^+; Ben-Yishai et al., 1997; Bressloff et al., 2000. We consider the time-periodic case in Section 3.3.) The location of the peak φ*(r_j) of the tuning curve at r_j is arbitrary in the presence of φ-independent inputs h_l(r_j, φ) = h̄_l(r_j). However, the inclusion of an additional small-amplitude input Δh_l(r_j, φ) ∝ cos[2(φ − Ω_j)] breaks the rotational invariance of the system and locks the location of the tuning curve to the orientation corresponding to the peak of the stimulus, that is, φ*(r_j) = Ω_j. This is illustrated in Figure 5, where the input and output
504
Paul C. Bressloff and Jack D. Cowan
Figure 5: Sharp orientation tuning curve in a single hypercolumn. Local recurrent excitation and inhibition amplifies a weakly modulated input from the LGN. Dotted line is the baseline output without orientation tuning.
of the excitatory population of a single hypercolumn are shown (see also Section 4.1). Thus, the local intracortical connections within a hypercolumn serve to amplify a weakly oriented input signal from the LGN (Somers et al., 1995; Ben-Yishai et al., 1997). The proportion of thalamocortical versus intracortical input in vivo and the tuning of the thalamocortical input are matters of ongoing debate. It is still not known for certain whether orientation selectivity is introduced in a feedforward manner to subsequent layers or arises from the intrinsic recurrent architecture of the upper layers of layer 4 or higher, or from some combination of these mechanisms. Recent work appears to confirm the original proposal of Hubel and Wiesel (1962) that geniculate input is sharply tuned in the cat (Ferster, Chung, & Wheat, 1997). On the other hand, a considerable body of work indicates that intracortical processes, including recurrent excitation and intracortical inhibition, are important in determining orientation selectivity (Blakemore & Tobin, 1972; Sillito, 1975; Douglas et al., 1995; Somers et al., 1995). In primates such as the macaque, most layer IVC cells receiving direct thalamocortical input are unoriented, and thus the introduction of orientation asymmetry is clearly a cortical process. If orientation information is to be represented cortically and kept "on-line" for any period of time, then specific intracortical circuitry not too dissimilar from the tuning circuitry of the primate (e.g., local recurrent excitation and inhibition) is necessary to stably represent the sharply tuned input. The work of Phleger and Bonds (1995) demonstrating a loss of stable orientation tuning in cats with blocking of intracortical inhibition supports this view.

3.2 Cubic Amplitude Equation: Stationary Case. So far we have established that in the absence of lateral connections, each hypercolumn (labeled
Amplitude Equation Approach
by cortical position $r_j$) can exhibit sharp orientation tuning when in a sufficiently excited state. The peak of the tuning curve is fixed by inputs from the LGN signaling the presence of a local oriented bar in the classical receptive field (CRF) of the given hypercolumn. We wish to investigate how such activity is further modulated by stimuli outside the CRF due to the presence of anisotropic lateral connections between hypercolumns leading to contextual effects. Our approach will be to exploit the fact that the lateral connections are weak relative to local circuit connections and to use bifurcation analysis to derive dynamical equations for the amplitudes of the excited modes close to the bifurcation point. These equations then allow us to explore the effects of local and lateral inputs on sharp orientation tuning. For simplicity, we assume that each hypercolumn can be in one of two states of activation distinguished by the index $\chi(r_i) = 0, 1$, where $r_i$ is the cortical position of the $i$th hypercolumn. If $\chi(r_i) = 0$, then the neurons within the hypercolumn are well below threshold, so that there is only spontaneous activity, $s_l \approx 0$. On the other hand, if $\chi(r_i) = 1$, then the hypercolumn exhibits sharp orientation tuning with mean levels of output activity $s_l(\bar{a}_l(r_i))$, where $\bar{a}_l(r_i)$ is a fixed-point solution of equations 3.4 and 3.5 close to the bifurcation point. Any hypercolumn in the active state is taken to have the same mean output activity. That is, if we denote the set of active hypercolumns by $J = \{i \in 1, \ldots, N;\ \chi(r_i) = 1\}$, then $\bar{a}_l(r_i) = \bar{a}_l$ for all $i \in J$, and we set $s_l(\bar{a}_l) = \bar{s}_l$. First, perform a Taylor expansion of equation 2.4 with respect to $b_l(r_i, \phi, t) = a_l(r_i, \phi, t) - \bar{a}_l$, $i \in J$:

$$\frac{\partial b_l}{\partial t} = -b_l + \sum_{m=E,I} w_{lm} * \left[\mu b_m + \gamma_m b_m^2 + \gamma_m' b_m^3 + \cdots\right] + \Delta h_l + \epsilon \beta_l\, \hat{w} \circ \left[\bar{s}_E + \mu b_E + \cdots\right]\chi \tag{3.16}$$
where $\Delta h_l = h_l - \bar{h}_l$ and $\mu = s_l'(\bar{a}_l)$, $\gamma_l = s''(\bar{a}_l)/2$, $\gamma_l' = s'''(\bar{a}_l)/6$. The convolution operation $*$ is defined by equation 3.3, and

$$[\hat{w} \circ f](r_i, \phi) = \sum_{j \in J,\, j \neq i} \hat{w}(r_{ij}) \int_{-\pi/2}^{\pi/2} p_0(\phi' - \eta_{ij})\, p_1(\phi - \phi')\, f(r_j, \phi')\, \frac{d\phi'}{2\pi} \tag{3.17}$$
for an arbitrary function $f(r, \phi)$, and $\hat{w}$ given by equation 2.5. Suppose that the system is $\epsilon$-close to the point of marginal stability of the homogeneous fixed point associated with excitation of the modes $e^{\pm 2i\phi}$. That is, take $\mu = \mu_c + \epsilon\,\Delta\mu$, where $\mu_c = 1/W_1^+$ (see equation 3.11). Substitute into equation 3.16
Paul C. Bressloff and Jack D. Cowan
the perturbation expansion

$$b_m = \epsilon^{1/2}\, b_m^{(1)} + \epsilon\, b_m^{(2)} + \epsilon^{3/2}\, b_m^{(3)} + \cdots \tag{3.18}$$
Finally, introduce a slow timescale $\tau = \epsilon t$ and collect terms with equal powers of $\epsilon$. This leads to a hierarchy of equations of the form (up to $O(\epsilon^{3/2})$):

$$[L b^{(1)}]_l = 0 \tag{3.19}$$

$$[L b^{(2)}]_l = v_l^{(2)} \equiv \sum_{m=E,I} \gamma_m\, w_{lm} * [b_m^{(1)}]^2 + \beta_l \bar{s}_E\, \hat{w} \circ \chi \tag{3.20}$$

$$[L b^{(3)}]_l = v_l^{(3)} \equiv -\frac{\partial b_l^{(1)}}{\partial \tau} + \sum_{m=E,I} w_{lm} * \left[\Delta\mu\, b_m^{(1)} + \gamma_m' [b_m^{(1)}]^3 + 2\gamma_m b_m^{(1)} b_m^{(2)}\right] + \Delta h_l + \mu_c \beta_l\, \hat{w} \circ b_E^{(1)} \chi \tag{3.21}$$

with the linear operator $L$ defined according to

$$[L b]_l = b_l - \mu_c \sum_{m=E,I} w_{lm} * b_m. \tag{3.22}$$
We have also assumed that the modulatory external input is $O(\epsilon^{3/2})$ and rescaled $\Delta h_l \rightarrow \epsilon^{3/2} \Delta h_l$. Recall that each active hypercolumn is assumed to exhibit sharp orientation tuning, so that the local connections are such that the first equation in the hierarchy, equation 3.19, has solutions of the form

$$b^{(1)}(r_i, \phi, \tau) = \left[z(r_i, \tau)\, e^{2i\phi} + \bar{z}(r_i, \tau)\, e^{-2i\phi}\right] B \tag{3.23}$$

for all $i \in J$, with $B \equiv B_1^+$ defined in equation 3.14, and $\bar{z}$ denotes the complex conjugate of $z$. We obtain a dynamical equation for the complex amplitude $z(r_i, \tau)$ by deriving solvability conditions for the higher-order equations. We proceed by taking the inner product of equations 3.20 and 3.21 with the dual eigenmode $\tilde{b}(\phi) = e^{2i\phi} \tilde{B}$, where

$$\tilde{B} = \begin{pmatrix} W_{IE}(1) \\ 1 - \tfrac{1}{2}\left[W_{EE}(1) + W_{II}(1) - \Sigma(1)\right] \end{pmatrix}$$

so that

$$[L^T \tilde{b}]_l \equiv \tilde{b}_l - \mu_c \sum_{m=E,I} w_{ml} * \tilde{b}_m = 0. \tag{3.24}$$
The inner product of any two vector-valued functions of $\phi$ is defined as

$$\langle u | v \rangle = \int_0^\pi \left[u_E^*(\phi) v_E(\phi) + u_I^*(\phi) v_I(\phi)\right] \frac{d\phi}{\pi}. \tag{3.25}$$
With respect to this inner product, the linear operator $L$ satisfies $\langle \tilde{b} | L b \rangle = \langle L^T \tilde{b} | b \rangle = 0$ for any $b$. Since $L b^{(p)} = v^{(p)}$, we obtain a hierarchy of solvability conditions $\langle \tilde{b} | v^{(p)} \rangle = 0$ for $p = 2, 3, \ldots$. It can be shown from equations 3.17, 3.20, and 3.23 that the first solvability condition requires that

$$\beta_l \bar{s}_E P_0^{(1)} P_1^{(1)} \sum_{j \in J} \hat{w}(r_{ij})\, e^{-2i\eta_{ij}} \leq O(\epsilon^{1/2}) \tag{3.26}$$

where

$$P_k^{(n)} = \int_{-\pi/2}^{\pi/2} e^{-2ni\phi}\, p_k(\phi)\, \frac{d\phi}{\pi}. \tag{3.27}$$
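Since the coefficients $P_k^{(n)}$ recur throughout what follows, it may help to evaluate equation 3.27 numerically. The sketch below (illustrative, assumed probability densities; not from the original paper) checks that an isotropic spread $p_0 \equiv 1$ gives $P_0^{(n)} = 0$ for $n \geq 1$, whereas an anisotropic spread gives real coefficients that decay with $n$:

```python
import numpy as np

# Midpoint grid over one period [-pi/2, pi/2); grid averages of harmonics
# reproduce the normalized integral (d phi / pi) essentially exactly.
N = 720
phi = -np.pi / 2 + (np.arange(N) + 0.5) * np.pi / N

def P(n, p):
    """P_k^(n) = int e^{-2 n i phi} p(phi) dphi / pi (equation 3.27)."""
    return np.mean(np.exp(-2j * n * phi) * p)

# Isotropic spread: p0(phi) = 1 (normalized so that int p0 dphi/pi = 1).
p_iso = np.ones(N)

# Anisotropic spread: truncated Gaussian of width sigma (illustrative).
sigma = 0.3
g = np.exp(-phi**2 / (2 * sigma**2))
p_aniso = g / np.mean(g)     # normalize: int p dphi/pi = 1

P1_iso, P2_iso = P(1, p_iso), P(2, p_iso)
P1, P2 = P(1, p_aniso), P(2, p_aniso)
```

For the isotropic density, both coefficients vanish; for the anisotropic one, $0 < P_0^{(2)} < P_0^{(1)} < 1$, consistent with the role these coefficients play below as measures of the angular spread of the lateral connections.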
This is analogous to the so-called adaptation condition in the bifurcation theory of weakly connected neurons (Hoppensteadt & Izhikevich, 1997). If it is violated, then at this level of approximation, the interactions between hypercolumns become trivial. There are two ways in which equation 3.26 can be satisfied: the configuration of surrounding hypercolumns is such that $\sum_j \hat{w}(r_{ij}) e^{-2i\eta_{ij}} \leq O(\epsilon^{1/2})$, or the mean firing rate $\bar{s}_E \leq O(\epsilon^{1/2})$. Here we assume that $\bar{s}_E = O(\epsilon^{1/2})$. The solvability condition $\langle \tilde{b} | v^{(3)} \rangle = 0$ generates a cubic amplitude equation for $z(r_i, \tau)$. As a further simplification, we set $\gamma_m = 0$, since this does not alter the basic structure of the amplitude equation. (Note, however, that the coefficients of the amplitude equation are $\gamma_m$ dependent, and hence the stability properties of a pattern may change if $\gamma_m \neq 0$.) Using equations 3.17, 3.21, and 3.23, we then find that (after rescaling $\tau$)

$$\frac{\partial z(r_i, \tau)}{\partial \tau} = z(r_i, \tau)\left(\Delta\mu - A |z(r_i, \tau)|^2\right) + f(r_i) + \beta \sum_{j \in J} \hat{w}(r_{ij}) \left[z(r_j, \tau) + \bar{z}(r_j, \tau) P_0^{(2)} e^{-4i\eta_{ij}} + s P_0^{(1)} e^{-2i\eta_{ij}}\right] \tag{3.28}$$

for all $i \in J$, where $s = \bar{s}_E / (\sqrt{\epsilon}\, \mu_c B_E)$,

$$\beta = P_1^{(1)} \sum_{l=E,I} D_l \beta_l \tag{3.29}$$

$$f(r_i) = \mu_c \sum_{l=E,I} \tilde{B}_l \int_0^\pi e^{-2i\phi}\, \Delta h_l(r_i, \phi)\, \frac{d\phi}{\pi} \tag{3.30}$$
and

$$D_l = \frac{\mu_c^2 B_E \tilde{B}_l}{\tilde{B}^T B}, \qquad A = -\frac{3}{\tilde{B}^T B} \sum_{l=E,I} \tilde{B}_l\, \gamma_l'\, B_l^3. \tag{3.31}$$
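Equation 3.28 is straightforward to integrate numerically. The sketch below (a minimal illustration with assumed parameter values and geometry, simple Euler stepping; not the authors' code) evolves a small ring of active hypercolumns to a steady state:

```python
import numpy as np

# --- illustrative, assumed setup for equation 3.28 ---
N = 6                                    # active hypercolumns on a ring
theta = 2 * np.pi * np.arange(N) / N
pos = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # positions r_i
dmu, A, beta, s = 0.1, 1.0, -0.1, 0.2    # Delta mu, A, beta, s
P01, P02 = 0.3, 0.2                      # P_0^(1), P_0^(2) (assumed)
Phi = np.linspace(0.0, 1.2, N)           # LGN stimulus orientations Phi_i
f = 1.0 * np.exp(-2j * Phi)              # f(r_i) = C_i e^{-2 i Phi_i}, C_i = 1

def w_hat(d):
    return np.exp(-d**2)                 # lateral footprint (assumed)

def rhs(z):
    """Right-hand side of the amplitude equation 3.28."""
    dz = z * (dmu - A * np.abs(z)**2) + f
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            rij = pos[j] - pos[i]
            eta = np.arctan2(rij[1], rij[0])     # direction eta_ij
            dz[i] += beta * w_hat(np.linalg.norm(rij)) * (
                z[j]
                + np.conj(z[j]) * P02 * np.exp(-4j * eta)
                + s * P01 * np.exp(-2j * eta))
    return dz

z = np.zeros(N, dtype=complex)
dt = 0.01
for _ in range(30000):                   # integrate to tau = 300
    z = z + dt * rhs(z)

residual = np.max(np.abs(rhs(z)))        # ~0 at a fixed point
```

With weak coupling the network settles on a stable fixed point close to the uncoupled, contrast-locked solution of each hypercolumn, which is the regime analyzed in the remainder of the article.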
Equation 3.28 is our reduced model of weakly interacting hypercolumns. It describes the effects of anisotropic lateral connections and modulatory inputs from the LGN on the dynamics of the (complex) amplitude $z(r_i, \tau)$. The latter determines the response properties of the orientation tuning curve associated with the hypercolumn at cortical position $r_i$. The coupling parameter $\beta$ is a linear combination of the relative strengths of the lateral connections innervating excitatory neurons and those innervating inhibitory neurons, with $D_E$, $D_I$ determined by the local weight distribution. Since $D_E > 0$ and $D_I < 0$, we see that the effective interactions between hypercolumns have both an excitatory and an inhibitory component. (The factor $P_1^{(1)}$ appearing in equation 3.29, which arises from the spread in lateral connections with respect to orientation (see Figure 3b), generates a positive rescaling of the lateral interactions and can be absorbed into the parameters $D_E$ and $D_I$.) Note that in the case of isotropic lateral connections, $P_0^{(n)} = 0$ for all $n \geq 1$, so that $\bar{z}$ decouples from $z$ in the amplitude equation 3.28.

3.3 Cubic Amplitude Equation: Oscillatory Case. In our derivation of the amplitude equation 3.28, we assumed that the local cortical circuit generates a stationary orientation tuning curve. However, as shown in Section 3.1, it is possible for a time-periodic tuning curve to occur when $\mathrm{Im}\, W_1^+ \neq 0$. Taylor expanding equation 2.4 as before leads to the hierarchy of equations 3.19 through 3.21, except that the linear operator $L \rightarrow L_t = L + \partial/\partial t$. The lowest-order solution, equation 3.23, now takes the form

$$b^{(1)}(r_i, \phi, t, \tau) = \left[z_L(r_i, \tau)\, e^{i(\Omega_0 t - 2\phi)} + z_R(r_i, \tau)\, e^{i(\Omega_0 t + 2\phi)}\right] B + \mathrm{c.c.} \tag{3.32}$$

where $z_L$ and $z_R$ represent the complex amplitudes for anticlockwise (L) and clockwise (R) rotating waves (around the ring of a single hypercolumn), and

$$\Omega_0 = \mu_c \sqrt{4 W_{EI}(1) W_{IE}(1) - \left[W_{EE}(1) + W_{II}(1)\right]^2}. \tag{3.33}$$
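For illustrative (assumed) values of the first Fourier coefficients of the local weights, equation 3.33 is easy to evaluate; a time-periodic tuning curve requires the argument of the square root to be positive:

```python
import math

def natural_frequency(mu_c, WEE, WEI, WIE, WII):
    """Omega_0 from equation 3.33; None if the argument of the square
    root is non-positive (no oscillatory instability)."""
    disc = 4 * WEI * WIE - (WEE + WII) ** 2
    if disc <= 0:
        return None
    return mu_c * math.sqrt(disc)

# Assumed, illustrative coefficients W_lm(1):
osc = natural_frequency(mu_c=1.0, WEE=0.8, WEI=1.2, WIE=1.0, WII=0.2)
no_osc = natural_frequency(mu_c=1.0, WEE=2.0, WEI=0.5, WIE=0.5, WII=1.0)
```

In the first case, strong reciprocal E-I coupling dominates the recurrent self-coupling and an oscillation frequency exists; in the second, it does not.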
Introduce the generalized inner product

$$\langle u | v \rangle = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} \int_0^\pi \left[u_E^*(\phi, t) v_E(\phi, t) + u_I^*(\phi, t) v_I(\phi, t)\right] \frac{d\phi}{\pi}\, dt \tag{3.34}$$

and the dual vectors $\tilde{b}_L = \tilde{B} e^{i(\Omega_0 t - 2\phi)}$, $\tilde{b}_R = \tilde{B} e^{i(\Omega_0 t + 2\phi)}$. Using the fact that $\langle \tilde{b}_L | L_t b \rangle = \langle \tilde{b}_R | L_t b \rangle = 0$ for arbitrary $b$, we obtain the pair of solvability conditions $\langle \tilde{b}_L | v^{(p)} \rangle = \langle \tilde{b}_R | v^{(p)} \rangle = 0$ for each $p \geq 2$.
The $p = 2$ solvability conditions are identically satisfied. The $p = 3$ solvability conditions then generate cubic amplitude equations for $z_L$, $z_R$ of the form

$$\frac{\partial z_L(r_i, \tau)}{\partial \tau} = (1 + i\Omega_0)\, z_L(r_i, \tau) \left[\Delta\mu - A |z_L(r_i, \tau)|^2 - 2A |z_R(r_i, \tau)|^2\right] + f_-(r_i) + \beta \sum_{j \in J} \hat{w}(r_{ij}) \left[z_L(r_j, \tau) + z_R(r_j, \tau) P_0^{(2)} e^{4i\eta_{ij}}\right] \tag{3.35}$$

and

$$\frac{\partial z_R(r_i, \tau)}{\partial \tau} = (1 + i\Omega_0)\, z_R(r_i, \tau) \left[\Delta\mu - A |z_R(r_i, \tau)|^2 - 2A |z_L(r_i, \tau)|^2\right] + f_+(r_i) + \beta \sum_{j \in J} \hat{w}(r_{ij}) \left[z_R(r_j, \tau) + z_L(r_j, \tau) P_0^{(2)} e^{-4i\eta_{ij}}\right] \tag{3.36}$$

where

$$f_\pm(r_i) = \lim_{T \to \infty} \frac{\mu_c}{T} \int_{-T/2}^{T/2} \int_0^\pi e^{-i(\Omega_0 t \pm 2\phi)} \sum_{l=E,I} \tilde{B}_l\, \Delta h_l(r_i, \phi, t)\, \frac{d\phi}{\pi}\, dt. \tag{3.37}$$
It can be seen that the amplitudes couple only to time-dependent inputs from the LGN.

4 Contextual Effects

In the previous section, we used perturbation techniques to reduce the original infinite-dimensional system of Wilson-Cowan equations (2.4) to a corresponding finite-dimensional system of amplitude equations (3.28). This is illustrated in Figure 6. The complex amplitude $z(r_i) = Z_i e^{-2i\phi_i}$ determines the linear response function $R_i(\phi)$ of the $i$th hypercolumn according to

$$R_i(\phi) = z(r_i) e^{2i\phi} + \bar{z}(r_i) e^{-2i\phi} = 2 Z_i \cos(2[\phi - \phi_i]) \tag{4.1}$$
and this in turn determines the population tuning curves of the excitatory and inhibitory populations. In this section, we use the amplitude equation 3.28 to investigate (at least qualitatively) how contextual stimuli falling outside the classical receptive field of a hypercolumn modify its response to stimuli within the classical receptive field; such contextual effects are mediated by the lateral interactions between hypercolumns.

4.1 A Single Driven Hypercolumn. We begin by determining the response of a single hypercolumn to an LGN input in the absence of lateral interactions ($\beta = 0$) and under the assumption that the orientation tuning
Figure 6: Reduction from a dynamical model of interacting hypercolumns to a dynamical model of interacting population tuning curves.
curves and external stimuli are stationary. Let the input from the LGN to the $i$th active hypercolumn be of the form

$$\Delta h_l(r_i, \phi) = g_l C_i \cos(2[\phi - \Phi_i]) \tag{4.2}$$

where $C_i > 0$ is proportional to the contrast of the center stimulus, $\Phi_i$ is its orientation, and $g_l \geq 0$ determines the relative strengths of the feedforward input to the local excitatory and inhibitory populations. Such a stimulus represents an oriented edge or bar in the aggregate receptive field of the hypercolumn. It then follows from equation 3.30 that

$$f(r_i) = C_i e^{-2i\Phi_i} \tag{4.3}$$

where we have set $\mu_c \sum_{l=E,I} \tilde{B}_l g_l = 1$ for convenience. This term is positive if $g_E \gg g_I$. Setting $\beta = 0$ in equation 3.28 and writing $z(r_i) = Z_i e^{-2i\phi_i}$, with $Z_i$ real, we obtain the following pair of equations:

$$\frac{dZ_i}{d\tau} = Z_i(\Delta\mu - A Z_i^2) + C_i \cos(2[\phi_i - \Phi_i]) \tag{4.4}$$

$$\frac{d\phi_i}{d\tau} = -\frac{C_i}{2 Z_i} \sin(2[\phi_i - \Phi_i]). \tag{4.5}$$
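Equations 4.4 and 4.5 can be integrated directly. The following sketch (illustrative parameter values and initial conditions, assumed; simple Euler stepping) makes the locking behavior concrete:

```python
import math

# Illustrative (assumed) parameters for equations 4.4 and 4.5
dmu, A, C, Phi = 0.1, 1.0, 1.0, 0.5      # Delta mu, A, C_i, Phi_i
Z, phi = 1.0, 0.3                        # initial amplitude and phase
dt = 0.01

for _ in range(20000):                   # integrate to tau = 200
    dZ = Z * (dmu - A * Z**2) + C * math.cos(2 * (phi - Phi))
    dphi = -(C / (2 * Z)) * math.sin(2 * (phi - Phi))
    Z, phi = Z + dt * dZ, phi + dt * dphi

# At the fixed point, phi* = Phi and Z* solves Z(dmu - A Z^2) + C = 0.
cubic_residual = Z * (dmu - A * Z**2) + C
```

The phase converges to the stimulus orientation and the amplitude to the positive root of the cubic, the fixed point analyzed next in the text.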
Assuming that the coefficients $\Delta\mu > 0$, $A > 0$, these have a stable fixed-point solution $(\phi_i^*, Z_i^*)$ with $\phi_i^* = \Phi_i$ (independent of the contrast), and $Z_i^*$
is the positive root of the cubic $Z_i(\Delta\mu - A Z_i^2) + C_i = 0$. Thus, the hypercolumn encodes the orientation $\Phi_i$ of a local bar in its aggregate receptive field by locking the peak of its tuning curve to $\Phi_i$ (see Figure 5). The amplitude of the tuning curve is determined by $Z_i^*$, which is an increasing function of contrast. Note that in the absence of an external stimulus ($C_i = 0$), the tuning curve is marginally stable, since $\phi_i^*$ is arbitrary. This means that the phase will be susceptible to noise-induced diffusion. Recall from Section 3 that the amplitude of the LGN input was assumed to be $O(\epsilon^{3/2})$, whereas the amplitude of the response is $O(\epsilon^{1/2})$. Since $\epsilon \ll 1$, the intrinsic circuitry of the hypercolumn amplifies the LGN input. Now consider the oscillatory case. In order that the amplitudes $z_L$ and $z_R$ of equations 3.35 and 3.36 couple to an LGN input, the latter must be time dependent with a Fourier component at the natural frequency of oscillation $\Omega_0$ of the hypercolumn. It is certainly reasonable to assume that external stimuli have a time-dependent part. For example, neurophysiological experiments often use flashing or drifting oriented gratings as external stimuli. (These are chosen to compensate for the spike frequency adaptation of neurons in the cortex.) Time-dependent signals might also be generated in the LGN. Suppose that $C_i$ is the contrast of the relevant temporal frequency component and $\Phi_i$ is the orientation of the external stimulus. Neglecting lateral inputs, the amplitude equations 3.35 and 3.36 become

$$\frac{\partial z_L(r_i, \tau)}{\partial \tau} = (1 + i\Omega_0)\, z_L(r_i, \tau) \left[\Delta\mu - A |z_L(r_i, \tau)|^2 - 2A |z_R(r_i, \tau)|^2\right] + C_i e^{-2i\Phi_i} \tag{4.6}$$

and

$$\frac{\partial z_R(r_i, \tau)}{\partial \tau} = (1 + i\Omega_0)\, z_R(r_i, \tau) \left[\Delta\mu - A |z_R(r_i, \tau)|^2 - 2A |z_L(r_i, \tau)|^2\right] + C_i e^{2i\Phi_i}. \tag{4.7}$$
If $C_i = 0$, $\Delta\mu > 0$, and $A > 0$, there are three types of fixed-point solution: (1) anticlockwise rotating waves $|z_L|^2 = \Delta\mu/A$, $z_R = 0$; (2) clockwise rotating waves $|z_R|^2 = \Delta\mu/A$, $z_L = 0$; and (3) standing waves $|z_R|^2 = |z_L|^2 = \Delta\mu/3A$. The standing waves are unstable, and the traveling waves are marginally stable due to the existence of arbitrary phases (Kath, 1981; Aronson, Ermentrout, & Kopell, 1990). (Note, however, that it is possible to obtain stable standing waves when $\gamma_m \neq 0$ in equation 3.21.) If $C_i > 0$, the traveling wave solutions no longer exist, but standing wave solutions do, of the form $z_L = Z_i e^{-2i\Phi_i} e^{i\psi}$, $z_R = Z_i e^{2i\Phi_i} e^{i\psi}$ with $\tan\psi = -\Omega_0$ and $Z_i$ a positive root of the cubic

$$Z_i\left[\Delta\mu - 3A Z_i^2\right] + C_i \cos(\psi) = 0. \tag{4.8}$$
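This standing-wave solution can be checked directly against equations 4.6 and 4.7. The sketch below (illustrative, assumed values of $\Omega_0$, $\Delta\mu$, $A$, $C_i$, with $\Phi_i = 0$ for simplicity) builds the solution from equation 4.8 and verifies that it is a fixed point:

```python
import numpy as np

# Illustrative (assumed) parameters; Phi_i = 0 for simplicity.
Omega0, dmu, A, C = 1.0, 0.1, 1.0, 1.0
psi = -np.arctan(Omega0)                 # tan(psi) = -Omega_0

# Positive root of Z [dmu - 3 A Z^2] + C cos(psi) = 0   (equation 4.8)
roots = np.roots([3 * A, 0.0, -dmu, -C * np.cos(psi)])
Z = next(r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 0)

zL = Z * np.exp(1j * psi)                # z_L = Z e^{-2 i Phi} e^{i psi}
zR = Z * np.exp(1j * psi)                # z_R = Z e^{+2 i Phi} e^{i psi}

# Right-hand sides of equations 4.6 and 4.7 (lateral inputs neglected)
FL = (1 + 1j * Omega0) * zL * (dmu - A * abs(zL)**2 - 2 * A * abs(zR)**2) + C
FR = (1 + 1j * Omega0) * zR * (dmu - A * abs(zR)**2 - 2 * A * abs(zL)**2) + C
```

Both right-hand sides vanish at the constructed solution, confirming the algebra leading to equation 4.8.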
Equation 3.32 then implies that the tuning curves are of the form $4 Z_i \cos(\Omega_0 t + \psi) \cos(2[\phi - \Phi_i])$. It can be shown that these solutions are unstable for sufficiently low contrasts but stable for high contrasts. The existence of a stable, time-periodic tuning curve in the case of a single isolated hypercolumn means that it effectively acts as a single giant oscillator. One could then investigate, for example, the synchronization properties of a network of hypercolumns coupled via anisotropic lateral interactions. However, it is not clear how to interpret these oscillations biologically. One possibility is that they correspond to the 40 Hz $\gamma$ frequency oscillations that are thought to play a role in the synchronization of cell assemblies (Gray, Konig, Engel, & Singer, 1989; Singer & Gray, 1995). One difficulty with this interpretation is that the $\gamma$ oscillations appear in the spikes of individual neurons, and thus the mechanism for generating them is likely to be washed out in any mean-field theory analysis used to derive the rate models considered in this article. On the other hand, recent work on integrate-and-fire networks has shown that time-dependent firing rates can be viewed as modulations of single neuron spikes (Bressloff et al., 2000). Given the difficulties concerning the significance of oscillatory tuning curves, we assume that the normal operating regime of each hypercolumn is to generate stationary tuning curves in response to slowly changing LGN inputs.

4.2 Center-Surround Interactions. Consider the following experimental situation (Blakemore & Tobin, 1972; Li & Li, 1994; Sillito et al., 1995): a particular hypercolumn designated the "center" is stimulated with a grating at orientation $\Phi_c$, while outside the receptive area of this hypercolumn (in the "surround") there is a grating at some uniform orientation $\Phi_s$, as shown in Figure 7a. In order to analyze this problem, we introduce a mean-field approximation.
That is, the active region of cortex responding to the center-surround stimulus of Figure 7a is taken to have the configuration shown in Figure 7b. This consists of a center hypercolumn at $r_c = 0$ interacting with a ring of $N$ identical surround hypercolumns at relative positions $r_j$, with $|r_j| = R$, $j = 1, \ldots, N$. Note that the total input from surround to center is strong relative to that from center to surround. Therefore, as a further approximation, we can neglect the latter and treat the system as a single hypercolumn receiving a mixture of inputs from the LGN and lateral inputs from the surround (see Figure 7c). Suppose that the surround hypercolumns are sharply tuned to the stimulus orientation $\Phi_s$, that is, they have steady-state amplitudes of the form $z(r_j) = Z_s e^{-2i\Phi_s}$, where $Z_s$ is positive and real. It follows from equation 3.28 that the complex amplitude $z_c(\tau)$ characterizing the response of the center
Figure 7: (a) Circular center-surround stimulus configuration. (b) Cortical configuration of a center hypercolumn interacting with a ring of surround hypercolumns. (c) Effective single hypercolumn circuit in which there are direct LGN inputs from the center and lateral inputs from the surround. The relative strengths of the lateral inputs to the excitatory (E) and inhibitory (I) populations are determined by $\beta_E$ and $\beta_I$, respectively. The corresponding LGN input strengths are denoted by $g_E$ and $g_I$.
satisfies the equation

$$\frac{dz_c(\tau)}{d\tau} = z_c(\tau)\left(\Delta\mu - A |z_c(\tau)|^2\right) + \beta w_s \left(e^{-2i\Phi_s} + \Lambda e^{2i\Phi_s} + \tilde{\Lambda}\right) + C_c e^{-2i\Phi_c} \tag{4.9}$$

where $w_s = N Z_s \hat{w}(R) > 0$, $C_c$ is the contrast of the center stimulus, and

$$\Lambda = \frac{P_0^{(2)}}{N} \sum_{j=1}^N e^{-4i\eta_j}, \qquad \tilde{\Lambda} = \frac{s P_0^{(1)}}{N Z_s} \sum_{j=1}^N e^{-2i\eta_j} \tag{4.10}$$
with $\eta_j$ the direction of the line from the center to the $j$th surround hypercolumn (see Figure 7b). We now make the simplifying assumption that $\Lambda = \tilde{\Lambda} = 0$ due to the rotational symmetry of the mean-field configuration. Writing $z_c = Z_c e^{-2i\phi_c}$, we then obtain the following pair of equations:

$$\frac{dZ_c}{d\tau} = Z_c(\Delta\mu - A Z_c^2) + C_c \cos(2[\phi_c - \Phi_c]) + \beta w_s \cos(2[\phi_c - \Phi_s]) \tag{4.11}$$

$$\frac{d\phi_c}{d\tau} = -\frac{C_c}{2 Z_c} \sin(2[\phi_c - \Phi_c]) - \frac{\beta w_s}{2 Z_c} \sin(2[\phi_c - \Phi_s]). \tag{4.12}$$
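Equations 4.11 and 4.12 can be integrated directly to reproduce the trend shown in Figure 8. The sketch below uses simple Euler stepping with parameter values matching the Figure 8 caption, and assumes $\beta w_s = -0.8$ for the high-contrast regime; it compares collinear and orthogonal surrounds against no surround:

```python
import math

dmu, A, Cc = 0.1, 1.0, 1.0               # Delta mu, A, C_c (Figure 8 values)
Phic = 0.0                               # center stimulus orientation

def steady_state(bws, dPhi, Z=1.0, phi=0.2, dt=0.005, steps=40000):
    """Integrate equations 4.11-4.12; bws = beta*w_s, dPhi = Phi_s - Phi_c."""
    Phis = Phic + dPhi
    for _ in range(steps):
        dZ = (Z * (dmu - A * Z**2) + Cc * math.cos(2 * (phi - Phic))
              + bws * math.cos(2 * (phi - Phis)))
        dphi = (-(Cc / (2 * Z)) * math.sin(2 * (phi - Phic))
                - (bws / (2 * Z)) * math.sin(2 * (phi - Phis)))
        Z, phi = Z + dt * dZ, phi + dt * dphi
    return Z, phi

Z0, _ = steady_state(0.0, 0.0)                       # no surround
Zpar, phi_par = steady_state(-0.8, 0.0)              # collinear surround
Zorth, phi_orth = steady_state(-0.8, math.pi / 2)    # orthogonal surround

dZ_par = Zpar * math.cos(2 * (Phic - phi_par)) / Z0     # equation 4.13
dZ_orth = Zorth * math.cos(2 * (Phic - phi_orth)) / Z0
```

The collinear surround gives $\delta Z_c < 1$ (suppression) and the orthogonal surround gives $\delta Z_c > 1$ (facilitation), the qualitative behavior plotted in Figure 8a.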
Figure 8: Response of center population as a function of the relative orientation of the surround stimulus $\Delta\Phi = \Phi_s - \Phi_c$. Parameters in amplitude equations 4.11 and 4.12 are $\Delta\mu = 0.1$, $|\beta| w_s = 0.8$, $C_c = 1$, and $A = 1$. (a) Solid line: total response $\delta Z_c$ at preferred orientation as a fraction of the response in the absence of surround modulation. Dashed line: amplitude of maximum response $Z_c^*(\beta)$. (b) Plot of the shift in the peak of the orientation tuning curve $\delta\phi_c = \Phi_c - \phi_c^*(\beta)$.
Recall from equation 3.29 that the effective interaction parameter $\beta$ has both an excitatory and an inhibitory component. As noted previously, experimental data suggest that the effect of lateral inputs on the response of a hypercolumn is contrast sensitive, tending to be suppressive at high contrasts and facilitatory at low contrasts (Toth et al., 1996; Levitt & Lund, 1997; Polat et al., 1998). In terms of the amplitude equation 4.9, this implies that $\beta = \beta(C_c)$. For the moment, we assume that the contrast of the center stimulus is large enough so that $\beta < 0$. Let $(Z_c^*(\beta), \phi_c^*(\beta))$ be the unique, stable, steady-state solution of equations 4.11 and 4.12 for a given $\beta$. We define two important quantities: the shift $\delta\phi_c$ in the peak of the tuning curve and the fractional change $\delta Z_c$ in the linear response at the preferred orientation $\Phi_c$:

$$\delta\phi_c = \Phi_c - \phi_c^*(\beta), \qquad \delta Z_c = \frac{Z_c^*(\beta) \cos(2\delta\phi_c)}{Z_c^*(0)}. \tag{4.13}$$
If $\delta Z_c < 1$, the effect of the lateral inputs is suppressive, whereas if $\delta Z_c > 1$, then it is facilitatory. In Figure 8 we plot the resulting quantities $\delta Z_c$ and $\delta\phi_c$ as a function of $\Delta\Phi = \Phi_s - \Phi_c$. It can be seen that surround modulation changes from suppression to facilitation as the difference in the orientation of the surround and the center increases beyond around $60^\circ$. This is consistent with experimental results on center-surround suppression and facilitation at high contrasts (Blakemore & Tobin, 1972; Li & Li, 1994; Sillito et al., 1995). An important contribution to the asymmetry between suppression and facilitation arises from the shift in the peak of the effective tuning curve in the presence of the surround, as measured by $\delta\phi_c$. Since this is positive, there is an apparent increase in the difference between the center and surround tuning curve peaks, which is analogous to the direct tilt effect observed in psychophysical experiments (Wenderoth & Johnstone, 1988), although the effect is much larger here. We now give a simple argument for the switch from suppression to facilitation as the relative angle $\Delta\Phi = \Phi_s - \Phi_c$ is increased. If the surround is stimulated by a grating parallel to that of the center stimulus ($\Phi_s = \Phi_c$), then $\phi_c^* = \Phi_c$ and equation 4.11 becomes

$$\frac{dZ_c}{d\tau} = Z_c(\Delta\mu - A Z_c^2) + C_c + \beta w_s. \tag{4.14}$$
Similarly, in the case of an orthogonal surround ($\Phi_s = \Phi_c + 90^\circ$),

$$\frac{dZ_c}{d\tau} = Z_c(\Delta\mu - A Z_c^2) + C_c - \beta w_s. \tag{4.15}$$
The steady-state amplitude $Z_c^*$ for $\beta = 0$ is an increasing function of $C_c$ (see Section 4.1). Hence, if $\beta < 0$, then $C_c \rightarrow C_c - |\beta w_s|$ for a collinear surround, resulting in a suppression of the center response, whereas $C_c \rightarrow C_c + |\beta w_s|$ for an orthogonal surround, leading to facilitation. The occurrence of a facilitatory response to an orthogonal surround stimulus has been observed experimentally by a number of groups (Blakemore & Tobin, 1972; Sillito et al., 1995; Levitt & Lund, 1997; Polat et al., 1998). At first sight, however, this conflicts with the consistent experimental finding that stimulating a hypercolumn with an orthogonal stimulus suppresses the response to the original stimulus. In particular, DeAngelis, Robson, Ohzawa, and Freeman (1992) show that cross-orientation suppression (with orthogonal gratings) originates within the receptive field of most cat neurons examined and is a consistent finding in both complex and simple cells. The degree of suppression depends linearly on the size of the orthogonal grating up to a critical dimension, which is smaller than the classical receptive field dimension. Such observations are compatible with the above findings. In the case of orthogonal inputs to the same hypercolumn, we have the simple linear summation $C_c \cos(2\phi) + C_c' \cos(2[\phi - \pi/2]) = (C_c - C_c') \cos(2\phi)$, where $C_c > C_c' > 0$. Thus, the orthogonal input of amplitude $C_c'$ reduces the amplitude of the original input and hence gives rise to a smaller response. On the other hand, the effect of an orthogonal surround stimulus is mediated by the lateral connections and (in the suppressive setting) is input primarily to the orthogonal inhibitory population, which can lead to disinhibition and consequently facilitation. Similar arguments were used by Mundel et al. (1997) and more recently by Dragoi and Sur (2000) in their numerical study of interacting hypercolumns.
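Both parts of this argument are easy to check numerically: the steady-state amplitude is the positive root of $Z(\Delta\mu - A Z^2) + C_{\mathrm{eff}} = 0$, which is monotonically increasing in the effective contrast, and the linear summation of orthogonal inputs is an exact trigonometric identity. A small sketch (illustrative, assumed values):

```python
import numpy as np

dmu, A = 0.1, 1.0
Cc, bws = 1.0, 0.8                       # center contrast and |beta w_s| (assumed)

def Zstar(C_eff):
    """Positive root of Z(dmu - A Z^2) + C_eff = 0."""
    roots = np.roots([A, 0.0, -dmu, -C_eff])
    return next(r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 0)

Z_col = Zstar(Cc - bws)                  # collinear: C_c -> C_c - |beta w_s|
Z_none = Zstar(Cc)
Z_orth = Zstar(Cc + bws)                 # orthogonal: C_c -> C_c + |beta w_s|

# Linear summation of orthogonal inputs within one hypercolumn:
phi = np.linspace(-np.pi / 2, np.pi / 2, 181)
Cp = 0.4                                 # orthogonal input amplitude C_c' (assumed)
lhs = Cc * np.cos(2 * phi) + Cp * np.cos(2 * (phi - np.pi / 2))
rhs = (Cc - Cp) * np.cos(2 * phi)
```

The amplitudes order as suppression < no surround < facilitation, and the two sides of the summation identity agree to machine precision.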
The center response to surround stimulation depends significantly on the contrast of the center stimulation (Toth et al., 1996; Levitt & Lund, 1997; Polat et al., 1998). For example, a fixed surround stimulus tends to facilitate responses to preferred orientation stimuli when the center contrast is low
Figure 9: Variation in the response $R_c$ of a center hypercolumn at its preferred orientation as a function of (log) contrast $C_c$ for a collinear high-contrast surround (solid curve) and no surround (dashed curve). Parameters in the amplitude equations 4.11 and 4.12 are $\Delta\mu = 0.1$, $A = 1$, and $w_s = 0.8$. The contrast-dependent coupling is taken to be of the form $\beta = (0.5 - C_c)$.
but suppresses responses when it is high. Both effects are strongest when the center and surround stimuli are at the same orientation. It has recently been shown that inclusion of some form of contrast-related asymmetry between local excitatory and inhibitory neurons is sufficient to account for the switch between low-contrast facilitation and high-contrast suppression (Somers et al., 1998). One way to model the thresholding properties of the lateral interactions is to distinguish between the differing classes of inhibitory interneurons. For example, certain types of local interneuron are well placed to provide a source of feedforward inhibition from lateral connections, as illustrated in Figure 1. This feedforward inhibition will have its own threshold and gain that will be determined by the contrast of the LGN inputs, and there will thus be a contrast-dependent disynaptic inhibition arising from the lateral connections. We incorporate such an effect into our cortical model by including a contrast-dependent contribution to the lateral inputs of the excitatory population in equation 2.4: $\beta_E \rightarrow \beta_E - \beta^*$, where $\beta^*$ is contrast dependent. In the case of high contrasts, we expect $\beta^*$ to be sufficiently large so that the effective coupling $\beta < 0$. This has been assumed in our analysis of high-contrast center-surround stimulation. Now suppose that we consider the response of the center in the presence of a high-contrast surround and varying center contrast. For the sake of illustration, suppose that $\beta^*$ varies linearly with the contrast. The response of the center at its preferred orientation, $R_c$, is plotted as a function of contrast in Figure 9, both with and without a collinear surround. It can be seen that there is facilitation at
low contrasts and suppression at high contrasts, which is consistent with experimental data (Polat et al., 1998).

4.3 Anisotropy in Surround Suppression. So far we have considered a surround configuration in which the anisotropic nature of the lateral connections is effectively averaged away. This is no longer the case if only a subregion of the surround is stimulated. For concreteness, suppose that only a single hypercolumn in the surround is active, and take its location relative to the center to be $r = r(\cos\eta, \sin\eta)$ (see Figure 7b). Since the effects of feedback from the center hypercolumn to the surround cannot now be ignored, we have to solve the pair of equations

$$\frac{dz_c(\tau)}{d\tau} = z_c(\tau)\left(\Delta\mu - A |z_c(\tau)|^2\right) + C e^{-2i\Phi_c} + \beta \hat{w}(r)\left[z_s + \bar{z}_s P_0^{(2)} e^{-4i\eta} + s P_0^{(1)} e^{-2i\eta}\right] \tag{4.16}$$

$$\frac{dz_s(\tau)}{d\tau} = z_s(\tau)\left(\Delta\mu - A |z_s(\tau)|^2\right) + C e^{-2i\Phi_s} + \beta \hat{w}(r)\left[z_c + \bar{z}_c P_0^{(2)} e^{-4i\eta} + s P_0^{(1)} e^{-2i\eta}\right] \tag{4.17}$$
where for simplicity the contrast of the center and surround stimuli is taken to be equal. It is clear that if the lateral connections are isotropic, then $p_0(\phi) = 1$ for all $\phi \in [-\pi/2, \pi/2)$, so that $P_0^{(2)} = 0 = P_0^{(1)}$, and the effective interaction between the center and surround is independent of their relative angular location $\eta$. However, if $P_0^{(1)}$ and $P_0^{(2)}$ are nonzero, then there will be some dependence on the relative angle $\eta$, which is a signature of the anisotropy of the lateral connections. For the sake of illustration, suppose that $\Phi_c = \Phi_s = 0$ (iso-oriented center and surround). Then, by symmetry, a fixed-point solution exists for which $z_c = z_s = Z e^{-2i\phi}$, where

$$Z(\Delta\mu - A Z^2) + C \cos(2\phi) + \beta \hat{w}(r) Z \left[1 + P_0^{(2)} \cos(4[\phi - \eta]) + s P_0^{(1)} \cos(2[\phi - \eta])\right] = 0 \tag{4.18}$$

$$C \sin(2\phi) + \beta \hat{w}(r)\left[P_0^{(2)} \sin(4[\phi - \eta]) + s P_0^{(1)} \sin(2[\phi - \eta])\right] = 0. \tag{4.19}$$
Some solutions of these equations are plotted in Figure 10 with $\delta Z_c = Z(\beta) \cos(2\phi) / Z(0)$. It can be seen that the degree of suppression of the center (at high contrast) depends on both the relative angular position of the surround hypercolumn and the degree of spread in the lateral connections, as characterized by the parameters $P_0^{(1)}$ and $P_0^{(2)}$. For certain parameter values
Figure 10: Variation in the degree of suppression of a center hypercolumn as a function of the relative angular position $\eta$ of a surround hypercolumn. Both center and surround are stimulated by a high-contrast horizontal bar such that $\eta = 0^\circ$ corresponds to a collinear or end-flank configuration and $\eta = 90^\circ$ corresponds to an orthogonal or side-flank configuration as shown. The two curves differ in the choice of spread parameters: (i) $s P_0^{(1)} = 0.1$, $P_0^{(2)} = 0.6$ and (ii) $s P_0^{(1)} = 0.8$, $P_0^{(2)} = 0.4$. Other parameters in equations 4.16 and 4.17 are $\Delta\mu = 0.1$, $A = 1$, $C = 1.0$, and $|\beta| \hat{w}(r) = 0.5$.
(see curve ii in Figure 10), the suppressive effect of the surround hypercolumn is maximal when located close to the end-flank position and minimal when located around the side-flank position. However, it is also possible for the suppressive effect to be minimal at some oblique configuration (see curve i in Figure 10). It is interesting that recent experimental data concerning the spatial organization of surrounds in primary visual cortex of the cat imply that in many cases, suppression originates from a localized region in the surround rather than being distributed uniformly in the region encircling the center's classical receptive field; that is, the surrounds are spatially asymmetric (Walker, Ohzawa, & Freeman, 1999). The suppressive portion of the surround can arise at any location, with a bias toward the ends of the center's classical receptive field. Our analysis suggests that the effect of the surround depends on a subtle combination of the spread in the anisotropic lateral connections and the spatial organization of the surround.

5 Discussion

The mathematical model introduced here assumes that V1 dynamics is close to threshold, so that even weakly modulated signals from the LGN can trigger a tuned response. One way to achieve this is if there is a balance between intracortical excitation and inhibition. This and related ideas have been explored in a number of recent studies of the role of intracortical interactions on the observed properties of cortical neurons (Douglas et al., 1995; Carandini et al., 1997; Tsodyks & Sejnowski, 1995; Chance & Abbott, 2000; Wielaard, Shelley, McLaughlin, & Shapley, 2001).
5.1 Balanced Excitation and Inhibition. As discussed by Tsodyks and Sejnowski (1995), when a network with balanced excitation and inhibition is close to threshold, several properties obtain. (1) Fluctuations about the threshold are large and remain so, even in the continuum limit. (2) The network switches discontinuously from threshold to a tuned excited state, indicating the existence of a subcritical bifurcation. (3) Critical slowing down occurs; that is, the network state changes more slowly from threshold to tuned the closer the stimulus is to threshold (see also Cowan & Ermentrout, 1978); given realistic neural parameters, it takes about 50 msec to reach the peak of the tuned state. However, (4) if the input orientation bias is switched instantaneously to a new value, the network switches to the new tuned state in less than 50 msec, particularly if the inhibition in the network is faster than the excitation. Finally, (5) the orientation-tuned response is contrast invariant. Most, if not all, of these properties are found in the mathematical model described above. On a more technical level, we note that one consequence of having balanced excitation and inhibition is that to have any effect on the bifurcation process whereby new orientation-tuned stable states emerge, external stimuli, for example, those from the LGN, which we labeled as $h_l(r_j, \phi)$, must be of the form

$$h_l(r_j, \phi) = g_l C_j \left\{1 - \epsilon^{3/2} + \epsilon^{3/2} \cos(2[\phi - \Phi_j])\right\}$$

where $g_l$ measures the relative strengths of LGN inputs to excitatory and inhibitory populations in V1, and $C_j$ is the stimulus contrast at the $j$th hypercolumn. Thus, the nonoriented component $(1 - \epsilon^{3/2}) g_l C_j$ is $O(1)$ and couples to the steady-state equations, whereas the orientation-tuned component $\epsilon^{3/2} g_l C_j \cos(2[\phi - \Phi_j])$ is $O(\epsilon^{3/2})$ and couples to the cubic amplitude equations.
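This decomposition can be verified numerically: projecting the input onto the constant and $\cos(2[\phi - \Phi_j])$ components recovers an $O(1)$ unoriented part and an $O(\epsilon^{3/2})$ oriented part. A minimal sketch (illustrative, assumed values):

```python
import numpy as np

g, Cj, Phi, eps = 1.0, 2.0, 0.4, 0.1    # illustrative (assumed) values
e32 = eps ** 1.5

# Midpoint grid over one period of phi; grid means reproduce
# the normalized d phi / pi integrals for these harmonics.
N = 360
phi = -np.pi / 2 + (np.arange(N) + 0.5) * np.pi / N
h = g * Cj * (1 - e32 + e32 * np.cos(2 * (phi - Phi)))

unoriented = np.mean(h)                              # int h dphi/pi
oriented = 2 * np.mean(h * np.cos(2 * (phi - Phi)))  # cos(2[phi-Phi]) coefficient
```

The projections recover exactly the two components, with the oriented part smaller by the factor $\epsilon^{3/2}$.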
It follows that if V1 is not sitting close to a bifurcation point, so that much stronger modulations are needed to drive it to an orientation-tuned state, then a different mathematical analysis is needed. In this respect, it is clear that the noise fluctuations described above can play an important role in keeping the network dynamics close to the bifurcation point. As recently noted by Anderson, Lampl, Gillespie, and Ferster (2000), noise fluctuations can also secure contrast invariance in the tuned response of a population of spiking neurons, which is a property we found in the uncoupled hypercolumn model (Bressloff et al., 2000) using sigmoid population equations that effectively take account of effects produced by stochastic resonance.

5.2 Divisive Inhibition. Another well-known set of studies concerns the use of recurrent inhibition to model the nearly linear behavior of simple cortical cells. Thus, Carandini et al. (1997) have explored the use of shunting, or divisive, inhibition provided by a pool of inhibitory neurons to linearize neural responses and produce contrast invariance and saturation properties consistent with observations. As Majaj, Smith, and Movshon (2000) recently showed, many contextual effects can be modeled within this framework if the inhibition acts to normalize the neural response. Thus, cross-orientation suppression within the classical receptive field is easily obtained. In general, many suppression experiments can be modeled with such a mechanism, but it is more difficult to account for effects such as cross-orientation facilitation by stimuli in the nonclassical receptive field. The idea of using division rather than subtraction to model the effects of inhibition has been around for a long time. One early mathematical use was by Grossberg (1973), who developed population equations similar to those of Wilson and Cowan (1972, 1973), but using divisive inhibition. An immediate problem is then to find a biophysical basis for such a mechanism. Recent work by Wielaard et al. (2001) has provided a natural solution. If one uses conductance-based models of the evolution of changes in membrane potentials in a network of integrate-and-fire neurons with recurrent excitation and inhibition, and if the conductances are sufficiently large, then the steady-state voltages reached depend both on differences between excitation and inhibition and on ratios of conductances, in such a way as to implement divisive inhibition. One can then use such conductance models to see how linear simple cell properties can emerge from network interactions. It is the balance between excitation and inhibition that produces effective linearity. A similar but less biophysically direct model was also recently introduced by Chance and Abbott (2000). However, they went further in devising a circuit to maintain rapid switching in the presence of critical slowing down by using a three-population model comprising one excitatory and two (divisive) inhibitory populations.
This circuit is more elaborate than that of Tsodyks and Sejnowski (1995) and in fact is very close to the three-neuron circuit we have studied in this article (see Figure 1), except that we have used subtractive inhibition. But the mathematical analysis carried out in this article proceeds by first linearizing the basic equations about an equilibrium state and then carrying out various perturbation expansions. It follows that either subtractive or divisive inhibition will lead to very similar equations and conclusions. The only difference lies in the particular parameter values needed to obtain the various behaviors. Thus, our results are qualitatively similar to what can be expected from a three-neuron circuit employing divisive rather than subtractive inhibition, particularly as we deal only with steady-state behaviors in this article. Given the framework described above, we note again that the analysis outlined in this article provides an account of both suppression and facilitation in a single model. We expect that the equivalent divisive inhibitory model will also exhibit such properties, at least if stationary tuning curves are analyzed. It remains to be seen what differences, if any, exist in the case of time-dependent responses.
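The divisive effect of large conductances described above can be illustrated with a minimal single-compartment sketch (this is not the model of Wielaard et al., 2001; the function and all parameter values are hypothetical illustrations). With a shunting inhibitory reversal potential equal to the leak reversal, the steady-state depolarization produced by excitation is divided by the total conductance:

```python
# Steady state of  C dV/dt = -gL (V - VL) - gE (V - VE) - gI (V - VI).
# With a shunting inhibitory reversal VI = VL, the depolarization is
#   V - VL = gE (VE - VL) / (gL + gE + gI),
# so inhibition divides the excitatory drive instead of subtracting from it.
# Conductances (nS) and reversal potentials (mV) are illustrative.

def steady_state_voltage(gL, gE, gI, VL=-70.0, VE=0.0, VI=-70.0):
    """Equilibrium membrane potential for fixed synaptic conductances."""
    return (gL * VL + gE * VE + gI * VI) / (gL + gE + gI)

VL = -70.0
for gI in (0.0, 15.0, 25.0):
    V = steady_state_voltage(gL=10.0, gE=15.0, gI=gI)
    print(f"gI = {gI:4.1f} -> depolarization = {V - VL:5.2f} mV")
# Doubling the total conductance (gI = 25.0) halves the gI = 0 depolarization.
```

Subtractive inhibition would instead shift the depolarization down by a fixed amount independent of gE; here the same inhibitory input rescales the slope of the response to gE, which is the sense in which the text calls the mechanism divisive.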
Amplitude Equation Approach
521
5.3 Contrast Dependence. One other benefit is provided by the three-neuron circuit described here. In various parts of the article, we have imposed contrast dependence of the lateral connectivity in a somewhat ad hoc manner simply by assuming the effective coupling coefficient b of the lateral connections between hypercolumns to be of the form b = b_E − b_I(C), where b_E is the excitatory coupling between hypercolumns and C is the stimulus contrast. Evidently, if b_I(C) increases monotonically with contrast, there exists a contrast threshold at which b < 0. One relatively straightforward way to achieve this is to suppose that basket cells, which receive about 20% of the signal from excitatory lateral connections (McGuire et al., 1991), have relatively high thresholds and are therefore not activated by low-contrast stimuli. In effect, this implies that the effective excitatory part of the aggregate hypercolumn receptive field is larger at low contrast (Movshon, pers. comm., 2001).
Acknowledgments
We thank Trevor Mundel for many helpful discussions. This work was supported in part by grant 96-24 from the James S. McDonnell Foundation to J. D. C. The research of P. C. B. was supported by a grant from the Leverhulme Trust. P. C. B. wishes to thank the Mathematics Department, University of Chicago, for its hospitality and support. P. C. B. and J. D. C. also thank Geoffrey Hinton and the Gatsby Computational Neuroscience Unit, University College, London, for hospitality and support.
References
Anderson, S., Lampl, I., Gillespie, D. C., & Ferster, D. (2000). The contribution of noise to contrast invariance of orientation tuning in cat visual cortex. Science, 290(5498), 1968–1972. Aronson, D. G., Ermentrout, G. B., & Kopell, N. (1990). Amplitude response of coupled oscillators. Physica D, 41, 403–449. Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Nat. Acad. Sci., 92, 3844–3848.
Ben-Yishai, R., Hansel, D., & Sompolinsky, H. (1997). Traveling waves and the processing of weakly tuned inputs in a cortical network module. J. Comput. Neurosci., 4, 57–77. Blakemore, C., & Tobin, E. (1972). Lateral inhibition between orientation detectors in the cat’s visual cortex. Exp. Brain Res., 15, 439–440. Blasdel, G. G. (1992). Orientation selectivity, preference, and continuity in monkey striate cortex. J. Neurosci., 12, 3139–3161. Blasdel, G. G., & Salama, G. (1986). Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321, 579–585.
Bonhoeffer, T., Kim, D., Malonek, D., Shoham, D., & Grinvald, A. (1995). Optical imaging of the layout of functional domains in area 17/18 border in cat visual cortex. European J. Neurosci., 7(9), 1973–1988. Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J. Neurosci., 17, 2112–2127. Bressloff, P. C., Bressloff, N. W., & Cowan, J. D. (2000). Dynamical mechanism for sharp orientation tuning in an integrate-and-fire model of a cortical hypercolumn. Neural Comput., 12, 2473–2511. Bressloff, P. C., Cowan, J. D., Golubitsky, M., Thomas, P. J., & Wiener, M. (2001a). Geometric visual hallucinations, Euclidean symmetry and the functional architecture of striate cortex. Phil. Trans. Roy. Soc. Lond. B, 356, 299–330. Bressloff, P. C., Cowan, J. D., Golubitsky, M., Thomas, P. J., & Wiener, M. (2001b). What geometric visual hallucinations tell us about the visual cortex. Neural Comput., 13, 1–19. Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. J. Neurosci., 17(21), 8621–8644. Chance, F. S., & Abbott, L. F. (2000). Divisive inhibition in recurrent networks. Network: Comp. Neural Sys., 11, 119–129. Cowan, J. D. (1997). Neurodynamics and brain mechanisms. In M. Ito, Y. Miyashita, & E. T. Rolls (Eds.), Cognition, computation, and consciousness (pp. 205–233). Oxford: Oxford University Press. Cowan, J. D., & Ermentrout, G. B. (1978). Some aspects of the “eigenbehavior” of neural nets. In S. Levin (Ed.), Studies in mathematical biology, 15 (pp. 67–117). Providence, RI: Mathematical Association of America. DeAngelis, G., Robson, J., Ohzawa, I., & Freeman, R. (1992). Organization of suppression in receptive fields of neurons in cat visual cortex. J. Neurophysiol., 68(1), 144–163. Douglas, R., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995).
Recurrent excitation in neocortical circuits. Science, 269, 981–985. Dragoi, V., & Sur, M. (2000). Some properties of recurrent inhibition in primary visual cortex: Contrast and orientation dependence on contextual effects. J. Neurophysiol., 83, 1019–1030. Ermentrout, G. B. (1998). Neural networks as spatial pattern forming systems. Rep. Prog. Phys., 61, 353–430. Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biol. Cybernetics, 34, 137–150. Ferster, D., Chung, S., & Wheat, H. (1997). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–281. Fitzpatrick, D. (2000). Seeing beyond the receptive field in primary visual cortex. Curr. Op. in Neurobiol., 10, 438–443. Fitzpatrick, D., Zhang, Y., Schofield, B., & Muly, M. (1993). Orientation selectivity and the topographic organization of horizontal connections in striate cortex. Soc. Neurosci. Abstracts, 19, 424. Gilbert, C., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. Proc. Nat. Acad. Sci., 93, 615–622.
Gilbert, C., & Wiesel, T. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3, 1116–1133. Gilbert, C., & Wiesel, T. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9, 2432–2442. Gray, C. M., Konig, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337. Grinvald, A., Lieke, E., Frostig, R., & Hildesheim, R. (1994). Cortical point-spread function and long-range lateral interactions revealed by real-time optical imaging of macaque monkey primary visual cortex. J. Neurosci., 14, 2545–2568. Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52(3), 213–257. Grossberg, S., & Raizada, R. D. S. (2000). Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Res., 40, 1413–1432. Hata, Y., Tsumoto, T., Sato, H., Hagihara, K., & Tamura, H. (1988). Inhibition contributes to orientation selectivity in visual cortex of cat. Nature, 335, 815–817. Hirsch, J., & Gilbert, C. (1992). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Physiol. Lond., 160, 106–154. Hoppensteadt, F., & Izhikevich, E. (1997). Weakly connected neural networks. New York: Springer. Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. Lond., 160, 106–154. Jagadeesh, B., & Ferster, D. (1990). Receptive field lengths in cat striate cortex can increase with decreasing stimulus contrast. Soc. Neurosci. Abstr., 16, 130.11. Kath, W. L. (1981). Resonance in periodically perturbed Hopf bifurcation. Stud. Appl. Math., 65, 95–112. Levitt, J. B., & Lund, J. (1997). Contrast dependence of contextual effects in primate visual cortex.
Nature, 387, 73–76. Li, C., & Li, W. (1994). Extensive integration field beyond the classical receptive field of cat's striate cortical neurons: Classification and tuning properties. Vision Res., 34(18), 2337–2355. Li, Z. (1999). Pre-attentive segmentation in the primary visual cortex. Spatial Vision, 13, 25–39. Majaj, N., Smith, M. A., & Movshon, J. A. (2000). Contrast gain control in macaque area MT. Soc. Neurosci. Abstr. Malach, R., Amir, Y., Harel, M., & Grinvald, A. (1993). Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci., 90, 10469–10473. McGuire, B., Gilbert, C., Rivlin, P., & Wiesel, T. (1991). Targets of horizontal connections in macaque primary visual cortex. J. Comp. Neurol., 305, 370–392. McLaughlin, D., Shapley, R., Shelley, M., & Wielaard, D. J. (2000). A neuronal network model of macaque primary visual cortex (V1): Orientation tuning and dynamics in the input layer 4Cα. Proc. Natl. Acad. Sci., 97, 8087–8092.
Michalski, A., Gerstein, G., Czarkowska, J., & Tarnecki, R. (1983). Interactions between cat striate cortex neurons. Exp. Brain Res., 53, 97–107. Mundel, T. (1996). A theory of cortical edge detection. Unpublished doctoral dissertation, University of Chicago. Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Visual cortex circuitry and orientation tuning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 886–893). Cambridge, MA: MIT Press. Obermayer, K., & Blasdel, G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13, 4114–4129. Phleger, B., & Bonds, A. B. (1995). Dynamic differentiation of GABA_A-sensitive influences on orientation selectivity of complex cells in the cat striate cortex. Exp. Brain Res., 104, 81–88. Polat, U., Mizobe, K., Pettet, M. W., Kasamatsu, T., & Norcia, A. M. (1998). Collinear stimuli regulate visual responses depending on a cell's contrast threshold. Nature, 391, 580–584. Pugh, M. C., Ringach, D. L., Shapley, R., & Shelley, M. J. (2000). Computational modeling of orientation tuning dynamics in monkey primary visual cortex. J. Comput. Neurosci., 8, 143–159. Rockland, K. S., & Lund, J. (1983). Intrinsic laminar lattice connections in primate visual cortex. J. Comp. Neurol., 216, 303–318. Sceniak, M. P., Ringach, D. L., Hawken, M. J., & Shapley, R. (1999). Contrast's effect on spatial summation by macaque V1 neurons. Nature Neurosci., 2, 733–739. Sillito, A. M. (1975). The contribution of inhibitory mechanisms to the receptive-field properties of neurones in the striate cortex of the cat. J. Physiol. Lond., 250, 305–329. Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J., & Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378, 492–496. Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. of Neurosci., 18, 555–586.
Somers, D., Nelson, S., & Sur, M. (1994). Effects of long-range connections on gain control in an emergent model of visual cortical orientation selectivity. Soc. Neurosci. Abstr., 20, 646.7. Somers, D., Nelson, S., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465. Somers, D., Todorov, E. V., Siapas, A. G., Toth, L. J., Kim, D.-S., & Sur, M. (1998). A local circuit approach to understanding integration of long-range inputs in primary visual cortex. Cerebral Cortex, 8, 204–217. Stetter, M., Bartsch, H., & Obermayer, K. (2000). A mean field model for orientation tuning, contrast saturation, and contextual effects in the primary visual cortex. Biol. Cybernetics, 82, 291–304. Toth, L. J., Rao, S. C., Kim, D., Somers, D., & Sur, M. (1996). Subthreshold facilitation and suppression in primary visual cortex revealed by intrinsic signal imaging. Proc. Natl. Acad. Sci., 93, 9869–9874. Tsodyks, M., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network: Comp. Neural Sys., 6, 1–14.
Vidyasagar, T. R., Pei, X., & Volgushev, M. (1996). Multiple mechanisms underlying the orientation selectivity of visual cortical neurons. TINS, 19(7), 272–277. Walker, G. A., Ohzawa, I., & Freeman, R. D. (1999). Asymmetric suppression outside the classical receptive field of the visual cortex. J. Neurosci., 19, 10536–10553. Wenderoth, P., & Johnstone, S. (1988). The different mechanisms of the direct and indirect tilt illusions. Vision Res., 28, 301–312. Wielaard, D., Shelley, M., McLaughlin, D., & Shapley, R. (2001). How simple cells are made in a nonlinear network model of the visual cortex. J. Neurosci., 21, 5203–5211. Wiener, M. C. (1994). Hallucinations, symmetry, and the structure of primary visual cortex: A bifurcation theory approach. Unpublished doctoral dissertation, University of Chicago. Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24. Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80. Yoshioka, T., Blasdel, G. G., Levitt, J. B., & Lund, J. (1996). Relation between patterns of intrinsic lateral connectivity, ocular dominance, and cytochrome oxidase-reactive regions in macaque monkey striate cortex. Cerebral Cortex, 6, 297–310. Received January 10, 2001; accepted May 25, 2001.
LETTER
Communicated by Fred Rieke
Derivation of the Visual Contrast Response Function by Maximizing Information Rate
Allan Gottschalk
[email protected] Department of Anesthesiology and Critical Care Medicine, Johns Hopkins Medical Institutes, Baltimore, MD 21287, U.S.A. A graph of neural output as a function of the logarithm of stimulus intensity often produces an S-shaped function, which is frequently modeled by the hyperbolic ratio equation. The response of neurons in early vision to stimuli of varying contrast is an important example of this. Here, the hyperbolic ratio equation with a response exponent of two is derived exactly by considering the balance between information rate and the neural costs of making that information available, where neural costs are a function of synaptic strength and spike rate. The maximal response and semisaturation constant of the neuron can be related to the stimulus ensemble and therefore shift accordingly to exhibit contrast gain control and normalization. 1 Introduction
The response of neurons in early vision to variations in stimulus contrast has been studied extensively (Albrecht & Hamilton, 1982; Sclar & Freeman, 1982; Albrecht, Farrar, & Hamilton, 1984; Ohzawa, Sclar, & Freeman, 1985; Sclar, Maunsell, & Lennie, 1990; Bonds, 1991; Geisler & Albrecht, 1992; Wilson & Humanski, 1993) and is an important example of the many types of neural data that can be fit by the hyperbolic ratio equation (Naka & Rushton, 1966),
R = R_Max C^n / (C^n + C_{1/2}^n),    (1.1)
where the response R is an S-shaped function of the logarithm of the stimulus amplitude C, R_Max is the maximal possible response, n is the response exponent, and C_{1/2} is the semisaturation constant, the amplitude of the stimulus that will produce a response of R_Max/2. This relationship is also well known from enzyme kinetics (Koshland, Goldbeter, & Stock, 1982). What is less clear is the role that such a relationship plays in the processing of sensory inputs, particularly the stimulus-dependent alterations in R_Max and/or C_{1/2} that are often observed in cortical neurons but not those of the lateral geniculate nucleus (Albrecht et al., 1984; Ohzawa et al., 1985; Bonds, 1991; Geisler & Albrecht, 1992). Hypotheses regarding these stimulus-dependent alterations range from neural fatigue to a neural strategy designed to maximize sensitivity to adjustments in stimulus amplitude (Albrecht et al., 1984; Ohzawa et al., 1985; Ullman & Schechtman, 1982).
Neural Computation 14, 527–542 (2002) © 2002 Massachusetts Institute of Technology
2 Derivation of the Hyperbolic Ratio Equation
To investigate these possibilities and to motivate the existence of an S-shaped neural response to sensory stimuli, consider the simple neural model in Figure 1A, where the objective is to adjust the size of the overall neural gain β to maximize the amount of information conveyed to the output about the input, subject to the costs associated with increases in neural gain and neural output power. In this context, the neural gain β is the product of both the transduction of the input at the synaptic level, where both pre- and postsynaptic mechanisms could be involved, and the translation of the resulting membrane potential to a given neural spike rate, whose power is also considered to be a limited neural resource. For the model of Figure 1A, the mutual information between g and h is well known (Gallegher, 1968) and given by (1/2) log(1 + β^2 λ/σ_n^2), where λ is the input power and σ_n^2 is the power of the neural noise. If neural gain is penalized as C_0 β^m and neural output power is penalized as k_0 β^2 λ, where C_0 and k_0 are, respectively, the relative costs of neural gain and neural output power, then the balance between mutual information and neural resource use is given by the criterion function
(1/2) log(1 + β^2 λ/σ_n^2) − C_0 β^m − k_0 β^2 λ.    (2.1)
The value of the neural gain β that maximizes this expression can be obtained by differentiating the expression with respect to β, setting the result equal to zero, and solving for β. For m = 2, the total output power (β^2 λ + σ_n^2) at the optimum is given exactly by
P = (2 k_0)^{-1} C^2 / (C^2 + C_0/k_0),    (2.2)
since the input power is obtained from the square of the stimulus amplitude (λ = C^2). Thus, R_Max ∼ (2 k_0)^{-1}, C_{1/2} ∼ (C_0/k_0)^{1/2}, and the response exponent n is equal to two in order to obtain the stimulus power from the stimulus amplitude. This relationship is depicted by the middle curve of Figure 1B. Although neural output power has a natural definition in the form of β^2 C^2 against which a penalty can be assessed, there is no natural penalty function for the neural gain β. However, as shown in Figure 1B, the optimal solution is somewhat robust with respect to the choice of a penalty function for neural gain from the family β^m, although the corresponding solutions are not nearly as tractable as equation 2.2 and must be obtained numerically. The above results readily generalize to the case where multiple inputs are involved without changing the form of equation 2.2. To demonstrate this,
Figure 1: Simplified neural channel and its optimal response function. (A) Zero-mean gaussian stimulus ensemble g of power λ undergoes neural gain β and corruption by an independent zero-mean gaussian noise source n of power σ_n^2 to produce the neural output h. (B) Optimal neural output power as a function of input amplitude for the gain that optimizes the balance between information rate and the neural costs of making that information available. Although both neural output power and neural gain are penalized, there is a natural definition for neural output power only; consequently, a penalty on neural gain proportional to β^m is considered for a range of m, where the value of m corresponding to each curve increases in the direction of the arrow. The optimal solution for the case m = 2 corresponds to the hyperbolic ratio equation, whereas the other results must be obtained numerically.
consider the same neural model as in Figure 1A except that multiple inputs are multiplied by the vector φ prior to being summed and corrupted by an additive gaussian noise source of power σ_n^2. Thus, the mutual information
between the input and output of the neuron becomes (1/2) log(1 + φ^T C φ/σ_n^2), where C is the correlation matrix of the inputs, and (·)^T denotes the vector transpose. In this context, the neural output power becomes φ^T C φ, and the cost of neural gain is φ^T φ. The optimum can then be obtained from
∂{(1/2) log(1 + φ^T C φ/σ_n^2) − C_0 φ^T φ − k_0 φ^T C φ}/∂φ = 0.    (2.3)
This can be solved by inspection and the use of earlier results. The optimal φ = βv is proportional to the eigenvector v of C that is associated with the largest eigenvalue λ of C, since this provides the greatest power with the smallest use of neural gain. If v is normalized such that v^T v = 1, the power of the optimal solution is given by
P = (2 k_0)^{-1} λ / (λ + C_0/k_0),    (2.4)
where the exponent of two is present implicitly since the eigenvalue λ is already an expression of stimulus power. Note that a penalty only on neural output power leads to an optimal solution where the relative distribution of the gains is arbitrary as long as they are sufficiently large.
3 Contrast Gain Control
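Equations 2.2 and 2.4 above can be verified numerically. The sketch below grid-searches the single-input criterion (equation 2.1 with m = 2) and compares the resulting total output power with the closed form of equation 2.2, then evaluates equation 2.4 for a two-input correlation matrix; the costs, noise power, and correlations are arbitrary illustrative values, not values from the article.

```python
import math

# Illustrative costs C0, k0 and noise power sig2 (not from the article).
C0, k0, sig2 = 0.1, 0.2, 0.1

def criterion(b, lam):
    """Equation 2.1 with m = 2: (1/2)ln(1 + b^2 lam/sig2) - C0 b^2 - k0 b^2 lam."""
    return 0.5 * math.log(1.0 + b * b * lam / sig2) - C0 * b * b - k0 * b * b * lam

# Single input (equation 2.2): grid-search the gain b on [0, 5], then compare
# the total output power b^2 lam + sig2 with the closed form.
C = 1.0
lam = C * C
b_opt = max((i * 1e-4 for i in range(50001)), key=lambda b: criterion(b, lam))
P_numeric = b_opt * b_opt * lam + sig2
P_closed = C * C / (2.0 * k0 * (C * C + C0 / k0))  # equation 2.2

# Multiple inputs (equation 2.4): the optimum lies along the principal
# eigenvector, so only the largest eigenvalue of the 2x2 correlation matrix
# [[c11, c12], [c12, c22]] is needed (closed form for the 2x2 case).
c11, c12, c22 = 1.0, 0.6, 0.5
tr, det = c11 + c22, c11 * c22 - c12 * c12
lam_max = 0.5 * (tr + math.sqrt(tr * tr - 4.0 * det))
P_multi = lam_max / (2.0 * k0 * (lam_max + C0 / k0))  # equation 2.4

print(f"single input: numeric P = {P_numeric:.4f}, closed form = {P_closed:.4f}")
print(f"two inputs:   lam = {lam_max:.4f}, optimal P = {P_multi:.4f}")
```

Differentiating equation 2.1 at m = 2 gives λ/(σ_n^2 + β^2 λ) = 2(C_0 + k_0 λ), so the optimal total power is λ/(2(C_0 + k_0 λ)), which is exactly the closed form the grid search reproduces.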
Thus far, these results have assumed that if justified by the information rate, sufficient neural resources will be available to achieve the optimum. However, it is conceivable that neural gain is regulated so that it remains bounded. This situation is investigated by adjoining the constraint β^m ≤ S_Max to the expression of equation 2.1 with the Lagrange multiplier ζ (Luenberger, 1969) and obtaining the optimum as before. The optimal neural output power when m = 2 becomes
P = (2 k_0)^{-1} C^2 / (C^2 + (C_0 + ζ)/k_0),    (3.1)
where ζ is zero unless the constraint is binding (β^m = S_Max), and where ζ is obtained algebraically by jointly solving equation 3.1 and the binding constraint. Note that the form of equation 3.1 is the same as equation 2.2, except that C_0 + ζ appears in place of C_0. In this context, the Lagrange multiplier ζ expresses the improvement in information rate that would occur for an increase in S_Max and therefore refines the balance between information rate and neural gain induced by C_0. If an adapting contrast determines the value of ζ but the neuron is still free to instantaneously adjust its output power accordingly, the contrast response function (CRF) will shift to the right in a manner analogous to physiologic contrast gain control (Albrecht et al., 1984; Ohzawa et al., 1985), as shown in Figure 2A for one-octave steps in the adapting contrast. When determining the contrast response of neurons in the visual pathway, drifting sinusoids of varying contrast are typically employed as the
Figure 2: (A) Shift of contrast response function (CRF) to the right for one-octave increments in the contrast (3.1%, 6.3%, 12.5%, 25%) of an adapting stimulus when neural gain is constrained. (B) CRFs for one-octave increments in the contrast (3.1%, 6.3%, 12.5%, 25%, 50%) of the adapting stimulus when both neural gain and neural output power have an impact on some common neural substrate, such as energy. For both panels, the curve on the left represents the unadapted state.
stimulus, and the response is reported as the spike rate obtained from the peristimulus time histogram from either the DC or first harmonic (Albrecht & Hamilton, 1982), depending on the linearity of the neuron being recorded from (e.g., lateral geniculate, simple, or complex). Spike rate is the usual output variable reported for physiological determinations of the contrast response and is the mean of the underlying point process. Consequently, neural output power is equated with neural spike rate in Figures 2 and 4 since, to a first approximation, the variance (power) and mean of such a process are proportional (Fienberg, 1974; Dean, 1981; Tolhurst, Movshon, & Dean, 1983). Consider what takes place if both neural gain and neural output power affect some regulated common neural substrate, such as cellular energy. In this case, the constraint β^2 λ + ρ β^m ≤ E_max is adjoined to the criterion function of equation 2.1 with the Lagrange multiplier μ, where ρ quantifies the relative impact of neural output power and neural gain on E_max. When m = 2, the optimal solution can be obtained using the previously described approach to reveal
P = (2(k_0 + μ))^{-1} C^2 / (C^2 + (C_0 + μρ)/(k_0 + μ)),    (3.2)
where μ = 0 unless the constraint is binding (β^2 λ + ρ β^m = E_max), and where μ can be obtained algebraically by jointly solving equation 3.2 and the binding constraint. Again, the optimal solution takes the same form as the prior equations, where only the constants differ. Figure 2B illustrates the shift in the CRF with one-octave steps in the adapting stimulus. In addition to a rightward shift in the CRF, a decrease in the maximal response occurs. Both of these effects are often observed in physiological measurements of contrast gain control (Albrecht et al., 1984; Ohzawa et al., 1985).
4 Impact of Static Nonlinearity on Information Rate
Apart from preventing unbounded use of neural resources, there is an additional motivation for the nervous system to regulate neural gain. Consider an implementation of the above relationships, where a static nonlinearity adjusts output power in accordance with equation 2.2. Although the inputs may be gaussian, the outputs of the nonlinearity are not, and the extent of the deviation from a gaussian distribution depends on the level of input power relative to the semisaturation constant. In addition to inducing distortions that the visual processor may or may not be able to compensate for, nonlinearities also limit the information rate since gaussian signals are the most information rich of all possible distributions (Gallegher, 1968). The decrease in information transmission due to the nonlinearity is illustrated in Figure 3, where the stimulus is processed in a push-pull manner (Palmer & Davis, 1981), with the nonlinearity applied separately to the positive and negative portions of the signal, which are then recombined with an appropriate
Figure 3: Static nonlinearity and its impact on information rate. (A) The input g and its negative value are half-wave rectified, passed through the static nonlinearity, and recombined with the appropriate sign before being corrupted by an additive independent zero-mean gaussian noise source. (B) Mutual information between g and h as a function of the semisaturation constant of the static nonlinearity for one-octave increments in the neural noise (σ_n^2 = 0.125, 0.25, 0.50, 1.0), where the stimulus power is 1.0. The mutual information for the nonlinear case (solid line) is shown along with that for a gaussian output of equal power (dashed line), where the curve with the smallest level of neural noise is shown at the top. The vertical line is the value of the semisaturation constant at which for half of the stimuli the amplitude of the input is at or below the value of the semisaturation constant. Mutual information is expressed in units of nats, which reflect the use of natural logarithms, and may be converted to bits by dividing by ln 2.
sign and corrupted by an additive independent gaussian noise source. The mutual information between the neural input and output is computed as a function of the semisaturation constant of the nonlinearity. Separate curves
are given for one-octave increments in the power of the neural noise while the stimulus input power remains constant. The results are contrasted with those for a gaussian process of the same output power. Note that differences between the gaussian (linear) and nongaussian (nonlinear) case are not observed until approximately the point (vertical line) where half of the stimuli have amplitudes below the value of the semisaturation constant and half above. Thus, the regulation of neural gain helps to conserve this resource and keeps the information rate close to its optimum value. In this sense, an increase in the semisaturation constant represents the increasing cost of neural gain at higher stimulus power since the corresponding information rate is less because of the nongaussian distribution of neural output induced by the nonlinearity.
5 Normalization
It is conceivable that there may be an upper limit to a neuron's output, despite the otherwise clear benefits of being able to increase its output in response to larger stimuli. Because of the explicit relationship between synaptic size and neural output power for the model of Figure 1A, a single neuron cannot, in general, have binding constraints on both. However, when multiple neurons are present, binding constraints on total neural gain and output power are possible. Consider constraints of the form Σ_i β_i^m ≤ S_max on neural gain and Σ_i β_i^2 λ_i ≤ P_max on total neural output power, which are adjoined to the criterion function by the Lagrange multipliers ζ and κ, respectively, and where the different neurons are indexed by i. The constraint on total neural power induces a type of normalization in the sense that neural power is collectively adjusted so that its sum remains within a given limit (Heeger, 1992a). Because multiple neurons are involved, each contributes a term of the form of equation 2.2 to the criterion function to which the above constraints are adjoined. This will be the case as long as the neural outputs remain uncorrelated. This permits an analytic solution for the optimal output in terms of the Lagrange multipliers, which is found to be
P_i = (2(k_0 + κ))^{-1} C_i^2 / (C_i^2 + (C_0 + ζ)/(k_0 + κ)),    (5.1)
which is the same as equation 3.1 except that k 0 C k now appears in place of k 0 , and C now corresponds to the constraint on total neural gain. Closedform solutions for the Lagrange multipliers cannot generally be obtained, and their solution must be obtained numerically with techniques such as Newton’s method (Luenberger, 1969). Although neural outputs are not necessarily uncorrelated, there are specic physiological examples to which this expression can be applied. Consider the example where a counterphase grating is adjusted so that the output of a given cortical neuron is zero, and then a test grating that stimulates the neuron is added to the stimulus (Geisler & Albrecht, 1992).
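The numerical solution for the Lagrange multipliers mentioned here can be sketched in a few lines. The snippet below is an illustrative toy, not the article's code: it assumes a per-neuron power of the hyperbolic-ratio form of equation 5.1 and uses simple bisection (rather than Newton's method) on the extra output-power multiplier; all numerical values (the squared contrasts, the baseline multiplier, the combined gain penalty, and the power limit) are invented.

```python
# Hedged sketch: find the extra output-power multiplier `lam` such that
# total neural output power meets an assumed limit P_max. The per-neuron
# power follows the hyperbolic-ratio form of equation 5.1; symbols and
# numbers are illustrative assumptions, not the paper's values.

def optimal_power(c2, lam_total, gamma_total):
    """Per-neuron optimal output power (cf. eq. 5.1)."""
    return (c2 / (c2 + gamma_total / lam_total)) / (2.0 * lam_total)

def solve_lambda(c2s, lam0, gamma_total, p_max, lo=1e-6, hi=1e3, tol=1e-10):
    """Bisect on `lam` so total power meets p_max.

    Total power equals sum_i c2 / (2 * ((lam0 + lam) * c2 + gamma)), which
    is strictly decreasing in lam, so bisection is well posed.
    """
    def total(lam):
        lt = lam0 + lam
        return sum(optimal_power(c2, lt, gamma_total) for c2 in c2s)
    if total(0.0) <= p_max:          # constraint not binding
        return 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) > p_max:       # still too much power: raise lam
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c2s = [0.01, 0.04, 0.16, 0.64]       # squared contrasts (assumed)
lam = solve_lambda(c2s, lam0=1.0, gamma_total=0.05, p_max=0.5)
total = sum(optimal_power(c2, 1.0 + lam, 0.05) for c2 in c2s)
```

Bisection is used here only because it is robust and short; a Newton iteration on the same monotone total-power function, as the text suggests, would converge faster.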
Visual Contrast Response Function and Information Rate
Figure 4: Shift in the CRF to the right and down in response to one-octave increments in stimulus contrast to surrounding neurons when total output power and neural gain are constrained. The use of linear axes permits these results to be compared with those in the physiological literature (Geisler & Albrecht, 1992). See the text for details.
For a fixed contrast level of the counterphase stimulus, the CRF of the neuron is then obtained by varying the contrast of the test stimulus. If the Lagrange multipliers from equation 5.1 are computed for each contrast level of the counterphase stimulus, a family of CRFs is generated that shift down and to the right in response to one-octave increments in the contrast level of the counterphase stimulus, as shown in Figure 4. Note that considering only a limit on total neural output power will not lead to a shift in the CRF that is both down and to the right, since increasing $\lambda$ by itself would lead to a decrease in the value $\Gamma_0/(\lambda_0 + \lambda)$ of the semisaturation constant.

6 Discussion
The hyperbolic ratio equation with a response exponent of two can be derived exactly from a simple information processing model that balances
Allan Gottschalk
information transmission against the costs of obtaining that information. This leads to an interpretation of neural data approximated by this relationship as efforts to maximize the extraction of sensory information from the environment with available neural resources. As derived here, the important parameters, but not the shape, of the hyperbolic ratio equation are functions of the costs of and constraints on the available neural resources, and these results lead to shifts in the CRF analogous to physiological descriptions of contrast gain control and normalization. Although simplistic, the model described here would lead to very different conclusions if either synaptic size or neural output power did not in some way constrain the acquisition of information. A system for which there was no penalty on synaptic size ($\Gamma_0 = 0$ in equation 2.2) should always operate near its maximal power, whereas one for which there was no constraint on neural output power ($\lambda_0 = 0$ in equation 2.2) should linearly increase its output as a function of stimulus power. These results provide a basis for the expansive nonlinearity that is thought to underlie a large array of visual data, including directional selectivity (Albrecht & Geisler, 1991; Heeger, 1992a, 1992b, 1993). Moreover, the interpretation of the CRF given here provides a means of estimating the balance between information rate and the neural resources allocated to making that information available, since the semisaturation constant and maximal response are readily estimated from the CRF and may be related to $\Gamma_0$, $\Gamma$, $\lambda_0$, and $\lambda$ for a family of CRFs obtained under different adaptation conditions. The full neurobiological justification for using mutual information as a component of the criterion function remains to be established.
In the current context, its use sets an upper bound on system performance, and the extent to which the results presented here are compatible with established physiology argues that mutual information is relevant. The chief concern about very general information measures based on the stimulus ensemble is that, by themselves, they presume that transduction of each member of the stimulus ensemble is important to the survival of the organism and species in proportion to its presence in the stimulus ensemble. However, because the current model determines an optimal gain only as a function of stimulus contrast, as opposed to a spatial or temporal filter, it is not emphasizing a particular component of the stimulus ensemble and only seeks to best preserve the outputs of earlier filtering operations without undue burden to the system. Since redundancy is the only means of preserving information in a noisy environment and redundancy requires neural resources, the need to strike this balance makes clear the benefit of avoiding excessive redundancy (Attneave, 1954; Barlow, 1961). This was recognized in earlier work that used graphical arguments to obtain the CRF of the fly eye from the distribution of visual stimuli (Laughlin, 1981). By linking the parameters that govern the shape of the CRF with an information measure, it may be possible to impute these aspects of cortical information processing from contrast response data.
An issue that may affect the shape of the derived CRF is the assumption made concerning the neural noise, which was introduced as an independent additive gaussian source. Although this maintained the analytic tractability that permitted the above relationships to be obtained, neural noise is stimulus dependent and varies for the most part as the stimulus amplitude (Fienberg, 1974). In early vision, this relationship can be quite complex, with a variance-to-mean ratio greater than the value of one that would be expected for a Poisson process (Dean, 1981; Tolhurst et al., 1983). The incorporation of statistical models that reflect this behavior could substantially affect the detailed shape of the model CRF. For the hyperbolic ratio equation that best fits the outputs of such models, this could lead to response exponents that are more consistent with those obtained physiologically (Albrecht & Hamilton, 1982; Sclar et al., 1990). Despite these limitations, the results presented here are significant in that they represent bounds on system performance, since gaussian signals are, for a given variance, the most information rich of all distributions, and additive gaussian noise is, for a given variance, the most corrupting (Gallegher, 1968). The simplified assumptions concerning signal and noise permitted an analytical expression for mutual information that was balanced against the costs of making that information available. These costs considered both the size of the synapse and the output of the neuron, and failure to include either one would result in qualitative behavior grossly at variance with physiological experience. The expression for the cost of neural output was an exact expression for output power, which relates directly to the expenditure of cellular energy, a commodity previously hypothesized to be an important factor in the rate of information transmission (Levy & Baxter, 1996).
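The gaussian bound invoked here is the familiar channel-capacity expression. As a reminder (standard information theory, not code from the article), the mutual information of a gaussian signal in additive gaussian noise is $\tfrac{1}{2}\log_2(1 + S/N)$ bits per sample, which is why the gaussian assumptions bound the performance of any signal and any additive noise of the same variances:

```python
import math

def gaussian_info_bits(signal_var, noise_var):
    """Mutual information (bits/sample) of an additive gaussian channel."""
    return 0.5 * math.log2(1.0 + signal_var / noise_var)

r1 = gaussian_info_bits(1.0, 1.0)  # SNR = 1 -> 0.5 bit/sample
r2 = gaussian_info_bits(3.0, 1.0)  # SNR = 3 -> 1.0 bit/sample
```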
The hypothesis that cortical neurons should strike a balance between information transmission and firing rate has been given some experimental support (Baddeley et al., 1997). Although it is known that information transfer across a synapse also requires the expenditure of cellular energy (Laughlin, Anderson, O'Carroll, & de Ruyter van Steveninck, 2000), and it is reasonable that larger synaptic connections should cost more than smaller ones, there is no natural expression for the cost of a synapse. Consequently, a family of synaptic cost functions was explored as part of Figure 1 and demonstrated that the properties of the optimal solution did not depend critically on the penalty for synaptic costs. Therefore, in order to facilitate the subsequent analysis, the function that permitted the optimal solution to be obtained analytically was employed, although, in principle, any member of the family of functions evaluated in Figure 1 could have been used. Although the assumptions introduced could account for many types of stimulus-dependent shifts in the CRF, they do not indicate why contrast gain control is present in cortical neurons but not in those of the lateral geniculate nucleus (Ohzawa et al., 1985). The scope of the model may be too limited to address this issue satisfactorily. It may be that such shifts in the responsiveness of lateral geniculate neurons prior to the formation
of cortical receptive fields would introduce distortion that could not be compensated for. A computational test of this hypothesis would require a model of cortical receptive field formation from lateral geniculate afferents. Another limitation of the model is the value of the response exponent of two, which emerged in the process of obtaining input power from input amplitude. However, measurements of the response exponent that corresponds to the steepness of the CRF appear to vary as a function of location along the visual pathway, species, and cell type. The response exponent is least in the lateral geniculate, larger and closest to two in V1, and even larger in MT (Sclar et al., 1990). Those of the monkey are greater than those of the cat, and those of complex cells may be slightly greater than those of simple cells (Albrecht & Hamilton, 1982). More realistic assumptions concerning the signal and noise statistics, as discussed above, as well as a more comprehensive model of how receptive fields from earlier points along the visual pathway are combined to generate the receptive fields of subsequent points, may be necessary to account fully for the detailed shape of the CRF at these different locations. It is reasonable to speculate as to the extent that implementation of the gain control and normalization schemes described here could be responsible for aspects of the influence of stimuli from outside the classical receptive field. Recent evidence supports the concept of a network of horizontal inhibitory interactions that are radially symmetric, decrease their influence as a function of cortical separation, and are independent of orientation preference (Das & Gilbert, 1999). In the context of the model used to generate the normalization in Figure 4, the interactions described by Das and Gilbert would permit normalization of receptive field output by local neural output power, where more proximal neural outputs exert the greatest influence. These relationships may not be straightforward.
Although the overall output by the stimulated portion of the cortex would lead to decreases in the maximal response of the unstimulated region, it would also drive the semisaturation constant in the direction of increased receptive field sensitivity, as long as the limitation on neural gain does not become dominant. These conditions would be particularly favored if regional neural output power were moderated by cortical interactions (Toyama, Kimura, & Tanaka, 1981a, 1981b; Das & Gilbert, 1999) whose spatial extent differed from that of a local neurochemical moderator of synaptic efficacy such as the diffuse retrograde neurotransmitter nitric oxide (Montague, Gancayco, Winn, Marchase, & Friedlander, 1994). These effects could contribute to the marked increases in receptive field sensitivity seen in experiments with an artificial scotoma (Pettet & Gilbert, 1992; DeAngelis, Anzai, Ohzawa, & Freeman, 1995; Gilbert, Das, Ito, Kapadia, & Westheimer, 1996). It is important to recognize that the results presented here do not determine the time course or means by which the above relationships are implemented. However, the form of the optimal solution contains terms involving pre- and postsynaptic correlations, which lead naturally to a Hebbian synaptic modification scheme (Gottschalk, 1997).

Figure 5: Change in optimal neural output power as stimulus power is varied from some reference value (arrow). These changes are shown as a function of pre- and postsynaptic correlation, which was determined for the optimum at each stimulus power level. Although pre- and postsynaptic correlation is always nonzero, whether the level of neural output at the optimum increases or decreases with respect to that of some reference stimulus depends on the difference between the new stimulus power and the reference level (prior experience). Effectively, prior experience sets a "sliding threshold" with respect to whether the synapse should increase or decrease in strength. The asymptote on the left occurs because eventually, the synapse is set to zero, since this may be more efficient than incurring the neural costs necessary to convey relatively little information.

An important aspect of this algorithm is that correlated pre- and postsynaptic activity will not always lead to an increase in synaptic strength, since this should increase only as long as the resulting information is appropriately balanced against the neural costs of making it available (see Figure 5). Otherwise, it may be optimal for synaptic efficacy to decrease, and in some cases, the information produced by a synapse does not justify its existence at all. Thus, synaptic strength for the optimal solution varies as stimulus power, but the direction of change depends entirely on prior stimulus levels. This balance between information and the neural resources necessary for making it available may be why "sliding threshold" models appear necessary to account for certain types of physiological data (Bienenstock, Cooper, & Munro, 1982; Bear, Cooper, & Ebner, 1987; Clothiaux, Bear, & Cooper, 1991). Examples of so-called anti-Hebbian behavior have emerged in other types of neural network analysis (Plumbley, 1993). In summary, nonlinearities of the form of the hyperbolic ratio equation, which are seen throughout sensory physiology, can be interpreted as a natural means of striking a balance between the information these neurons convey about their inputs and the neural costs of doing so. If either neural gain or output power were not costly and therefore highly regulated, the form of this nonlinearity should be very different. This balance between information and neural costs may play a vital role in neural development by explaining why Hebbian synapses with correlated pre- and postsynaptic activity might either increase or decrease their efficacy, or even disappear. These results also permit the stimulus-dependent shifts of the CRF in early visual cortex to be interpreted as an effort to maximize information when the resources for doing so are limited.

Acknowledgments
I thank Judith McLean, Larry A. Palmer, and Alan C. Rosenquist for their many helpful suggestions during the conduct of this work and the preparation of this article. This work was supported by grant EY10915 from the National Institutes of Health.

References

Albrecht, D. G., Farrar, S. B., & Hamilton, D. B. (1984). Spatial contrast adaptation characteristics of neurones recorded in the cat's visual cortex. J. Physiol. (Lond.), 347, 713–739.
Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast-response function of simple cells in the visual cortex. Visual Neuroscience, 7, 531–546.
Albrecht, D. G., & Hamilton, D. B. (1982). Striate cortex of monkey and cat: Contrast response function. J. Neurophysiol., 48, 217–237.
Attneave, F. (1954). Some informational aspects of visual perception. Psychol. Rev., 61, 183–193.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B, 264, 1775–1783.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Bear, M. F., Cooper, L. N., & Ebner, F. F. (1987). A physiological basis for a theory of synapse modification. Science, 237, 42–48.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Bonds, A. B. (1991). Temporal dynamics of contrast gain in single cells of the cat striate cortex. Visual Neuroscience, 6, 239–255.
Clothiaux, E. E., Bear, M. F., & Cooper, L. N. (1991). Synaptic plasticity in visual cortex: Comparison of theory with experiment. J. Neurophys., 66, 1785–1804.
Das, A., & Gilbert, C. D. (1999). Topography of contextual modulations mediated by short-range interactions in primary visual cortex. Nature, 399, 655–661.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44, 437–440.
DeAngelis, G. C., Anzai, A., Ohzawa, I., & Freeman, R. D. (1995). Receptive field structure in the visual cortex: Does selective stimulation induce plasticity? Proc. Natl. Acad. Sci. U.S.A., 92, 9682–9686.
Fienberg, S. E. (1974). Stochastic models for single neuron firing trains: A survey. Biometrics, 30, 399–427.
Gallegher, R. G. (1968). Information theory and reliable communication. New York: Wiley.
Geisler, W. S., & Albrecht, D. G. (1992). Cortical neurons: Isolation of contrast gain control. Vision Res., 32, 1409–1410.
Gilbert, C. D., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. Proc. Natl. Acad. Sci. U.S.A., 93, 615–622.
Gottschalk, A. (1997). Information based limits on synaptic growth in Hebbian models. In J. M. Bower (Ed.), Computational neuroscience: Trends in research, 1997 (pp. 309–313). New York: Plenum.
Heeger, D. J. (1992a). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.
Heeger, D. J. (1992b). Half-squaring in responses of cat striate cells. Visual Neuroscience, 9, 427–443.
Heeger, D. J. (1993). Modeling simple-cell direction selectivity with normalized, half-squared, linear operators. J. Neurophysiol., 70, 1885–1898.
Koshland, D. E., Goldbeter, A., & Stock, J. B. (1982). Amplification and adaptation in regulatory and sensory systems. Science, 217, 220–225.
Laughlin, S. B. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch., 36, 910–912.
Laughlin, S. B., Anderson, J. C., O'Carroll, D., & de Ruyter van Steveninck, R. (2000). Coding efficiency and the metabolic cost of sensory and neural information. In R. Baddeley, P. Hancock, & P. Földiák (Eds.), Information theory and the brain (pp. 41–61). New York: Cambridge University Press.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Computation, 8, 531–543.
Luenberger, D. G. (1969). Optimization by vector space methods. New York: Wiley.
Montague, P. R., Gancayco, C. D., Winn, M. J., Marchase, R. B., & Friedlander, M. J. (1994). Role of NO production in NMDA receptor-mediated neurotransmitter release in cerebral cortex. Science, 263, 973–977.
Naka, I. K., & Rushton, W. A. H. (1966). S-potentials from colour units in the retina of fish (Cyprinidae). J. Physiol. (Lond.), 185, 536–555.
Ohzawa, I., Sclar, G., & Freeman, R. D. (1985). Contrast gain control in the cat's visual system. J. Neurophysiol., 54, 651–667.
Palmer, L. A., & Davis, T. L. (1981). Receptive-field structure in cat striate cortex. J. Neurophysiol., 46, 260–276.
Pettet, M. W., & Gilbert, C. W. (1992). Dynamic changes in receptive-field size in cat primary cortex. Proc. Natl. Acad. Sci. U.S.A., 89, 8366–8370.
Plumbley, M. D. (1993). Efficient information transfer and anti-Hebbian neural networks. Neural Networks, 6, 823–833.
Sclar, G., & Freeman, R. D. (1982). Orientation selectivity in the cat's striate cortex is invariant with stimulus contrast. Exp. Brain Res., 4, 457–461.
Sclar, G., Maunsell, J. H. R., & Lennie, P. (1990). Coding of image contrast in central visual pathways of the macaque monkey. Vision Res., 30, 1–10.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res., 23, 775–785.
Toyama, K., Kimura, M., & Tanaka, K. (1981a). Cross-correlation analysis of interneuronal connectivity in cat visual cortex. J. Neurophysiol., 46, 191–201.
Toyama, K., Kimura, M., & Tanaka, K. (1981b). Organization of cat visual cortex as investigated by cross-correlation. J. Neurophysiol., 46, 202–214.
Ullman, S., & Schechtman, G. (1982). Adaptation and gain normalization. Proc. R. Soc. Lond. B, 216, 299–313.
Wilson, H. R., & Humanski, R. (1993). Spatial frequency adaptation and contrast gain control. Vision Res., 33, 1133–1149.

Received April 24, 2000; accepted May 25, 2001.
LETTER

Communicated by Christian Wehrhahn

A Bayesian Framework for Sensory Adaptation

Norberto M. Grzywacz
[email protected]
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089-1451, U.S.A.

Rosario M. Balboa
[email protected]
Departamento de Biotecnología, Universidad de Alicante, Apartado de Correos 99, 03080 Alicante, Spain

Adaptation allows biological sensory systems to adjust to variations in the environment and thus to deal better with them. In this article, we propose a general framework of sensory adaptation. The underlying principle of this framework is the setting of internal parameters of the system such that certain prespecified tasks can be performed optimally. Because sensorial inputs vary probabilistically with time and biological mechanisms have noise, the tasks could be performed incorrectly. We postulate that the goal of adaptation is to minimize the number of task errors. This minimization requires prior knowledge of the environment and of the limitations of the mechanisms processing the information. Because these processes are probabilistic, we formulate the minimization with a Bayesian approach. Application of this Bayesian framework to the retina is successful in accounting for a host of experimental findings.

1 Introduction
One of the most important properties of biological sensory systems is adaptation. This property allows these systems to adjust to variations in the environment and thus to deal better with them. Adaptation has been studied in many biological sensory systems (Thorson & Biederman-Thorson, 1974; Laughlin, 1989). Are there general principles that guide biological adaptation? A host of theoretical principles has been proposed in the literature, and many of them have been applied to the visual system (Srinivasan, Laughlin, & Dubs, 1982; Atick & Redlich, 1992; Field, 1994). Recently, we showed that many of these principles cannot account for spatial adaptation in the retina (Balboa & Grzywacz, 2000a). We then proposed that the principle underlying this type of adaptation is the setting of internal parameters of the retina such that it can perform optimally certain prespecified visual tasks (Balboa & Grzywacz, 2000b). Key to this proposal was that the visual information is probabilistic (Yuille & Bülthoff, 1996; Knill, Kersten, & Mamassian, 1996; Knill, Kersten, & Yuille, 1996) and retinal mechanisms noisy (Rushton, 1961; Fuortes & Yeandle, 1964; Baylor, Lamb, & Yau, 1979). Hence, any optimization would have to follow Bayesian probability theory (Berger, 1985). For this purpose, the visual system would have to store knowledge about the statistics of the visual environment (Carlson, 1978; van Hateren, 1992; Field, 1994; Zhu & Mumford, 1997; Balboa & Grzywacz, 2000b, 2000c; Balboa, Tyler, & Grzywacz, 2001; Ruderman & Bialek, 1994), and about its own mechanisms and neural limitations. In this article, we extend these principles to formalize a general framework of sensory adaptation. This framework begins with the Bayesian formulas but encodes tasks and errors through terms akin to the loss function (Berger, 1985), which quantifies the cost of making certain sensory decisions. As an example, we will show how to apply the framework to the spatial adaptation of retinal horizontal cells. The framework and some of the results appeared previously in abstract form (Balboa & Grzywacz, 1999).

Neural Computation 14, 543–559 (2002) © 2002 Massachusetts Institute of Technology

2 A Bayesian Framework of Adaptation

2.1 Rationale. The basic architecture of the new framework of adaptation comprises three stages (see Figure 1). The first, which we call the stage of preprocessing, transforms the sensory input into the internal language of the system. The second stage, task coding, transforms the output of the stage of preprocessing to a code that the system can use directly to perform its tasks.1 We propose that the tasks are the estimation of desired attributes from the world, which we call the task-input attributes. This estimation would be performed by specially implemented recovery functions (task-coding functions). Finally, the third stage estimates the typical error in performing the tasks, that is, the discrepancy between the task-input attributes and their estimates from the task-coding functions.
The key phrase here is typical error. Because the system does not know what the actual environmental attributes are and can only estimate them, it cannot know what the actual error is at every instant. Hence, the best the system can aim at is a statistically typical estimation of error for the possible ensemble of inputs. Such an estimate is possible only if one assumes that the environment is not random, which means that the statistics of the input to the system at a given time has some consistency with the statistics of the input at subsequent times. For instance, a forest at night produces mostly dark images and thus defines a temporally consistent environment for the visual system. In turn, the same forest during the day produces bright images and represents a different, consistent environment from the one defined at night.

1 We postulate a stage of preprocessing separate from a stage of task coding, because the output of the stage of preprocessing can be used for multiple tasks. For example, the output of retinal bipolar cells can carry contrast information, which could be useful for motion-, shape-, and color-based tasks.
Figure 1: Schematic of the new framework of adaptation. A multivariate input impinges on the system. The input is first processed by the stage of preprocessing, whose output is then fed to a task-coding apparatus. The goal of this apparatus is to prepare the code for the system to achieve the desired task. The apparatus has two outputs: one to fulfill the task and another to an error-estimation center. This center integrates the incoming information to evaluate the environment and then selects the appropriate prior knowledge from the world to use. In addition, the error-estimation center uses knowledge about the stage of preprocessing and the tasks to be performed. With this knowledge, the system estimates the mean task-performance error, not just for the current input but for all possible incoming inputs. The system then sends an adaptation signal to the stage of preprocessing so that this stage modifies its internal parameters to minimize the error.
The principle underlying our framework for adaptation is the setting of internal parameters to adjust the system to the environment such that the system can perform optimally certain prespecified tasks (Balboa & Grzywacz, 2000a, 2000b). This optimization means that the goal of the system is to minimize the expected number of errors in the task-coding process, adjusting the free parameters of adaptation at the stage of preprocessing such that the estimated error is as low as possible. Therefore, optimality implies that the system has knowledge about the environment and about the mechanisms and limitations of the system itself (for instance, noisiness, sluggishness, and small number of information channels) (see Figure 1). These kinds of knowledge could come from evolution, development, or learning.2

2 Although people often refer to these three processes as forms of adaptation, we are not addressing them in this article. We mention them here only as a means through which biological systems can attain knowledge useful for sensory adaptation.
2.2 Framework. This section expresses mathematically each element of the framework illustrated in Figure 1 and emphasized in section 2.1. Let IE be the input in Figure 1 (process labeled 1 in the gure) and H1 ( IE) , H2 ( IE) , . . . , HN E For instance, in sec(IE) the relevant task-input attributes to extract from I. tion 3, these attributes will be things like contrasts and positions of occluding borders in a visual image. The output (the process labeled 3) of the stage of E is transformed by the task-coding stage processing (process labeled 2), O, E E ). These functions are ( (the process labeled 4) to RH1 O ) , RH2 (OE ) , . . . , RHN ( O estimates of the values of the task-input attributes. To evaluate the error (the process labeled 5), Ei , in each of these estimates, one must measure the disE ) . The system cannot know exactly what crepancy between Hi ( IE) and RHi ( O this discrepancy is, since it does not have access to the input, only to its estimates. However, the system can estimate the expected amount of error. To do so, the system must have previous knowledge (the process labeled 6) E about the task, and about the stage of processing that about the probable I, E Adaptation (the process labeled 7) would be the setting of the produces O. stage-of-preprocessing parameters (A) such that the error is minimal over the ensemble of inputs.3 The error to be minimized is
E ( A) D
N X
Ei ( A ) .
(2.1)
iD 1
The error, Ei ( A) , can be dened using statistical decision theory (Berger, E RE H ) (where RE H D RH ( O E ) ), 1985). We begin by dening a loss function L ( I, i i i which measures the cost of deciding that the ith attribute is RE Hi given that E Using the loss function, statistical decision theory denes the the input is I. Bayesian expected loss as ± ² Z ± ² ± ² ER EH L E RE H , lA RE Hi D P A I| I, i i IE
(2.2)
E RE H ) is the conditional probability, with the parameters of adapwhere P A ( I| i tation set to A, that the correct interpretation of the input is IE given that the response of the system is RE Hi . The Bayesian expected loss is the mean loss 3
The parameters of adaptation can be thought of as a representation of the environment. Here, we choose to dene the framework through the parameters of adaptation instead of those of the environment (for instance, time of the day) for two reasons. First, our framework is a mathematical formulation for the role of adaptation. Second, there may not be a one-to-one correspondence between the parameters of the environment and those of adaptation. If we had chosen to parameterize our framework through the environment, then the framework would be similar to the Bayesian process of prior selection (Berger, 1985). In prior selection, one chooses the correct prior distribution to use before performing a task. One denes the various possible priors by special hyperparameters.
A Bayesian Framework for Sensory Adaptation
547
of the system given that response. We dene the error as the mean Bayesian expected loss over all possible responses, that is, Z ± ² ± ² ( ) (2.3) Ei A D P A RE Hi lA RE Hi . RE Hi
That we minimize the sum of these errors (see equation 2.1) is the same as using conditional Bayes’ principle of statistical decision theory (Berger, 1985). In simple words, we choose the “action” that minimizes the loss. From standard Bayesian analysis, one can obtain a particularly useful form of the error in equation 2.3 by using Bayes’ theorem,4 that is, ± ² ±² ± ² P A RE Hi | IE P IE ± ² . (2.4) PA IE | RE Hi D P A RE Hi ER E H ) in equation 2.2 and then plugging By substituting equation 2.4 for P A (I| i the result in equation 2.3, one gets Z Z ± ² ±² ± ² E RE H . (2.5) Ei ( A ) D P A RE Hi | IE P IE L I, i IE RE Hi
One can give intuitive and practical interpretations of the probability terms in this equation. The first term, the likelihood function $P_A(\vec{R}_{H_i} \mid \vec{I})$, embodies the knowledge about the sensory mechanisms; the second, $P(\vec{I})$, is the prior knowledge about the input. In other words, $P_A(\vec{R}_{H_i} \mid \vec{I})$ tells how the system responds when the stimulus is $\vec{I}$. Because the response is composed of the stage of preprocessing and the task-coding functions, it is sometimes useful to unfold equation 2.5 as

$$E_i(A) = \int_{\vec{R}_{H_i}} \int_{\vec{O}} \int_{\vec{I}} P_A\left(\vec{R}_{H_i} \mid \vec{O}\right) P_A\left(\vec{O} \mid \vec{I}\right) P\left(\vec{I}\right) L\left(\vec{I}, \vec{R}_{H_i}\right).$$
In this equation, we divided the knowledge of the sensory process into $P_A(\vec{R}_{H_i} \mid \vec{O})$ and $P_A(\vec{O} \mid \vec{I})$. The former reflects the task, while the latter reflects the computations performed in the stage of preprocessing. In this article, we assume for simplicity that the task-coding functions have no noise and thus use the simpler form in equation 2.5.

Note 4: We write $P(\vec{I})$ in this equation as if this probability function does not depend on the parameters of adaptation. However, as discussed in note 3, these parameters can be thought of as a representation of the environment. Hence, it would have been more correct to insert the subindex $A$ in $P(\vec{I})$ in equation 2.4. The reason that we avoid doing that is that for some readers, it might be confusing having a notation suggesting that the distribution of images depends on adaptation. We preferred not to use the subindex $A$ and to think of $P(\vec{I})$ as a prior distribution of the current environment (see section 2.1).
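For discrete inputs and responses, equation 2.5 reduces to a double sum, and the best adaptation state is simply the one minimizing that sum. The sketch below illustrates this with invented numbers: the two adaptation settings, the binary input, and the 0-1 loss are our own toy assumptions, not part of the framework.

```python
# Toy, discretized version of equation 2.5: the error of an adaptation
# setting A is E(A) = sum_R sum_I P_A(R | I) P(I) L(I, R), and the system
# should adopt the A that minimizes it.  All numbers are invented.

P_I = {0: 0.8, 1: 0.2}  # prior knowledge about the input, P(I)

# Likelihoods P_A(R | I) for two hypothetical adaptation settings.
# 'A0' reports both inputs with 90% reliability; 'A1' trades accuracy
# on the common input for accuracy on the rare one.
likelihood = {
    "A0": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},   # indexed [I][R]
    "A1": {0: {0: 0.6, 1: 0.4}, 1: {0: 0.01, 1: 0.99}},
}

def loss(I, R):
    """0-1 loss: unit cost for a wrong report, zero otherwise."""
    return 0.0 if I == R else 1.0

def error(A):
    """Equation 2.5 as a double sum over inputs and responses."""
    return sum(likelihood[A][I][R] * P_I[I] * loss(I, R)
               for I in P_I for R in (0, 1))

errors = {A: error(A) for A in likelihood}
best = min(errors, key=errors.get)
print(errors, best)
```

With these numbers, `A0` wins even though `A1` is nearly perfect on the rare input, because the prior weights the common input heavily; this is the sense in which the loss and the prior jointly pick the internal state.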
548
Norberto M. Grzywacz and Rosario M. Balboa
3 An Application of the Framework: Spatial Retinal Adaptation

3.1 Some Task-Input Attributes Proposed for the Retina. We previously proposed that retinal horizontal cells are part of a system to extract from images the position of occluding borders, the contrast at occluding borders, and the intensity away from them (Balboa & Grzywacz, 2000b). To quantify these image attributes (the functions $H_i$ of equation 2.5), we begin by defining contrast as $|\nabla I(\vec{r}) / I(\vec{r})|$ (see Balboa & Grzywacz, 2000b, 2000c, for a justification, and note 5). From this definition, the functions $H_i$ may be
$$H_1\left(I(\vec{r})\right) = P_O\!\left(\left|\frac{\nabla I(\vec{r})}{I(\vec{r})}\right|\right)$$
$$H_2\left(I(\vec{r})\right) = \left|\frac{\nabla I(\vec{r})}{I(\vec{r})}\right|$$
$$H_3\left(I(\vec{r})\right)^{n+1} + H_3\left(I(\vec{r})\right) = I(\vec{r})\, G\left(\bar{I}\right), \tag{3.1}$$
where $0 \le P_O(|\nabla I(\vec{r})/I(\vec{r})|) \le 1$ is the probability that a point with the given contrast has an occluding border and $0 < G(\bar{I}) \le 1$ is a prespecified, positive, decreasing function of the mean intensity, $\bar{I}$. Consequently, $H_1$, $H_2$, and $H_3$ quantify the likelihood of a border at position $\vec{r}$, contrast, and intensity, respectively. In this application, $H_1$ is made to depend only on $|\nabla I(\vec{r})/I(\vec{r})|$ for simplicity; one may introduce other variables in more complex models. The quantification of intensity by $H_3$ is indirect, but we can justify this quantification from independent results. The function $G$ transforms the right-hand side of the definition of $H_3$ (see equation 3.1) to a compressed function of $\bar{I}$. The polynomial form of $H_3$ causes it to be a fractional power law of the right-hand side. Hence, $H_3$ is a compressed function of intensity, allowing the retina to deal with the wide range of natural intensities (Shapley, 1997; Smirnakis, Berry, Warland, Bialek, & Meister, 1997; Adorjan, Piepenbrock, & Obermayer, 1999).

3.2 Prior Knowledge of Images. Many literature studies address the prior knowledge on the distribution of natural images (Carlson, 1978; van Hateren, 1992; Field, 1994; Zhu & Mumford, 1997; Ruderman & Bialek, 1994; Balboa & Grzywacz, 2000b, 2000c). These studies focus on certain statistical moments of the images and thus cannot tell what $P(\vec{I})$ is. What these
Note 5: For notational simplicity, we assume that the input is mapped to the output of the outer plexiform layer (OPL) in a one-to-one manner. (In this layer, the photoreceptor synapses serve as input, the horizontal cells serve as interneurons, and the bipolar cells serve as output.) Thus, we will use $\vec{r}$ to indicate position in both the input and the output. Alternatively, one could use subscripts to indicate the discrete positions of receptive-field centers (Srinivasan et al., 1982; Balboa & Grzywacz, 2000b).
studies reveal is that the moments obey certain regularities, and thus not all possible images occur naturally. The moments of interest here are those specified by equation 3.1, that is, the task-input attributes. For simplicity, the approach of this article is to assume first that $P(\vec{I})$ is constant over the range of naturally occurring images and zero outside it. Then we use a sample of natural images to estimate the occurrence of the attributes in images. (By the last assumption, the images in the sample have equal probability among themselves and the same probability as any image outside the sample.)

3.3 Knowledge of Neural Processes. To specify the knowledge of the neural processing, one has to describe what is known about the OPL (see note 5), define the task-coding functions (see Figure 1), and then calculate $P_A(\vec{R}_{H_i} \mid \vec{I})$. Unfortunately, the outputs of the task-coding functions have complex, nonlinear dependencies on $\vec{I}$. Therefore, it is hard to provide complete analytical formulas for $P_A(\vec{R}_{H_i} \mid \vec{I})$. In Balboa and Grzywacz (2000b), we diminished the need for such formulas through some reasonable approximations. A model for the OPL is what we use for the stage-of-preprocessing box of Figure 1. This model is based on a wealth of literature (for a review, see Dowling, 1987) and was presented elsewhere (Balboa & Grzywacz, 2000b). The basic equation of the model is
$$B\left(\vec{r}\right) = \frac{T\left(\vec{r}\right)}{1 + \left(h_A\left(\vec{r}\right) * B\left(\vec{r}\right)\right)^n}, \tag{3.2}$$
where $T(\vec{r})$ and $B(\vec{r})$ are the intensity-dependent input and output of the synapse of the photoreceptor, $*$ stands for convolution, $h_A(\vec{r})$ is the lateral-inhibition filter (which can change depending on the state $A$ of adaptation), and $n \ge 1$ is a constant. The variable $B(\vec{r})$ can be pre- or postsynaptic, since we assume here for simplicity that the photoreceptor-bipolar synapse is linear. It is also postulated that $h_A(\vec{r})$ is positive ($h_A(\vec{r}) > 0$), isotropic (if $|\vec{r}| = |\vec{r}\,'|$, then $h_A(\vec{r}) = h_A(\vec{r}\,')$), and normalized ($\int h_A(\vec{r}) = 1$). The normalization assumption is made for simplicity, because it was shown that the results are invariant with the integral of $h_A$ (Balboa & Grzywacz, 2000b). One can think of $T(\vec{r})$ as the current generated by phototransduction. Hence, a simple way to introduce the phototransduction adaptation is to define a gain function $G(\bar{I})$ such that $T(\vec{r}) = I(\vec{r})\, G(\bar{I})$ (see equation 3.1). In this case, equation 3.2 becomes

$$B\left(\vec{r}\right) = \frac{I\left(\vec{r}\right) G\left(\bar{I}\right)}{1 + \left(h_A\left(\vec{r}\right) * B\left(\vec{r}\right)\right)^n}. \tag{3.3}$$
Defined as in this equation, $B(\vec{r})$ provides a straightforward signal for the desired attributes specified in equation 3.1. If one stimulates the retina
with a full-field illumination and disregards noise, then in steady state, equation 3.3 becomes

$$\bar{B} = \frac{\bar{I}\, G\left(\bar{I}\right)}{1 + \bar{B}^n}, \tag{3.4}$$
which resembles the $H_3$ term in equation 3.1. Moreover, mathematical analysis (derivations not shown) shows that if one stimulates the retina with an edge and disregards noise, then in steady state, one gets to a good approximation at the edge:

$$\frac{\nabla B\left(\vec{r}\right)}{B\left(\vec{r}\right)} = \frac{\nabla I\left(\vec{r}\right)}{I\left(\vec{r}\right)}. \tag{3.5}$$
In other words, luminance contrast is proportional to the contrast in the bipolar cell responses.

3.4 Task-Coding Functions. The role of the task-coding functions is to estimate the task attributes ($H_i$) from the output of the OPL. By comparing equation 3.4 to the $H_3$ term of equation 3.1, one sees that $B$ and $H_3$ have the same dependence on light intensity. Therefore, to recover a compressed version of light intensity from the bipolar signals, all that the task-coding stage has to do is to read them directly. Comparison of equation 3.5 to the $H_2$ term of equation 3.1 yields a similar conclusion. The task-coding stage can compute the illumination contrast directly from the bipolar signals. To do so, this mechanism only has to compute the gradient of the bipolar cell responses and divide it by the responses themselves. Furthermore, assume that the function $P_O$ from the first term of equation 3.1 is known (Balboa & Grzywacz, 2000c). In this case, the task-coding stage can extract the border-position attribute from the contrast signal in the bipolar cells. For given adaptation settings ($A$), we can express these conclusions in the following equation:
$$R_{H_1}\left(\vec{r}\right) = P_O\!\left(\left|\frac{\nabla B\left(\vec{r}\right)}{B\left(\vec{r}\right)}\right|\right)$$
$$R_{H_2}\left(\vec{r}\right) = \left|\frac{\nabla B\left(\vec{r}\right)}{B\left(\vec{r}\right)}\right|$$
$$R_{H_3}\left(\vec{r}\right) = B\left(\vec{r}\right). \tag{3.6}$$
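The preprocessing and task-coding stages of sections 3.3 and 3.4 can be sketched numerically. Below, the fixed point of equation 3.3 is found by damped iteration on a one-dimensional step edge, and the readout of equation 3.6 recovers the contrast attribute. The flat filter, the damping factor, and $G(\bar{I}) = 1$ are our own illustrative choices, not values taken from the text.

```python
import numpy as np

# Sketch of the OPL preprocessing stage (equation 3.3) and the task-coding
# readout (equation 3.6) on a 1D step edge.  The flat lateral-inhibition
# filter, the damping factor 0.3, and G(I-bar) = 1 are assumptions.
n = 2
I = np.where(np.arange(100) < 50, 2.0, 0.2)   # intensity with an edge at 50
h = np.ones(11) / 11.0                        # flat, normalized filter h_A

B = I.copy()
for _ in range(1000):                         # damped fixed-point iteration
    B_new = I / (1.0 + np.convolve(B, h, mode="same") ** n)
    B = 0.7 * B + 0.3 * B_new
residual = np.max(np.abs(B - I / (1.0 + np.convolve(B, h, mode="same") ** n)))

# Task-coding readout (equation 3.6): contrast and compressed intensity
R_H2 = np.abs(np.gradient(B) / B)                  # |grad B / B|
R_H3 = B                                           # read bipolar signals directly
edge_estimate = 10 + int(np.argmax(R_H2[10:90]))   # skip array-boundary artifacts
print(residual, edge_estimate)
```

The contrast signal $|\nabla B / B|$ peaks at the luminance edge, which is exactly what the border-position readout $R_{H_1}$ needs; the damping is there only because the undamped iteration of equation 3.3 can overshoot.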
3.5 Retinal Loss Functions. We now must specify the cost of the system extracting these image attributes incorrectly. The first attribute ($H_1$) quantifies the likelihood that there is a border in a given position. A good loss
function for this attribute should penalize missing true occluding borders. (We worry less about false borders being detected, since the central visual system has heuristics to eliminate false borders, such as false borders not giving rise to long, continuous contours; Field, Hayes, & Hess, 1993; Kovacs & Julesz, 1993; Pettet, McKee, & Grzywacz, 1997.) For the contrast attribute ($H_2$), the loss function should increase as the contrast diverges from truth, especially at points likely to contain a border. In contrast, a good loss function for the third attribute ($H_3$) should penalize discrepancies of intensity estimation at points unlikely to be occluding borders. A loss function for the attribute $H_1$ that meets these conditions is

$$L\left(\vec{I}, \vec{R}_{H_1}\right) = \left(\int_{\vec{r}} H_1\left(I(\vec{r})\right)\left(1 - R_{H_1}\left(\vec{r}\right)\right)\left(R_{H_1}\left(\vec{r}\right) - H_1\left(I(\vec{r})\right)\right)^{2k}\right)^{1/2}, \tag{3.7}$$
where $k$ is an integer. This equation penalizes missing borders. The term $H_1(I(\vec{r}))$ enforces computations on input borders, while the term $(1 - R_{H_1}(\vec{r}))$ matters only if the system missed a border at $\vec{r}$. Because $0 \le H_1(I(\vec{r})), R_{H_1}(\vec{r}) \le 1$ (see equations 3.1 and 3.6), large values of $k$ force the loss function to penalize clear-cut errors. Such errors are $R_{H_1}(\vec{r}) \approx 0$ and $H_1(I(\vec{r})) \approx 1$. In contrast, results like $R_{H_1}(\vec{r}) = H_1(I(\vec{r})) = 0.5$, which are not in error but would contribute to equation 3.7 if $k = 0$, would not do so if $k \gg 1$. In general, this equation works as a counter of the number of errors instead of as a measure of their magnitudes. This equation is similar to the standard 0-1 loss function (Berger, 1985). For the attribute $H_2$, a loss function meeting the conditions above is

$$L\left(\vec{I}, \vec{R}_{H_2}\right) = \left(\int_{\vec{r}} H_1\left(I(\vec{r})\right)\left(R_{H_2}\left(\vec{r}\right) - H_2\left(I(\vec{r})\right)\right)^2\right)^{1/2}. \tag{3.8}$$
This equation also enforces computations on borders, as equation 3.7 does. The $H_2$ term in equation 3.8 expresses contrast discrepancies in absolute terms, since the visual system is sensitive to contrast. This equation is similar to a standard type of loss function called the squared-error loss (Berger, 1985) and to the $L_2$ metric of functional analysis (Riesz & Sz.-Nagy, 1990). Finally, a good loss function for the attribute $H_3$ is

$$L\left(\vec{I}, \vec{R}_{H_3}\right) = \left(\int_{\vec{r}} \left(1 - H_1\left(I(\vec{r})\right)\right)\left(\frac{R_{H_3}\left(\vec{r}\right) - H_3\left(I(\vec{r})\right)}{H_3\left(I(\vec{r})\right)}\right)^2\right)^{1/2}, \tag{3.9}$$
which computes errors away from borders (via the $1 - H_1(I(\vec{r}))$ term). The $H_3$ ratio of this equation shows that intensity discrepancies are measured as
a noise-to-signal ratio. Such a ratio allows the expression of the noise in a dimensionless form and is the same type of error measurement that successfully explained spatial adaptation in Balboa and Grzywacz (2000b).

4 Results
Elsewhere (Balboa & Grzywacz, 2000b), we showed that our retinal application of the general framework of adaptation can account correctly for the spatial adaptation in horizontal cells. This application can also explain the division-like mechanism of inhibitory action (see equation 3.3 and Balboa & Grzywacz, 2000b) and linearity at low contrasts (Tranchina, Gordon, Shapley, & Toyoda, 1981). Moreover, the application is consistent with the psychophysical Stevens' power law of intensity perception (Stevens, 1970). This section extends the validity of this application by showing that it can account for three new retinal effects. The first new retinal effect that this application of the framework can account for is the shrinkage of the extent of lateral inhibition for stimuli that are not spatially homogeneous. Experiments by Reifsnider and Tranchina (1995) show that background contrast reduces the OPL lateral spread of responses to superimposed stimuli (see also Smirnakis et al., 1997). Kamermans, Haak, Habraken, and Spekreijse (1996) performed neurally realistic computer simulations of the OPL, demonstrating a similar reduction. Figure 2 shows that our model can account for such reductions. Furthermore, this figure illustrates a qualitative prediction of our model: it can account for this reduction when the retina is stimulated by checkerboards. The smaller the squares in the checkerboard are, the smaller the extent of the predicted lateral inhibition. The reason is that with smaller squares, this extent must decrease to prevent as much as possible the straddling of multiple borders by lateral inhibition. If this straddling occurs, the model makes errors in border localization. Because an important task postulated in the retinal application of the general framework is the measurement of contrast, we argued for a division-like model of lateral inhibition (see section 3.3).
Figure 3 shows that a consequence of this model is the disappearance of lateral inhibition at low background intensities. The inhibition emerges only at high intensities, with its threshold decreasing as $n$ (the cooperativity of inhibition) increases. The disappearance of lateral inhibition at low background intensities is well known, having been demonstrated both physiologically (Barlow, Fitzhugh, & Kuffler, 1957; Bowling, 1980) and psychophysically (Van Ness & Bouman, 1967). Functionally, it is not too hard to understand why horizontal cell inhibition disappears in the model. As intensity falls, so do horizontal cell responses. This causes the conductance change in the photoreceptor synapse (the $\bar{B}^n$ term in equation 3.4) to be smaller than the resting conductance (the 1 in that equation). The conductance changes specified by the model also lead to a quantitative prediction about the modulation of the inhibitory strength
Figure 2: Dependence of lateral inhibition extent on the spatial structure of the image. The stimuli (checkerboard images) are those that appear in the insets. The curves display a bell-shaped behavior, with the lateral inhibition extent increasing as the size of the squares in the checkerboard increases.
with intensity. Because the phototransduction adaptation gain ($G(\bar{I})$) can be estimated from the photoreceptor literature (Hamer, 2000; Fain, Matthews, Cornwall, & Koutalos, 2001), equation 3.4 can be solved. With $\bar{B}$ in hand, one can then predict how the inhibitory strength increases with intensity and test this dependence experimentally. The division-like mechanism of lateral inhibition also has implications for the model's Mach bands. A property of Mach bands is that there is an asymmetry of positive and negative Mach bands at an intensity border (Fiorentini & Radici, 1957; Ratliff, 1965); the Mach band is larger on the positive side than it is on the negative side. This is a well-known Mach band effect without a good explanation until now. Figure 4 shows that this Mach band asymmetry occurs in our model. This figure also shows that the asymmetry is more prominent as the parameter $n$ increases. From our model, it is not difficult to understand this edge asymmetry. Although the inhibition coming from the positive side of the edge is strong, if the intensity at the negative side is near zero, then the response there will be near zero (see the $I(\vec{r})$ term in equation 3.3). In contrast, the intensity on the positive side of the edge is high, and thus the effect of lateral inhibition can work there, producing a large Mach band.
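The claim that inhibition disappears at low intensities can be checked directly from equation 3.4. The sketch below solves the steady-state equation by bisection (our own numerical choice, with $G(\bar{I}) = 1$) and evaluates the inhibitory strength $\bar{B}^n / (1 + \bar{B}^n)$ used in Figure 3.

```python
# Solve the steady-state equation 3.4, B = I*G(I)/(1 + B^n), with G = 1,
# and compute the inhibitory strength B^n / (1 + B^n) of Figure 3.
# Bisection works because B*(1 + B^n) - I is increasing in B.

def steady_B(I, n, iters=200):
    lo, hi = 0.0, max(I, 1.0)       # the root satisfies 0 <= B <= max(I, 1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * (1.0 + mid ** n) < I:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def strength(I, n=2):
    B = steady_B(I, n)
    return B ** n / (1.0 + B ** n)

intensities = [0.01, 0.1, 1.0, 10.0, 100.0]
vals = [strength(I) for I in intensities]
print(vals)  # rises monotonically from near 0 toward 1, as in Figure 3
```

At low intensity $\bar{B} \approx \bar{I}$, so the strength behaves like $\bar{I}^n$ and vanishes, which is the model's account of the physiological disappearance of lateral inhibition in the dark.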
Figure 3: Lateral inhibition strength as a function of intensity. These curves are parametric on $n$ (see equation 3.2) and assume $G(\bar{I}) = 1$; the inhibitory strength is defined as $\bar{B}^n / (1 + \bar{B}^n)$. As the intensity falls, the inhibition disappears, as indicated by its strength going to zero. The threshold for the emergence of inhibition as a function of intensity falls as $n$ increases.
5 Discussion

5.1 Adaptation as a Bayesian Process. Ours is not the first Bayesian framework of sensory processing. Other theories have proposed Bayesian mechanisms for perception (for instance, Yuille & Bülthoff, 1996; Knill, Kersten, & Mamassian, 1996; Knill, Kersten, & Yuille, 1996). Those theories begin with the output of neurons or filters and then ask how to interpret the environment most correctly. What is new in our framework is not the use of the Bayesian framework but the computation of the best internal state for the system. In other words, the new framework is concerned not with interpreting the current input but rather with setting the parameters of the system such that future interpretations are as correct as possible. Another novel aspect of the new framework is the emphasis on the loss function. Past theories tended to make their perceptual decision based on the maximal probability of interpreting the input correctly given some internal data (Yuille & Bülthoff, 1996; Knill, Kersten, & Mamassian, 1996; Knill, Kersten, & Yuille, 1996). The use of loss functions means that the most probable interpretation of the input may not be sufficient. Sometimes errors made by not picking less probable interpretations are more costly.
Figure 4: Mach band asymmetries in the retinal application of the framework. The stimulus was an edge at position 50 (arbitrary units) and of intensities 1.8 and 0.1 at positions lower and higher than 50, respectively. The responses of the bipolar cells were simulated parametric on $n$ (see equation 3.2), with $G(\bar{I}) = 1$, and the filter $h$ being flat and with a radius of 15. The Mach band was larger at the high-intensity side of the edge than at the low-intensity side. This Mach band asymmetry became more prominent as $n$ increased.
One positive aspect of using a Bayesian framework is that it forces us to state explicitly the assumptions of our models. In particular, one must state the knowledge that the system has about the environment and about its mechanisms, and state the tasks that the system must perform. How does one go about specifying tasks? We believe that tasks are chosen by both what the system needs to do and what it can do. For instance, although it would be lovely for a barnacle to perform recognition tasks, this animal has only ten photoreceptors. (It has four photoreceptors in its only median eye (Hayashi, Moore, & Stuart, 1985; Oland & Stuart, 1986) and three photoreceptors in each of its two lateral eyes (Krebs & Schaten, 1976; Oland & Stuart, 1986).) Nevertheless, its visual system is sufficiently good to allow the animal to hide in its shell when a predator approaches. Therefore, one should not assume automatically that the goal of the early sensory system is to maximize transmission of information (Haft & van Hemmen, 1998; Wainwright, 1999). Furthermore, the tasks that a system performs might be intimately coupled to its hardware, against what was argued by Marr (1982). The system must perform computation, and its hardware makes certain types of computations easier than others.

5.2 Limitations of the Framework. Elsewhere, we discuss the limitations of our retinal application of the framework (Balboa & Grzywacz, 2000b); here, we focus on problems related to the framework itself.
Perhaps the most serious problem with the framework is that to apply equation 2.5, one must have $P_A(\vec{R}_{H_i} \mid \vec{I})$, $P(\vec{I})$, and $L(\vec{I}, \vec{R}_{H_i})$, that is, the prior knowledge and the loss function. We assume that (but do not state how) the system would attain these things through evolution, development, and learning. A complete framework would have to specify the algorithms and mechanisms to attain the prior knowledge and the loss function. An even more serious problem is how to apply the framework when the environment changes (see note 4).

5.3 The Importance of Understanding Tasks in Neurobiology. We have provided one example of how to apply the new framework for sensory adaptation, namely, to spatial adaptation in horizontal cells. Besides specifying the neural processes, the example had to formalize the retinal tasks. As we showed here and elsewhere (Balboa & Grzywacz, 2000a, 2000b), the choice of the tasks can have a large impact on the behavior of the system. For instance, the assumed horizontal cell functions were border localization, contrast estimation, and intensity estimation. One could have argued that all of this may just be a consequence of contrast sensitivity (Shapley, 1997). However, requiring maximization of contrast sensitivity has different consequences from requiring optimization of border location. Responding to any edge is not the same as encoding its position with precision. The requirement of encoding position correctly is essential to obtain the correct behavior for the spatial adaptation of horizontal cells (Balboa & Grzywacz, 2000b). Hence, we argue that to understand the behavior of ensembles of nerve cells, it is not sufficient to figure out their neurobiological processes. One must also carefully study their information-processing tasks.

Acknowledgments
We thank Alan Yuille and Christopher Tyler for critical comments on the manuscript and Joaquín de Juan for support and many discussions in the early phases of this project. We also thank the Smith-Kettlewell Institute, where we performed a large portion of the work. This work was supported by National Eye Institute Grants EY08921 and EY11170 to N.M.G.

References

Adorjan, P., Piepenbrock, C., & Obermayer, K. (1999). Contrast adaptation and infomax in visual cortical neurons. Rev. Neurosci., 10(3–4), 181–200.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Comp., 4, 196–210.
Balboa, R. M., & Grzywacz, N. M. (1999). Biological evidence for an ecological-based theory of early retinal lateral inhibition. Invest. Ophthalmol. Vis. Sci., 40, S386.
Balboa, R. M., & Grzywacz, N. M. (2000a). The role of early retinal lateral inhibition: More than maximizing luminance information. Visual Neurosci., 17, 77–89.
Balboa, R. M., & Grzywacz, N. M. (2000b). The minimal-local asperity hypothesis of early retinal lateral inhibition. Neural Comp., 12, 1485–1517.
Balboa, R. M., & Grzywacz, N. M. (2000c). Occlusions and their relationship with the distribution of contrasts in natural images. Vision Res., 40, 2661–2669.
Balboa, R. M., Tyler, C. W., & Grzywacz, N. M. (2001). Occlusions contribute to scaling in natural images. Vision Res., 41, 955–964.
Barlow, H. B., Fitzhugh, R., & Kuffler, S. W. (1957). Change of organisation in the receptive fields of the cat's retina during dark adaptation. J. Physiol., 137, 338–354.
Baylor, D. A., Lamb, T. D., & Yau, K.-W. (1979). Responses of retinal rods to single photons. J. Physiol., 288, 613–634.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. New York: Springer-Verlag.
Bowling, D. B. (1980). Light responses of ganglion cells in the retina of the turtle. J. Physiol., 209, 173–196.
Carlson, C. R. (1978). Thresholds for perceived image sharpness. Phot. Sci. and Eng., 22, 69–71.
Dowling, J. E. (1987). The retina: An approachable part of the brain. Cambridge, MA: Belknap Press, Harvard University Press.
Fain, G. L., Matthews, H. R., Cornwall, M. C., & Koutalos, Y. (2001). Adaptation in vertebrate photoreceptors. Physiol. Rev., 81, 117–151.
Field, D. J. (1994). What is the goal of sensory coding? Neural Comp., 6, 559–601.
Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: Evidence for a local "association field." Vision Res., 33(2), 173–193.
Fiorentini, A., & Radici, T. (1957). Binocular measurements of brightness on a field presenting a luminance gradient. Atti. Fond. Giorgio Ronchi, 12, 453–461.
Fuortes, M. G. F., & Yeandle, S. (1964). Probability of occurrence of discrete potential waves in the eye of the Limulus. J. Physiol., 47, 443–463.
Haft, M., & van Hemmen, J. L. (1998). Theory and implementation of infomax filters for the retina. Network, 9(1), 39–71.
Hamer, R. D. (2000).
Analysis of Ca²⁺-dependent gain changes in PDE activation in vertebrate rod phototransduction. Mol. Vis., 6, 265–286.
Hateren, J. H. van (1992). Theoretical predictions of spatiotemporal receptive fields of fly LMCs, and experimental validation. J. Comp. Physiol. A, 171, 157–170.
Hayashi, J. H., Moore, J. W., & Stuart, A. E. (1985). Adaptation in the input-output relation of the synapse made by the barnacle's photoreceptor. J. Physiol., 368, 179–195.
Kamermans, M., Haak, J., Habraken, J. B., & Spekreijse, H. (1996). The size of the horizontal cell receptive fields adapts to the stimulus in the light adapted goldfish retina. Vision Res., 36, 4105–4119.
Knill, D. C., Kersten, D., & Mamassian, P. (1996). Implications of a Bayesian formulation of visual information processing for psychophysics. In D. C. Knill & W. Richards (Eds.), Perception as Bayesian inference (pp. 239–286). Cambridge: Cambridge University Press.
Knill, D. C., Kersten, D., & Yuille, A. (1996). Introduction: A Bayesian formulation of visual perception. In D. C. Knill & W. Richards (Eds.), Perception as Bayesian inference (pp. 1–21). Cambridge: Cambridge University Press.
Kovacs, I., & Julesz, B. (1993). A closed curve is much more than an incomplete one: Effect of closure in figure-ground segmentation. Proc. Natl. Acad. Sci. USA, 90, 7495–7497.
Krebs, W., & Schaten, B. (1976). The lateral photoreceptor of the barnacle, Balanus eburneus: Quantitative morphology and fine structure. Cell Tissue Res., 168, 193–207.
Laughlin, S. B. (1989). The role of sensory adaptation in the retina. J. Exp. Biol., 146, 39–62.
Marr, D. (1982). Vision. San Francisco: Freeman.
Oland, L. A., & Stuart, A. E. (1986). Pattern of convergence of the photoreceptors of the barnacle's three ocelli onto second-order cells. J. Neurophysiol., 55, 882–895.
Pettet, M. W., McKee, S. P., & Grzywacz, N. M. (1997). Constraints of long range interactions mediating contour detection. Vision Res., 38, 865–879.
Ratliff, F. (1965). Mach bands: Quantitative studies on neural networks in the retina. San Francisco: Holden-Day.
Reifsnider, E. S., & Tranchina, D. (1995). Background contrast modulates kinetics and lateral spread of responses to superimposed stimuli in outer retina. Vis. Neurosci., 12, 1105–1126.
Riesz, F., & Sz.-Nagy, B. (1990). Functional analysis. New York: Dover.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Lett., 73, 814–817.
Rushton, W. A. H. (1961). The intensity factor in vision. In W. D. McElroy & H. B. Glass (Eds.), Light and life (pp. 706–722). Baltimore, MD: Johns Hopkins University Press.
Shapley, R. (1997). Retinal physiology: Adapting to the changing scene. Curr. Biol., 7(7), R421–R423.
Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386(6620), 69–73.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B, 216, 427–459.
Stevens, S. S. (1970). Neural events and the psychophysical law. Science, 170, 1043–1050.
Thorson, J., & Biederman-Thorson, M. (1974). Distributed relaxation processes in sensory adaptation. Science, 183, 161–172.
Tranchina, D., Gordon, J., Shapley, R., & Toyoda, J. (1981). Linear information processing in the retina: A study of horizontal cell responses. Proc. Natl. Acad. Sci. USA, 78, 6540–6542.
Van Ness, F. L., & Bouman, M. A. (1967). Spatial modulation transfer in the human eye. J. Opt. Soc. Am., 57, 401–406.
Wainwright, M. J. (1999). Visual adaptation as optimal information transmission. Vision Res., 39, 3960–3974.
Yuille, A., & Bülthoff, H. H. (1996). Bayesian decision theory and psychophysics. In D. C. Knill & W. Richards (Eds.), Perception as Bayesian inference (pp. 123–161). Cambridge: Cambridge University Press.
Zhu, S. C., & Mumford, D. (1997). Prior learning and Gibbs reaction-diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 1236–1250.

Received November 13, 2000; accepted May 22, 2001.
LETTER
Communicated by Bard Ermentrout
Analysis of Oscillations in a Reciprocally Inhibitory Network with Synaptic Depression

Adam L. Taylor
[email protected]
Garrison W. Cottrell
[email protected]
William B. Kristan, Jr.
[email protected]
University of California, San Diego, La Jolla, CA 92093-0357, U.S.A.
We present and analyze a model of a two-cell reciprocally inhibitory network that oscillates. The principal mechanism of oscillation is short-term synaptic depression. Using a simple model of depression and analyzing the system in certain limits, we can derive analytical expressions for various features of the oscillation, including the parameter regime in which stable oscillations occur, as well as the period and amplitude of these oscillations. These expressions are functions of three parameters: the time constant of depression, the synaptic strengths, and the amount of tonic excitation the cells receive. We compare our analytical results with the output of numerical simulations and obtain good agreement between the two. Based on our analysis, we conclude that the oscillations in our network are qualitatively different from those in networks that oscillate due to postinhibitory rebound, spike-frequency adaptation, or other intrinsic (rather than synaptic) adaptational mechanisms. In particular, our network can oscillate only via the synaptic escape mode of Skinner, Kopell, and Marder (1994).
1 Introduction
A reciprocally inhibitory (RI) oscillator, also known as a half-center oscillator, is an oscillatory neuronal circuit that consists of two neurons, each of which exerts an inhibitory effect on the other. (More generally, a circuit could be constructed of two populations of neurons, each population inhibiting the other.) An example of such a circuit is shown in Figure 1A. When the circuit oscillates, one cell fires for a time while the other is inhibited below threshold; then the other cell fires for a time while the first is inhibited, and this cycle repeats itself indefinitely. Such oscillators have been found to be important components in a number of central pattern
Neural Computation 14, 561–581 (2002)
© 2002 Massachusetts Institute of Technology
562
Adam L. Taylor, Garrison W. Cottrell, and William B. Kristan, Jr.
Figure 1: (A) Circuit schematic. Excitatory synapses are represented by bars and inhibitory synapses by circles. (B) Example oscillation. The values of the four state variables of the system, $u_1$, $u_2$, $d_1$, and $d_2$, are shown versus time, $t$. All state variables are unitless, owing to the fact that they are nondimensionalized versions of physical variables. The $u_i$ values are membrane potentials measured relative to the potential at which the synapses are half-activated, this quantity being given in units of the width of the linear regime of the synaptic activation curve. Time is in units of the membrane time constant of the cells. The parameters for the simulation are $W = 16$, $\beta = 9$, $\tau = 16$, with $\sigma(x) = \mathrm{logistic}(4x) = 1/(1 + e^{-4x})$. See the text for an explanation of the intervals labeled a, b, c, and d.
generators (CPGs), including the leech heartbeat CPG (Calabrese, Nadim, & Olsen, 1995), the Clione swim CPG (Arshavsky et al., 1998), the lobster gastric and pyloric CPGs (Selverston & Moulins, 1987), the leech swim CPG (Brodfuehrer, Debski, O'Gara, & Friesen, 1995; Friesen, 1989), the Xenopus tadpole swim CPG (Roberts, Soffe, & Perrins, 1997), and the lamprey swim CPG (Grillner et al., 1995). A number of mechanisms have been shown to be sufficient to cause an RI circuit to oscillate. Among these are postinhibitory rebound, spike-frequency adaptation, and short-term synaptic depression (Reiss, 1962;
Oscillations with Synaptic Depression
Perkel & Mulloney, 1974). The first two mechanisms are similar, and we will refer to them collectively as intrinsic adaptation. Intrinsic adaptation is a property of the cells themselves, whereas synaptic depression is a property of the connections between them. Without some sort of adaptational mechanism in the circuit, one would not expect it to oscillate. This has in fact been proven in one natural formalization of a nonadaptational RI circuit (Ermentrout, 1995). A number of authors (Wang & Rinzel, 1992; Skinner, Kopell, & Marder, 1994; LoFaro, Kopell, Marder, & Hooper, 1994; Rowat & Selverston, 1997) have analyzed the behavior of intrinsic adaptation RI circuits by using dynamical systems techniques. This work has revealed a number of interesting properties. For instance, Skinner et al. (1994) found that there were at least four functionally distinct modes of oscillation their model could exhibit: intrinsic release, intrinsic escape, synaptic escape, and synaptic release. Each of these modes displayed a different mechanism of burst termination and a different frequency response to parameter variations. Despite the success of applying dynamical systems methods to intrinsic adaptation RI circuits, we are aware of no previous attempt to apply the same techniques to RI circuits in which oscillations arise solely due to synaptic depression. Synaptic depression is known to play a role in the oscillation of at least one well-studied RI circuit, the leech heart CPG, for some modes of oscillation (Calabrese et al., 1995). There is also evidence of synaptic depression in CPGs that contain RI subcircuits, including the leech swim CPG (Mangan, Cometa, & Friesen, 1994) and the lobster pyloric CPG (Manor, Nadim, Abbott, & Marder, 1997). It has been proposed that synaptic depression acts as a switch in the lobster pyloric CPG, although it is not the ultimate source of oscillation in this system (Nadim, Manor, Kopell, & Marder, 1999).
In addition, there has recently been a great deal of interest in possible functions of synaptic depression in mammalian neocortex (Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997; Galarreta & Hestrin, 1998). In this article, we use dynamical systems methods to analyze an RI circuit in which oscillations arise due to synaptic depression. Using these methods and employing certain approximations, we are able to derive a closed-form expression for the parameter regime in which the model will oscillate. We are also able to derive closed-form expressions for the period, amplitude, and DC offset of these oscillations. We find that these synaptic-depression-mediated oscillations always operate via the synaptic escape mode of Skinner et al. (1994). The work presented here grew out of a larger effort to understand how the leech swim CPG generates oscillations (Taylor, Cottrell, & Kristan, 2000). In our earlier work on this subject, we modeled the leech segmental swim CPG as a set of passive cells with instantaneous synapses between them and connectivity roughly consistent with that determined by experiment. The tentative result of this work was that such a model is able to oscillate
but that even small perturbations of the synaptic strengths destroy this ability. We therefore became interested in the possible role of various cellular and synaptic properties in generating more robust oscillations. One such property, for which there is some evidence in the leech swim CPG (Mangan et al., 1994), is synaptic depression. This mechanism was incorporated in a recent VLSI model of the leech swim CPG (Wolpert, Friesen, & Laffely, 2000). While this was our original motivation, we should say at the outset that the purpose of this article is not to model a known neuronal system (such as the leech swim CPG). Our goal is simply to elucidate some of the behaviors one would expect of an RI circuit that oscillates via synaptic depression. It therefore behooves us to study a simple model of such a system, since little data is available to justify various possible embellishments. Our model makes predictions about the possible modes of oscillation of such a system, about what parameter regimes permit oscillations, and about how those oscillations should change as the parameters change. Thus, we provide a set of hypothetical hallmarks of RI oscillation via synaptic depression. An experimenter could then compare these hallmarks with a system under investigation to evaluate whether it is likely to oscillate via synaptic depression in the way described by our model.

2 Description of the Model
A common simplified model of neuronal dynamics evolves according to the equation

\dot{u}_i = -u_i + \sum_j W_{ij}\, \sigma(u_j) + b_i. \quad (2.1)
There are several ways to derive this equation (or similar equations) from detailed descriptions of neuronal biophysics, all of which involve some simplifying assumptions (Hopfield, 1984; Wilson & Cowan, 1972). All variables in this equation are unitless, because they are nondimensionalized versions of physical variables. u_i corresponds to the membrane potential of cell i, W_{ij} to the maximal postsynaptic current due to the synapse from cell j to cell i (the synaptic strength), and b_i to any constant current being injected into cell i. σ(x) is a sigmoid function as defined by Wilson and Cowan (1972). We add synaptic depression to this basic model by adding a state variable d_j associated with the synapses from cell j. This variable ranges from zero to one; at zero, the synapse is not depressed at all, and at one, the synapse is completely depressed and can exert no influence on the postsynaptic cell. The modified equation for the dynamics of u_i is then

\dot{u}_i = -u_i + \sum_j (1 - d_j)\, W_{ij}\, \sigma(u_j) + b_i. \quad (2.2)
We adopt a form for the dynamics of synaptic depression that is first order and linear:

\tau \dot{d}_i = \tfrac{1}{2}\, \sigma(u_i) - d_i. \quad (2.3)

The parameter τ is the time constant of synaptic depression. This form for the dynamics of synaptic depression is adopted because it is simple and captures the basic qualitative features of the phenomenon (Tsodyks & Markram, 1997). The factor of 1/2 in equation 2.3 ensures that the steady-state synaptic efficacy ((1 − d_j) W_{ij} σ(u_j), where u_j has been held constant long enough for d_j to come to equilibrium) does not decline with increasing u_j. (In fact, we can generalize by replacing the factor 1/2 by a factor r, with 0 < r ≤ 1/2, and the dynamics do not change qualitatively. We assume r = 1/2 here both for simplicity and because it makes the effects of depression large.) We remain agnostic about the specific biophysics of depression: whether it arises from presynaptic vesicle depletion, presynaptic Ca2+ channel inactivation, postsynaptic receptor desensitization, or some other mechanism. By using a single time constant, we are implicitly assuming that the time constants associated with onset and offset of depression are identical, another simplifying assumption. Since we are concerned here only with two-cell networks with symmetric reciprocal inhibition and no self-connections (see Figure 1A), we can rewrite the system equations:

\dot{u}_1 = -u_1 - (1 - d_2)\, W \sigma(u_2) + b \quad (2.4)

\dot{u}_2 = -u_2 - (1 - d_1)\, W \sigma(u_1) + b \quad (2.5)

\tau \dot{d}_1 = \tfrac{1}{2}\, \sigma(u_1) - d_1 \quad (2.6)

\tau \dot{d}_2 = \tfrac{1}{2}\, \sigma(u_2) - d_2. \quad (2.7)
Here W is taken to be nonnegative, since we are interested only in inhibitory networks. Thus, the model has just three parameters: W, b, and τ. For a range of parameter settings, the two-cell system with synaptic depression is capable of stable oscillation, as shown in Figure 1B. The sigmoid function used here and in later examples is σ(x) = logistic(4x) = 1/(1 + e^{−4x}). The factor of four makes the slope at x = 0 equal to one, and thus makes the linear regime approximately one unit wide. The oscillation proceeds in the following manner. Beginning with cell 1 above threshold and cell 2 below (marked as interval a in the figure), cell 1 inhibits cell 2, and the membrane potential of cell 1 remains relatively constant. This inhibition wanes, however, due to a buildup of synaptic depression (d1), so the membrane potential of cell 2 creeps slowly upward. At the same time, the synapse from cell 2 onto cell 1 is recovering (reflected by the decrease in d2) while cell 2 is below threshold. Eventually, cell 2's membrane potential is high enough
to cross threshold and inhibit cell 1 (interval b). Following this quick transition, we arrive at a state in which cell 2 is above threshold and cell 1 is below (interval c). This situation is the mirror image of the original state, with the roles of the cells reversed. Thus, the membrane potential of cell 1 creeps slowly upward as the incoming inhibition wanes, until cell 1 crosses threshold and begins to inhibit cell 2. Again, a switch quickly occurs (interval d), bringing cell 1 above threshold and pushing cell 2 below. This is the original state, and so the process repeats itself, and an oscillation is produced.

3 Analysis of Model Behavior
In this section we analyze the system to understand in detail how the oscillation works. This enables us to derive the range of parameters for which the system oscillates and how the features of that oscillation (e.g., period, amplitude) vary as the parameters are varied. Our analysis relies on two simplifications: (1) synaptic depression changes slowly relative to changes in membrane potential (τ ≫ 1), and (2) the sigmoid function can be approximated as a unit step function (σ(x) ≈ s(x), where s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise). Using these two simplifications will allow us to derive approximate closed-form expressions for the parameter regime in which oscillations are possible and for the period and amplitude of these oscillations. The first allows us to apply singular perturbation theory, considering separately the fast u system of equations 2.4 and 2.5 and the slow d system of equations 2.6 and 2.7. Furthermore, synaptic depression seems to be about 5 to 10 times slower than typical membrane time constants in many systems of interest, among them the leech swim CPG (Mangan et al., 1994; Calabrese et al., 1995; Manor et al., 1997). The second simplification allows us to replace a difficult-to-analyze nonlinear function with an easier-to-analyze piecewise constant function. This also seems to accord roughly with the behavior of known RI circuits. Cells oscillate between a hyperpolarized level at which they have little synaptic output and a highly active level at which they have substantial synaptic output (Friesen, 1989; Calabrese et al., 1995; Manor et al., 1997). Making these two assumptions results in a reduced system that is piecewise linear and thus can be solved exactly.

3.1 The Fast System. Assumption 1 implies that the u variables will come to equilibrium much faster than the d variables and will remain at approximate equilibrium as the d variables slowly evolve. If the u variables
are approximately at equilibrium, then \dot{u}_1 \approx \dot{u}_2 \approx 0. Thus, equations 2.4 and 2.5 become

-u_1 - (1 - d_2)\, W \sigma(u_2) + b \approx 0 \quad (3.1)

-u_2 - (1 - d_1)\, W \sigma(u_1) + b \approx 0. \quad (3.2)

We would like to solve these equations for u_1 and u_2 given d_1 and d_2, but the nonlinearity makes it impossible to do this exactly, so we invoke assumption 2, replacing the sigmoid with a unit step function. Thus equations 3.1 and 3.2 become

-u_1 - (1 - d_2)\, W s(u_2) + b \approx 0 \quad (3.3)

-u_2 - (1 - d_1)\, W s(u_1) + b \approx 0. \quad (3.4)
Since the step function can take on only the values 0 or 1, we can break down equations 3.3 and 3.4 into four separate cases, corresponding to the four possible values of the pair (s(u_1), s(u_2)). In each of these cases we can then solve exactly for the u equilibrium.

Case 00. s(u_1) = 0, s(u_2) = 0. In this case, u_1 < 0 and u_2 < 0, and we have

u_1 = b < 0 \quad (3.5)

u_2 = b < 0. \quad (3.6)

Here the inequalities imply that this equilibrium of the u system exists only if b < 0. This makes intuitive sense: b < 0 corresponds to the situation in which the tonic input is insufficient to elevate the cells' membrane potential above synaptic threshold.

Case 01. s(u_1) = 0, s(u_2) = 1. In this case, u_1 < 0 and u_2 > 0, and we have

u_1 = b - (1 - d_2)\, W < 0 \quad (3.7)

u_2 = b > 0. \quad (3.8)

Note that the inequalities imply that this solution exists only if b > 0, and then only if the values of d_2, b, and W satisfy the inequality in equation 3.7. We can reexpress this inequality as d_2 < 1 − b/W. Again, this makes intuitive sense. Cell 2 is above threshold and cell 1 below, which can happen only if the tonic input is sufficiently excitatory and the inhibition onto cell 1 is not too depressed, that is, only if d_2 is below some specified value. Note that d_2 must lie in the interval [0, 1/2].
Case 10. s(u_1) = 1, s(u_2) = 0. This case is just the mirror image of the 01 case. In this case, u_1 > 0 and u_2 < 0, and we have

u_1 = b > 0 \quad (3.9)

u_2 = b - (1 - d_1)\, W < 0. \quad (3.10)

As in the above case, the inequality in equation 3.10 can be reexpressed as d_1 < 1 − b/W. Again, this implies that b > 0.

Case 11. s(u_1) = 1, s(u_2) = 1. In this case, u_1 > 0 and u_2 > 0, and we have

u_1 = b - (1 - d_2)\, W > 0 \quad (3.11)

u_2 = b - (1 - d_1)\, W > 0. \quad (3.12)

As in cases 01 and 10, the inequalities can be reexpressed as

d_1 > 1 - b/W \quad (3.13)

d_2 > 1 - b/W. \quad (3.14)
Intuitively, this case corresponds to both synapses being so depressed that neither cell is able to inhibit the other below threshold, so both cells are active. And again, it should be noted that the inequalities in this case imply that b > 0. In addition to deriving solutions for the equilibria of the u system, we have also derived the conditions under which each exists in terms of W, b, d_1, and d_2. Furthermore, for each case above, there is a unique u equilibrium, so below we will often speak, for instance, of the 01 equilibrium or the 11 equilibrium. The regions in the (d_1, d_2) plane where each equilibrium is possible are shown in Figure 2. Figure 2A shows the rather dull situation that results when b < 0. In this case, only the 00 equilibrium is possible; both cells are inactive because they receive no tonic excitation. For b > 0, as shown in Figures 2B and 2C, the picture is more interesting. The lines d_1 = 1 − b/W and d_2 = 1 − b/W divide the plane into four quadrants, each of which allows for a different set of u equilibria, from the case-by-case analysis above. Only one equilibrium is possible in three of these quadrants, but in the fourth (the lower left one), two equilibria are possible: 01 and 10. This is because d_1 < 1 − b/W defines the region of possible (d_1, d_2) values for the 10 equilibrium (we will call points in this region viable and the region as a whole the viable region for that equilibrium), and d_2 < 1 − b/W defines the viable region for the 01 equilibrium, and the lower left quadrant in Figure 2B is simply the overlap of these two regions. Figure 2B represents the case where b/W > 1/2, and thus 1 − b/W < 1/2. In this case, the lines d_1 = 1 − b/W and d_2 = 1 − b/W fall within the square
Figure 2: Regions of the (d_1, d_2) plane for which the various u equilibria exist. Each panel shows the (d_1, d_2) plane for a different parameter regime. The dashed lines in each panel indicate the square to which (d_1, d_2) is constrained. (A) b < 0. Only u equilibrium 00 is possible. (B) b > 0 and b/W > 1/2. There are three possible u equilibria. (C) b > 0 and b/W < 1/2. Similar to B, except that the lines d_1 = 1 − b/W and d_2 = 1 − b/W are now outside the square of possible (d_1, d_2) values.
of possible (d_1, d_2) values (from equation 2.3, each of the d_i's must lie in the interval [0, 1/2]). If b/W < 1/2 (Figure 2C), these lines fall outside the square, and the system no longer has any regions with only one possible u equilibrium. The system can be in either the 01 or 10 equilibrium, regardless of (d_1, d_2), and it must be in one of these two equilibria. This has important consequences for the overall behavior of the system, as will become clear when we consider the behavior of the slow d system.

3.2 The Slow System. Now that we have characterized the equilibria of the fast u system for given d_1 and d_2, we would like to describe how the slow d variables evolve, given that the u system is maintained at equilibrium. In equations 2.6 and 2.7, u_1 and u_2 appear only via σ(u_1) and σ(u_2). Utilizing assumption 2, we can write the slow dynamics as

\tau \dot{d}_1 = \tfrac{1}{2}\, s(u_1) - d_1 \quad (3.15)

\tau \dot{d}_2 = \tfrac{1}{2}\, s(u_2) - d_2. \quad (3.16)
Each of the u equilibria discussed above corresponds to one of the four possible values of (s(u_1), s(u_2)). Which u equilibrium the system currently occupies will determine the dynamics of the d system. Because of this, we can draw a separate phase portrait in the (d_1, d_2) plane for each of these equilibria. This is done in Figure 3. In general, the d dynamics have a single fixed point, which is stable, at (s(u_1)/2, s(u_2)/2). However, for u equilibria 01 and 10, this point may or may not be viable, depending on whether b/W > 1/2 (i.e., depending on whether Figure 2B or 2C applies). We concentrate here on the case depicted in Figure 2B, for which the d fixed points for the 01 and 10 equilibria are not in the respective viable regions. Accordingly, Figure 3 represents this case.
Figure 3: Viable regions and phase portraits for the three u equilibria possible when b/W > 1/2. (A) 10 equilibrium. (B) 01 equilibrium. (C) 11 equilibrium. Points representing viable states of the system are in white; nonviable points are dark gray. Example trajectories for each u equilibrium are shown. The system as shown contains two attractors: a stable fixed point at s(u_1) = 1, s(u_2) = 1, d_1 = 1/2, d_2 = 1/2 (C) and a stable limit cycle comprising line segments along the line d_1 + d_2 = 1/2 for s(u_1) = 1, s(u_2) = 0 and s(u_1) = 0, s(u_2) = 1 (A and B). The points in the limit cycle where the u equilibrium changes are indicated by the arcs connecting the 10 and 01 planes (A and B, respectively). The various segments of the limit cycle are also labeled a, b, c, and d, to correspond with the labeled parts of the oscillation in Figure 1B. The light gray triangles represent regions for which the system will not oscillate, but rather will eventually transition to the 11 equilibrium.
For b > 0 and b/W > 1/2, the d fixed points for the 01 and 10 equilibria are not in the viable region. In this case, the d system will move toward the fixed point until it hits the edge of the viable region, where the u equilibrium the system has been tracking will disappear. At this point, (u_1, u_2) will quickly (i.e., on the timescale of the fast system) change to some other equilibrium of the fast u system. Note that at all points where the d system can cross from a viable region to a nonviable region, there is only one u equilibrium to which the fast system can settle after the transition. Thus, the posttransition u equilibrium is uniquely determined in all cases. Graphically, the current value of (d_1, d_2)
will move within the white region of one of the panels of Figure 3 until it hits a viable-nonviable (white-dark gray) border, at which point it will jump to the same (d_1, d_2) point on another panel, the destination panel being the one for which the current value of (d_1, d_2) is in the viable region. Intuitively, the jumps happen when one of the synapses depresses enough (or recovers enough) that the system as a whole can no longer support the u equilibrium the system was previously in. For instance, if the system is in the 10 equilibrium, but (d_1, d_2) crosses the line d_1 = 1 − b/W, d_1 will now be too large for cell 1 to inhibit cell 2 below threshold, and the system will quickly jump to a state with cell 2 above threshold.

3.3 Attractors. Since the reduced dynamics of (d_1, d_2) are piecewise linear, they can be solved exactly. The general solution (including transients) is cumbersome, however, and does not yield additional insight. Instead of presenting the general solution, we describe the various steady-state solutions of the system. It should be noted that these solutions are exact for the reduced system only and are only approximate solutions of the full system. We assess how good these approximations are in section 4.
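The reduced dynamics just described (u slaved to its step-function equilibrium while d evolves by equations 3.15 and 3.16, with jumps at the viability borders) can be simulated directly. A sketch (the jump bookkeeping and initial conditions are our own choices):

```python
def simulate_reduced(W=16.0, b=9.0, tau=16.0, dt=0.01, t_max=400.0):
    """Reduced system: u sits in its step-function equilibrium, labeled by
    (s1, s2), while d1, d2 follow equations 3.15-3.16; the label jumps
    when the tracked equilibrium leaves its viable region."""
    thr = 1.0 - b / W                 # viability border d_i = 1 - b/W
    s1, s2 = 1, 0                     # start in the 10 equilibrium
    d1, d2 = 0.30, 0.05
    trace = []
    for _ in range(int(t_max / dt)):
        if (s1, s2) == (1, 0) and d1 > thr:
            s1, s2 = (1, 1) if d2 > thr else (0, 1)
        elif (s1, s2) == (0, 1) and d2 > thr:
            s1, s2 = (1, 1) if d1 > thr else (1, 0)
        d1 += dt * (0.5 * s1 - d1) / tau
        d2 += dt * (0.5 * s2 - d2) / tau
        trace.append((s1, s2, d1, d2))
    return trace
```

For W = 16, b = 9, τ = 16 the label alternates between (1, 0) and (0, 1) without ever falling into the 11 state, and d_1 + d_2 relaxes onto the line d_1 + d_2 = 1/2, consistent with the limit-cycle analysis.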
3.3.1 Limit Cycles. For certain values of the parameters, the system is capable of oscillation, and the mechanism of oscillation can be visualized in Figure 3. For the 10 equilibrium (see Figure 3A), the dynamics of (d_1, d_2) have an attracting fixed point at (1/2, 0). Therefore (d_1, d_2) moves toward this point but is unable to reach it because it is in the nonviable region. Upon reaching the line d_1 = 1 − b/W, (s(u_1), s(u_2)) switches to the value (0, 1). (Actually, if d_2 > 1 − b/W at the transition, (s(u_1), s(u_2)) switches instead to (1, 1), and no oscillation occurs. We discuss this possibility below, but for now assume that d_2 < 1 − b/W at the transition.) For (s(u_1), s(u_2)) = (0, 1) (see Figure 3B), the dynamics of (d_1, d_2) have an attracting fixed point at (0, 1/2), and (d_1, d_2) begins to move toward this point. Once again the nonviable region intervenes, causing a switch back to (s(u_1), s(u_2)) = (1, 0). This cycle then repeats itself. The fact that this cyclic behavior approaches a stable limit cycle can be seen by considering the quantity d_1 + d_2. Adding equation 3.15 to equation 3.16, we have that

\tau \frac{d}{dt}(d_1 + d_2) = \tfrac{1}{2}\left[ s(u_1) + s(u_2) \right] - (d_1 + d_2). \quad (3.17)

Since s(u_1) + s(u_2) = 1 for both (s(u_1), s(u_2)) = (1, 0) and (s(u_1), s(u_2)) = (0, 1), the sum d_1 + d_2 monotonically approaches 1/2 as the system switches back and forth between the u equilibria, and thus the system approaches a stable limit cycle along the line d_1 + d_2 = 1/2. This limit cycle is shown in
Figures 3A and 3B. The various phases of the limit cycle are labeled a, b, c, and d in the figure to correspond with the same labels in Figure 1. Since the reduced dynamics are linear between the jump points, we can write an exact expression for the steady-state oscillatory solution:

d_1(t) = \begin{cases} d_{hi}\, e^{-\tilde{t}/\tau}, & \tilde{t} \le T/2 \\ 1/2 + (d_{lo} - 1/2)\, e^{-(\tilde{t} - T/2)/\tau}, & \tilde{t} > T/2 \end{cases} \quad (3.18)

d_2(t) = \begin{cases} 1/2 + (d_{lo} - 1/2)\, e^{-\tilde{t}/\tau}, & \tilde{t} \le T/2 \\ d_{hi}\, e^{-(\tilde{t} - T/2)/\tau}, & \tilde{t} > T/2 \end{cases} \quad (3.19)

where \tilde{t} = mod(t, T), d_{hi} = 1 − b/W, d_{lo} = 1/2 − d_{hi} = b/W − 1/2, and T is the period of the oscillation, given below. This is the steady-state solution for the initial conditions d_1(0) = d_{hi} and d_2(0) = d_{lo}, but any initial conditions such that d_1 + d_2 = 1/2 will also yield a steady-state oscillation, which will simply be a phase-shifted version of the above. Essentially, the d_i's exponentially decay from d_{hi} to d_{lo} and back again, in antiphase to one another. The period of the oscillations is twice the time it takes (d_1, d_2) to go from one jump point to the other along the line d_1 + d_2 = 1/2; thus, we can show that the period is equal to

T = -2\tau \ln\left[ \frac{1}{2(1 - b/W)} - 1 \right]. \quad (3.20)
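Equation 3.20 is equivalent to the statement that half a period of exponential decay carries each d_i from d_hi down to d_lo, that is, e^{−T/(2τ)} = d_lo/d_hi. A quick numerical check of this identity (the function name is ours):

```python
import math

def period(W, b, tau):
    """Limit-cycle period, equation 3.20; valid for 1/2 < b/W < 3/4."""
    return -2.0 * tau * math.log(1.0 / (2.0 * (1.0 - b / W)) - 1.0)
```

The period grows without bound as b/W approaches 1/2 from above and shrinks to zero as b/W approaches 3/4, where the two jump points merge.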
From equations 3.18 and 3.19 (or by examining the geometry of the limit cycle, as shown in Figures 3A and 3B), we can derive the amplitude (defined here as the difference between the maximum and minimum values) and time average of the oscillations of the d variables:

A_d = \frac{3}{2} - 2\,\frac{b}{W} \quad (3.21)

\langle d \rangle = \frac{1}{4}. \quad (3.22)

Similarly, we can derive these quantities for the u variables:

A_u = \frac{3}{2}\,W - b \quad (3.23)

\langle u \rangle = b - \frac{1}{4}\,W + \frac{\tau}{T}\,(b - W)\left[ 1 - \exp\left( -\frac{T}{2\tau} \right) \right]. \quad (3.24)
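Equations 3.21 to 3.24 in executable form (the function names are ours; the constant average ⟨d⟩ = 1/4 needs no code):

```python
import math

def d_amplitude(W, b):
    """Equation 3.21: peak-to-peak swing of each d variable."""
    return 1.5 - 2.0 * b / W

def u_amplitude(W, b):
    """Equation 3.23: peak-to-peak swing of each u variable."""
    return 1.5 * W - b

def u_mean(W, b, tau):
    """Equation 3.24: time average of u over one period."""
    T = -2.0 * tau * math.log(1.0 / (2.0 * (1.0 - b / W)) - 1.0)
    return b - 0.25 * W + (tau / T) * (b - W) * (1.0 - math.exp(-T / (2.0 * tau)))
```

As a consistency check, d_amplitude equals d_hi − d_lo = (1 − b/W) − (b/W − 1/2), and u_mean lies between the extremes b and 2b − 3W/2 of the u oscillation.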
3.3.2 Equilibrium Points. In addition to the stable limit cycle, the system also allows for stable fixed points. A stable fixed point is located at (d_1, d_2) = (1/2, 1/2) for the 11 equilibrium (see Figure 3C). This corresponds to the situation in which both synapses are maximally depressed, and neither synapse is able to inhibit the postsynaptic cell below threshold. In fact, all states for which the u system is at the 11 equilibrium are in the
basin of attraction for this fixed point, so if both d_1 and d_2 are greater than 1 − b/W at any point in time, the system settles to this fixed point. This reflects the fact that if both synapses get so depressed that neither cell is able to inhibit the other below threshold, then both synapses can only get increasingly depressed as time goes on, and hence approach a state of maximal depression, represented by the fixed point. Points in the other u equilibria that are also in the basin of attraction for this fixed point are shown in light gray in Figures 3A and 3B. All of the above applies to the b/W > 1/2 case. On the other hand, if 0 < b/W < 1/2, as in Figure 2C, neither the limit cycle nor the described stable fixed point exists. In this case, the d equilibria in Figures 3A and 3B are in the viable region and are therefore equilibria of the full system. The system behaves very much like an RI circuit without synaptic depression. It possesses two mirror-symmetric stable states: one with cell 1 above threshold and cell 2 below and one with cell 2 above and cell 1 below.

3.4 Oscillatory Regime. The size of the viable region for the 11 equilibrium changes as a function of the parameters, since its borders are defined by the lines d_1 = 1 − b/W and d_2 = 1 − b/W. In contrast, the limit cycle (when it exists) is always on the line d_1 + d_2 = 1/2. As a consequence, the limit cycle exists only for parameter values such that the viable region of the 11 equilibrium does not overlap the line d_1 + d_2 = 1/2. This leads to the constraint that 1 − b/W > 1/4 (or, equivalently, b/W < 3/4) in order for a stable limit cycle to exist, since the corner of the viable region for the 11 equilibrium is at (d_1, d_2) = (1 − b/W, 1 − b/W), and this point just intersects the limit cycle when 1 − b/W = 1/4.
Combined with the fact that b must also satisfy b/W > 1/2 for the limit cycle to exist, we have the following constraint that must be satisfied for the system to oscillate:

\frac{1}{2} < \frac{b}{W} < \frac{3}{4}.
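The oscillation condition can be packaged as a one-line predicate (our own wrapper, assuming W > 0):

```python
def oscillates(W, b):
    """Stable limit cycle exists iff 1/2 < b/W < 3/4 (W > 0 assumed)."""
    return 0.5 < b / W < 0.75
```

Below the lower bound the circuit is bistable rather than oscillatory; above the upper bound the 11 fixed point swallows the would-be limit cycle.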
\Delta w \propto \psi_\pm(\Delta t_{syn}) = \begin{cases} \dfrac{c}{a}\, |\Delta t_{syn}|\, e^{-(\Delta t_{syn})^2/(2a^2)}, & \Delta t_{syn} < 0 \\ -\dfrac{c}{b}\, |\Delta t_{syn}|\, e^{-(\Delta t_{syn})^2/(2b^2)}, & \Delta t_{syn} \ge 0, \end{cases} \quad (2.2)
with appropriate constants a, b, c > 0 (see Figure 2a). The form of ψ_± is such that the weight is maximally upregulated if the presynaptic spike precedes the postsynaptic spike by a ms, and it is maximally downregulated if the presynaptic spike follows the postsynaptic spike by b ms. The normalization is such that the absolute values of the peaks of the LTP and LTD branches are independent of their widths a and b, respectively. We assume that the proportionality factor in equation 2.2 is the actual weight itself, Δw = w ψ_±(Δt). If w expresses the number of quantal release sites, for
Activity-Dependent Development of Axonal and Dendritic Delays
instance, the factor w states that each site separately undergoes long-term modification in that it vanishes or forms another nearby release site. The same factor emerges if one deduces equation 2.2 from kinetic schemes governing the involved neurotransmitter receptors and secondary messengers (Senn, Tsodyks, & Markram, 2001). To avoid a blowing up of the synaptic strengths, our analysis will use the intrinsic normalization property of the learning rule, which is obtained when the downregulation dominates the upregulation (b > a; see equation 3.5). Next, we consider slow stochastic fluctuations in the axonal delays and dendritic latencies. We observe that according to equation 2.1, it is reasonable to group together those delay lines with the same relative delay δ. This is possible since only the time difference Δt_syn = Δt + δ of the signals measured at the subsynaptic site enters the synaptic dynamics of equation 2.2. We may therefore parameterize the synaptic weights by the relative delay, w(t) = w(δ, t). Due to the linear scaling factor occurring in the rule, these lumped synaptic weights are adapted according to Δw(δ, t) = w(δ, t) ψ_±(Δt). Similarly, stochastic changes in the axonal and dendritic delays are relevant for the synaptic dynamics only insofar as they change the relative delay δ. For an individual pathway, this relative delay may be shortened or extended by changing the axonal delay, D_ax, the backward dendritic delay, D^b_den, or both. To take account of these fluctuations, we assume that for each delay line, the relative delay δ may change during a time interval T by ±Δδ, each direction with a small probability ε ≥ 0. With probability 1 − 2ε, the delay of the synaptic pathway does not change. On the level of the lumped weights w(δ, t), this dynamics can be expressed by

w(\delta, t) = \varepsilon\, w(\delta - \Delta\delta, t - T) + (1 - 2\varepsilon)\, w(\delta, t - T) + \varepsilon\, w(\delta + \Delta\delta, t - T). \quad (2.3)

If the timescale T of these fluctuations is long compared to the timescale of the synaptic adaptation, the two processes do not interfere, and the dynamics for the lumped weights is the same as that for the individual connections. On a timescale longer than T, we may pass to derivatives; taking the limit Δδ, T → 0, we obtain after simple algebraic manipulations

\frac{\partial w(\delta, t)}{\partial t} = \varepsilon\, \frac{\partial^2 w(\delta, t)}{\partial \delta^2},

provided that lim_{Δδ,T→0} (Δδ)²/T = 1. Thus, the slow, unbiased stochastic fluctuation introduces a diffusion of the synaptic weights along the relative delays. The general scenario we consider is that neuron B fires, with some jitter, |Δt| after neuron A (note that in this case Δt < 0). The firing time t_B depends on additional input from a third population onto B (as, e.g., forward input from the LGN; see Figure 1b) but may also be influenced by the firing of neuron A itself. The basic mechanism is that rule 2.2 selects delay lines for which the pre- and postsynaptic signal time difference at the subsynaptic site equals the maximum of the learning function, Δt_syn = Δt + δ = −a. The
Walter Senn, Martin Schneider, and Berthold Ruf
stochastic process, equation 2.3, of building new delay lines (i.e., of assigning positive weights to delay lines with previously vanishing weights) ensures that after a while, there is always a delay line with an appropriate relative delay δ and nonvanishing w. If the LTD branch of the learning function dominates the LTP branch, b > a in equation 2.2, delay lines that do not meet the above condition will be suppressed. If, in addition, we introduce synaptic transmission failures, those delay lines that do satisfy Δt + δ = −a, but do not support the postsynaptic spike by virtue of their total forward delay, that is, for which D^f_tot = D_ax + D_syn + D^f_den + t_rise ≠ Δt, will eventually be suppressed as well.
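The learning window of equation 2.2 transcribes directly into code. The sketch below (our own Python rendering, with a, b, c as in the text) checks the normalization claim: the LTP peak, reached at Δt_syn = −a, has height c·e^{−1/2} independent of the width a, and likewise the LTD trough at +b has depth c·e^{−1/2} independent of b:

```python
import math

def learning_window(dt_syn, a, b, c):
    """psi_± of equation 2.2: LTP branch for dt_syn < 0, LTD for dt_syn >= 0."""
    if dt_syn < 0.0:
        return (c / a) * abs(dt_syn) * math.exp(-dt_syn**2 / (2.0 * a**2))
    return -(c / b) * abs(dt_syn) * math.exp(-dt_syn**2 / (2.0 * b**2))
```

Maximizing (c/a)·x·e^{−x²/(2a²)} over x gives x = a and peak value c·e^{−1/2}, which is indeed independent of a.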
3 Weight Modification Induced by Correlated Activity
We assume that the pre- and postsynaptic neurons are firing with some instantaneous Poisson rates f_pre and f_post, respectively, which are themselves correlated to some degree. Let us consider a delay line with relative delay δ as defined in equation 2.1, and let us assume that the synaptic strength is modified according to equation 2.2. The change in w accumulated in the time interval T (≫ a, b) preceding the current time t can then be approximated by

\Delta w(t) \approx w(\bar{t}) \int_{t-T}^{t} dt' \int_{t-t'-T}^{t-t'} d\tau\, \psi_\pm(\tau + \delta)\, f_{pre}(t' + \tau)\, f_{post}(t'), \quad (3.1)
with some appropriate value tN between t ¡ T and t. Assuming that the individual weight changes in equation 2.2 are small, we obtain, on a timescale that is large with respect to T, the differential equation Z 1 4w ( t ) dw (t) ¼ (3.2) D w ( t) dt y± (t C d) C (t I t) , dt T ¡1 with correlation function Z 1 t C (t I t) D dt0 fpre ( t0 C t ) fpost ( t0 ) . T t¡T
(3.3)
Observe that the integral domain in equation 3.2 can be extended to innity since T is larger than the width of y± . Note further that in equations 3.1 and 3.2, we assume that the decay of the synaptic memory (i.e., of w) is on an even longer timescale, so that all the changes are accumulated without degradation. Next, we require that the instantaneous pre- and postsynaptic ring rates are correlated in the sense that there is a tendency for the postsynaptic neuron to re around 4 t § & ms after or before a presynaptic spike (after for 4t < 0, before for 4t > 0). Thus, we assume a correlation function of the form C (t I t) D cG (4 t, &, t ) C f pre f post (t ) ,
(3.4)
Activity-Dependent Development of Axonal and Dendritic Delays
593
with a constant presynaptic mean frequency $\bar f_{\rm pre}$; a postsynaptic mean frequency $\bar f_{\rm post}(t)$, which may depend on the synaptic weight w(t) but is not correlated with the presynaptic spikes; a spike correlation factor c ≥ 0; and $G(\Delta t, \sigma; \tau) = (2\pi\sigma^2)^{-1/2}\exp\!\big(-\tfrac{(\tau-\Delta t)^2}{2\sigma^2}\big)$ being a gaussian distribution around Δt with standard deviation σ. For the specific choices of the learning function 2.2 and the correlation function 3.4, the integral in equation 3.2 can be analytically calculated. One obtains

$$\frac{dw(t)}{dt} = w(t)\left(c\,\psi_\sigma(\Delta t + \delta) - \bar\psi\,\bar f_{\rm pre}\,\bar f_{\rm post}(t)\right), \qquad (3.5)$$

where $\psi_\sigma(\Delta t) = \int_{-\infty}^{\infty}\psi(\tau)\,G(\Delta t,\sigma;\tau)\,d\tau$ is approximately

$$\psi_\sigma(\Delta t) \approx \begin{cases} \dfrac{c\,a^2}{(a^2+\sigma^2)^{3/2}}\;|\Delta t|\;e^{-\frac{(\Delta t)^2}{2(a^2+\sigma^2)}}, & \Delta t < 0, \\[3mm] -\dfrac{c\,b^2}{(b^2+\sigma^2)^{3/2}}\;|\Delta t|\;e^{-\frac{(\Delta t)^2}{2(b^2+\sigma^2)}}, & \Delta t \ge 0, \end{cases} \qquad (3.6)$$

and $\bar\psi \equiv -\int_{-\infty}^{\infty}\psi(\tau)\,d\tau = c\,(b-a)$. Note that for b > a, one has $-\bar\psi < 0$. A negative integral of the learning function was postulated in Abbott and Song (1999) and Kempter et al. (1999) to normalize the synaptic weights and stabilize the postsynaptic potential in a subthreshold regime. There is in fact later experimental evidence that for synaptic connections onto pyramidal cells in layer II of the rat barrel cortex, the width of the LTD window, b, is larger than that of the LTP window, a, and that the overall integral over the learning function is negative (Feldman, 2000). The exact form of $\psi_\sigma$ is a smoothed version of equation 3.6, and the approximation is strict for a = b. The shape of $\psi_\sigma$ is thus again of the original form depicted in Figure 2a, but now with maximum at $\Delta t = -\sqrt{a^2+\sigma^2}$ and minimum at $\Delta t = \sqrt{b^2+\sigma^2}$. Note that for vanishing variance in the spike time differences, σ = 0, one retrieves equation 2.2. The effective learning window for a synapse exposed to gaussian-distributed spike time differences therefore appears downscaled and broadened. To close the feedback loop, we split the uncorrelated postsynaptic firing rate arising in equation 3.4 into a part originating from the specific presynaptic neuron and a constant background rate due to the remaining afferents,

$$\bar f_{\rm post}(t) = \bar f_{\rm pre}\sum_{d} w_d(t) + f^{o}_{\rm post}, \qquad (3.7)$$
where the sum extends over the different synaptic delay lines from the specific presynaptic neuron (with weights $w = w_d$). In writing a linear sum, we assume that the postsynaptic neuron is operating in a linear regime
of its input-output function. We also assumed that the specific presynaptic neuron affects the correlation function 3.4 only through the overall synaptic input, without considering the temporal structure of the interaction (but compare equation 6.1). Inserting equation 3.7 into 3.5 yields

$$\frac{dw(t)}{dt} = w(t)\left[c\,\psi_\sigma(\Delta t + \delta) - \bar\psi\,\bar f_{\rm pre}^{\,2}\sum_{d} w_d(t) - \bar\psi\,\bar f_{\rm pre}\, f^{o}_{\rm post}\right]. \qquad (3.8)$$

Observe that in the presence of noncorrelated spike activity, c = 0 in equation 3.4, the synaptic weight would asymptotically decay toward zero due to the negative overall integral over the learning function. Furthermore, the synaptic strength cannot grow to infinity, due to the negative feedback loop represented by the second term in the brackets. Finally, we consider a family of delay lines whose weights may stochastically fluctuate among nearest neighbors according to equation 2.3. In the limit of a continuum of delay lines, the fluctuations result in a diffusion term that adds to equation 3.8,

$$\frac{dw(\delta,t)}{dt} = w(\delta,t)\left[c\,\psi_\sigma(\Delta t + \delta) - \bar\psi\,\bar f_{\rm pre}^{\,2}\int_{-\infty}^{\infty} w(\delta',t)\,d\delta' - \bar\psi\,\bar f_{\rm pre}\, f^{o}_{\rm post}\right] + \epsilon^2\,\frac{\partial^2 w(\delta,t)}{\partial \delta^2}. \qquad (3.9)$$

To reveal the structure of this population equation for the synaptic weights, we rewrite it in the form

$$\frac{dw}{dt} = w\left[\psi_\sigma(\Delta t + \delta) - c_1\langle w\rangle - c_2\right] + \epsilon^2\,\frac{\partial^2 w}{\partial \delta^2}, \qquad (3.10)$$

with $\langle w\rangle = \int_{-\infty}^{\infty} w(\delta,t)\,d\delta$ and constants $c_1 = \bar\psi\,\bar f_{\rm pre}^{\,2}$ and $c_2 = \bar\psi\,\bar f_{\rm pre}\, f^{o}_{\rm post}$. The correlation factor c was absorbed into the learning function $\psi_\sigma$.

4 Induced Delay Shift for Fixed Spike Correlations
Let us consider the supervised learning scenario by imposing the stationary spike statistics, equation 3.4, between the pre- and postsynaptic neuron. Thus, besides some uncorrelated component, the time differences of the signals at the synapse with delay δ are gaussian distributed around Δt + δ with standard deviation σ. We are interested in the dynamics of the relative delays δ, which are implied by the weight modification 3.9. To gain insight into this dynamics, let us first neglect the stochastic renewal process and consider the synaptic modification in the form 3.10 with ε = 0. Due to the normalization term $-c_1\langle w\rangle$ in equation 3.10, the delay adaptation is then purely governed by a selection process, selecting those
connections that experience the strongest potentiation. (To prove this, we first observe that due to the negative feedback induced by the uncorrelated spikes (the term $-c_1\langle w\rangle$) and the saturation toward zero (the factor w), the weights w(δ, t) always converge to a steady state. In this limit, either w(δ, t) = 0 or $\psi_\sigma(\Delta t + \delta) - c_1\langle w\rangle - c_2 = 0$. Since $-c_1\langle w\rangle - c_2$ is independent of δ, there are only discrete values of δ satisfying the latter equation. Now let us assume that the delay line $\delta_a$ with $\Delta t + \delta_a = -a$, representing the maximum of $\psi_\sigma$, initially has a positive weight, $w(\delta_a, 0) > 0$. Since according to equation 3.10 weights with $\psi_\sigma(\Delta t + \delta) - c_1\langle w\rangle - c_2 > 0$ are potentiated while others are depressed, we conclude that in the steady state, the delay line $\delta_a$ survived. Following the reasoning above, we then must have $\psi_\sigma(\Delta t + \delta_a) - c_1\langle w\rangle - c_2 = 0$, while for all other delay lines, w(δ, t) = 0. In general, exactly those delay lines survive for which, among the initial weight distribution, the value $\psi_\sigma(\Delta t + \delta)$ is maximal (and positive).) If we next consider stochastic fluctuations of the delays, ε > 0 in equation 3.10, then the delays may shift beyond their initial regime, and delays can finally be selected that were not present among the initial configuration. As can be guessed from the analysis, the relative delays δ in general will shift such that the average time difference at the synaptic sites becomes $\Delta t + \bar\delta = -a$. To formalize this statement, we consider the average delay $\bar\delta$ defined by the center of gravity of the δ distribution,

$$\bar\delta(t) = \frac{\int_{-\infty}^{\infty} \delta\, w(\delta,t)\, d\delta}{\int_{-\infty}^{\infty} w(\delta,t)\, d\delta} = \frac{\langle \delta\, w(\delta,t)\rangle}{\langle w(\delta,t)\rangle}, \qquad (4.1)$$

where the brackets represent the integral over δ. Note that $\bar\delta = \bar D_{\rm ax} + D_{\rm syn} - \bar D^{b}_{\rm den}$. The dynamics of $\bar\delta$ is obtained by differentiating equation 4.1 with respect to time,

$$\dot{\bar\delta} = \frac{\langle\delta\dot w\rangle\langle w\rangle - \langle\delta w\rangle\langle\dot w\rangle}{\langle w\rangle^{2}} = \frac{1}{\langle w\rangle}\left(\langle\delta\dot w\rangle - \bar\delta\,\langle\dot w\rangle\right), \qquad (4.2)$$

and inserting for $\dot w$ the weight modification 3.10. As we show in Section A.1, the equation can then be reduced to

$$\frac{d\bar\delta}{dt} \approx s^{2}\,\psi_\sigma'(\Delta t + \bar\delta), \qquad (4.3)$$

where $\psi_\sigma'$ is the derivative of the learning function 3.6. The dynamics 4.3 has exactly two stationary solutions, $\bar\delta_a$ and $\bar\delta_b$, corresponding to the zeros $\Delta t_{\rm syn} = \Delta t + \bar\delta_a = -\sqrt{a^2+\sigma^2}$ and $\Delta t_{\rm syn} = \Delta t + \bar\delta_b = \sqrt{b^2+\sigma^2}$ of $\psi_\sigma'$ (cf. Figure 2a). To investigate the stability, we consider the second derivative of $\psi_\sigma$ at these points and find $\psi_\sigma''(-\sqrt{a^2+\sigma^2}) < 0$ and $\psi_\sigma''(\sqrt{b^2+\sigma^2}) > 0$. We conclude that the delay $\bar\delta_a$ is attracting with domain of attraction $(-\infty, \bar\delta_b)$,
while the delay $\bar\delta_b$ is repulsive. The standard deviation of the delay distribution at the attracting steady state can qualitatively be approximated by

$$s \approx \sqrt[4]{\frac{2\epsilon^2}{\big|\psi_\sigma''\big(-\sqrt{a^2+\sigma^2}\big)\big|}} = \left(\frac{\epsilon^2\sqrt{e}\,(a^2+\sigma^2)^2}{c\,a^2}\right)^{1/4} \qquad (4.4)$$

(see Section A.2). Since s increases steeply for small ε, we conclude from equation 4.3 that small delay fluctuations are enough to make the delay drift. Moreover, due to the term $\psi_\sigma''$, the steady-state weight distribution is broad for learning functions with a flat peak. To summarize, for fixed and nonflat pre- and postsynaptic spike correlations, the synaptic pathways develop such that a presynaptic signal on average meets a postsynaptic signal at the synaptic site with a time difference of $\Delta t_{\rm syn} = -\sqrt{a^2+\sigma^2}$ ms, where a is the peak of the LTP function and σ is the width of the gaussian part in the spike correlation function 3.4. In particular, the average spike time difference Δt can be compensated by appropriate delays $D_{\rm ax}$ and $D^{b}_{\rm den}$ as long as, on average, the presynaptic signals arrive at the synaptic site not later than $\sqrt{b^2+\sigma^2}$ ms after the backpropagated spike (cf. Figure 2). Two assumptions are crucial for these delay adaptations: (1) the small stochastic fluctuations in the delays and (2) the negative integral over the learning function, which leads to a normalization of the synaptic weights through the uncorrelated spikes.

4.1 Simulation Results. To examine the quality of the approximation 4.3, we simulated a stochastic version of equation 3.10 with discrete weight updates and discrete random delay fluctuations after each pairing of pre- and postsynaptic spikes. The spike time differences $\Delta t = t_A - t_B$ were sampled from a gaussian distribution with mean $\overline{\Delta t} = -20$ ms and standard deviation σ = 3 ms, at a sampling rate of $f_{\rm pre} = 20$ Hz. We chose seven delay lines with axonal delays $D_{\rm ax}$ evenly distributed between 9.4 and 11.6 ms and between 17.4 and 18.6 ms, respectively (left and right rectangular curves in Figure 3a). The synaptic delay and the backward dendritic delay were fixed to 1 ms, $D_{\rm syn} = D^{b}_{\rm den} = 1$, so that $\delta = D_{\rm ax}$. After each spike pair, we evaluated the learning function 2.2 and set for the weight change $\Delta w(\delta, t) = w(\delta, t)\,\psi(\Delta t + \delta)$.
The parameters for ψ were a = 5 ms, b = 7 ms, and c = 3.5. The allowed range for the relative delays δ was between $\delta_{\min} = 9$ and $\delta_{\max} = 21$ ms, with mesh width Δd = 0.2 ms. The additional stochastic delay fluctuations were implemented by replacing a synaptic weight w(δ, t) after each presynaptic spike, with probability ε = 0.1, by one of its two neighbors w(δ ± Δd, t); see equation 2.3. Finally, the different weights were reduced after each pairing by $-c_1 w(\delta,t)\langle w\rangle$ with $c_1 = 0.3$ and $\langle w\rangle = \sum_{i=0}^{60} w(\delta_{\min} + i\,\Delta d,\, t)\,\Delta d$. The remaining parameter in equation 3.10 was set to $c_2 = 0$, thus assuming $f^{o}_{\rm post} = 0$.
Figure 3: Axonal delay adaptation with fixed synaptic and dendritic delays (= 1 ms) and with gaussian spike time differences fluctuating around $\overline{\Delta t} = -20$ ms. (a) Evolution of the axonal delays $D_{\rm ax}$ for two different initial distributions (left and right rectangular curves), shown after 120 ms (corresponding to two pairings, circled curves) and after 20 sec (corresponding to 400 pairings, curves with dots), when converged to a gaussian distribution centered at $\bar D_{\rm ax} = 14.2$ ms. (b) Time evolution of the average axonal delay $\bar D_{\rm ax}$ ($= \bar\delta$) corresponding to the simulations in a (blurred lines) and according to the approximation 4.3 (dashed lines). The steady state for $\bar D_{\rm ax}$ is well predicted by equation 4.3, according to which $\bar D_{\rm ax}$ adapts such that eventually the argument of $\psi_\sigma'$, $\overline{\Delta t} + \bar D_{\rm ax}$, is equal to the zero $-\sqrt{a^2+\sigma^2}$ of $\psi_\sigma'$ (which itself corresponds to the peak of $\psi_\sigma$; cf. Figure 2a). In fact, in the steady state, we have $\overline{\Delta t} + \bar D_{\rm ax} = -20 + 14.2 = -5.8$ and $-\sqrt{a^2+\sigma^2} = -5.83$. Without jitter in the spike time differences (σ = 0), the average axonal delay $\bar D_{\rm ax}$ would be roughly 1 ms longer (dot at 100 sec). Other parameters: a = 5 ms and σ = 3 ms.
The simulation was run for 100 seconds of biological time (which is determined by the learning rate c). From both rectangular initial distributions, the weights w(δ, t) converged to a gaussian distribution centered at $\delta = \bar D_{\rm ax} = 14.2$ ms with standard deviation s = 0.8 (see Figure 3a). The time course of $\bar D_{\rm ax}$, obtained by evaluating equation 4.1 with the corresponding weights from the simulation, can be matched with the dynamics 4.3 with s = 0.2 (see Figure 3b). Although, due to the different approximation steps, the two s's do not coincide (the one from equation 4.4, for comparison, would be 0.31), the steady state for the average axonal delay (≈ 14 ms) is well predicted by the dynamics 4.3. Note that the jitter in the spike times shortens the average axonal delay (the dot in Figure 3b).

5 Induced Delay Shift for Variable Spike Correlations

We next consider an unsupervised learning scenario according to which the presynaptic neuron may directly influence the postsynaptic spike time $t_B$,
which in turn affects the synaptic strength from the considered presynaptic neuron. To investigate this feedback loop, we first consider the simplified scenario where the postsynaptic spike is exclusively triggered by the presynaptic input. We further simplify matters by fixing the axonal and synaptic delays to some constant value, say, $D_{\rm ax} = D_{\rm syn} = 0$, and focus on the adaptation of the dendritic delays and latencies. To take account of the dependencies among the different dendritic delays, we parameterize the synapses on the dendritic tree by their distance D from the soma and identify this distance with the forward dendritic delay, $D^{f}_{\rm den} \equiv D$. Since distal synapses induce a longer (and smaller) somatic EPSP, we assume for the rise time a monotonically increasing function of the distance, $t_{\rm rise} = R(D)$. For the backward dendritic delay, we also assume a monotonically increasing function, $D^{b}_{\rm den} = B(D)$. The somatic EPSP induced by synapse D is assumed to have the form

$$E_{R(D)}(t) = \frac{t}{R^2(D)}\; e^{-t/R(D)}, \quad \text{for } t \ge 0, \qquad (5.1)$$

and $E_{R(D)}(t) = 0$ for t < 0 (see the inset of Figure 4b). To obtain the time course of the somatic voltage induced by a single presynaptic spike at time $t_A$, we integrate over all synaptic connections,

$$V(t) = \int_0^{\infty} w(D, t)\, E_{R(D)}\big(t - (t_A + D)\big)\, dD, \qquad (5.2)$$

where w(D, t) denotes the synaptic weight of the connection from neuron A to neuron B with forward dendritic delay D (recall that $D_{\rm ax} = 0$). We further assume that the postsynaptic neuron is sufficiently depolarized such that the input from neuron A alone may push the postsynaptic voltage across some fixed threshold θ. The postsynaptic spike time $t_B$ is then given by the first crossing of θ. If θ is in the upper third of the maximal depolarization, equation 5.2, say (cf. also the inset in Figure 4), and the delay distribution is not too wide, we may roughly estimate

$$t_B \approx t_A + \bar D + \bar R = t_A + \bar D^{\rm lat}_{\rm den}. \qquad (5.3)$$

Here, $\bar D$ and $\bar R$ represent the average forward dendritic delay and the average rise time of the somatic EPSPs, respectively, and $\bar D^{\rm lat}_{\rm den} \equiv \bar D + \bar R$ is the average dendritic latency. The averages are taken with respect to the synaptic weight distribution. For instance, for the average dendritic latency, we have

$$\bar D^{\rm lat}_{\rm den}(t) = \frac{\int_{-\infty}^{\infty} D^{\rm lat}_{\rm den}(D)\, w(D, t)\, dD}{\int_{-\infty}^{\infty} w(D, t)\, dD}, \qquad (5.4)$$
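Equations 5.1 through 5.3 can be checked numerically with a discrete set of synapses. In the sketch below, the delays, weights, and the exact threshold placement are illustrative assumptions; it sums alpha-shaped EPSPs and compares the first threshold crossing with the estimate 5.3.

```python
import numpy as np

def epsp(t, R):
    """Alpha-shaped somatic EPSP of equation 5.1: (t/R^2)*exp(-t/R) for t >= 0."""
    return np.where(t >= 0, (t / R**2) * np.exp(-t / R), 0.0)

t_A = 0.0
D = np.array([4.0, 4.5, 5.0, 5.5, 6.0])   # forward dendritic delays (ms), assumed
w = np.ones_like(D)                        # equal weights, assumed
R = 4.0                                    # constant rise time, as in Section 5.1

t = np.linspace(0.0, 40.0, 40001)
V = sum(wi * epsp(t - (t_A + Di), R) for wi, Di in zip(w, D))  # equation 5.2

theta = 0.9 * V.max()            # threshold in the upper range of the EPSP
t_B = t[np.argmax(V >= theta)]   # first crossing of the threshold

estimate = t_A + (D * w).sum() / w.sum() + R   # equation 5.3: t_A + mean(D) + R
print(t_B, estimate)  # the crossing lies somewhat before the EPSP peak
```

As the inset of Figure 4b suggests, the actual crossing precedes the EPSP culmination, so the estimate 5.3 slightly overestimates $t_B$.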
Figure 4: Adaptation of the dendritic latency, with postsynaptic spikes triggered by the specific presynaptic neuron. (a) Evolution of the weight distribution as a function of the dendritic latency $D^{\rm lat}_{\rm den} = D^{f}_{\rm den} + t_{\rm rise}$, with snapshots of the initial distributions (left and right rectangular curves), after 120 ms (slightly deformed circled curves), and after 60 sec (nearly identical gaussians centered at $\bar D^{\rm lat}_{\rm den} \approx 8$ ms). (b) Time evolution of the average dendritic latency $\bar D^{\rm lat}_{\rm den} = \bar D^{f}_{\rm den} + \bar t_{\rm rise}$ corresponding to the two simulations in a (full lines) and the approximation 5.5 with s = 0.2 (dashed lines). Superimposed is the evolution of the spike time difference $t_B - t_A$ (dashed-dotted lines) for the two simulations in a. The average dendritic latency is implicitly adapted such that in the steady state, the argument of ψ′ is equal to the first zero, $\bar D^{\rm den}_{\rm tot} \equiv \bar D^{\rm lat}_{\rm den} + \bar D^{b}_{\rm den} = a$ (cf. equation 5.5 and Figure 2a). In fact, from $\bar D^{\rm lat}_{\rm den} \approx 8$, $t_{\rm rise} = 4$, and $\bar D^{b}_{\rm den} = \frac{1}{2}\bar D^{f}_{\rm den}$ one calculates $\bar D^{b}_{\rm den} \approx 2$, $\bar D^{\rm den}_{\rm tot} \approx 10$, and this is in the range of a = 10.5 (cf. also Figure 2b). The inset shows the normalized EPSP according to equation 5.1 with the threshold crossing at $\frac{3}{4}\,t_{\rm rise}$ (horizontal and vertical dashed lines).
with $D^{\rm lat}_{\rm den}(D) = D + R(D)$. To estimate the shift in $\bar D^{\rm lat}_{\rm den}$ induced by the synaptic modifications, we have to relate $\bar D^{\rm lat}_{\rm den}$ to the average synaptic time difference $\overline{\Delta t}_{\rm syn}$. From equation 2.1 we get $\Delta t_{\rm syn}(D) = t_A - (t_B + B(D))$. Setting $t_A = 0$ and inserting equation 5.3 yields $\overline{\Delta t}_{\rm syn} = -\bar D^{\rm lat}_{\rm den} - \bar B$, where $\bar B$ is the average backward dendritic delay, defined similarly to equation 5.4. If the synaptic weights w(D, t) are now subject to the dynamics 3.9, with δ replaced by D, we obtain (see Section A.3)

$$\frac{d}{dt}\bar D^{\rm lat}_{\rm den} \approx -s^2\,\gamma\,\psi'\big(-\bar D^{\rm lat}_{\rm den} - \bar B\big), \quad \text{with} \quad \gamma = \frac{B'(\bar D)\,\big(1 + R'(\bar D)\big)}{1 + R'(\bar D) + B'(\bar D)}. \qquad (5.5)$$

Here, s represents the standard deviation of the distribution of the forward dendritic delays D. It is given by equation 4.4 with σ = 0.
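The reduced dynamics 5.5 can be integrated directly. The sketch below uses the Section 5.1 parameters (a = 10.5, constant rise time 4 ms, B(D) = D/2, hence γ = 1/3) together with the same assumed parameterization of the learning function as above; the Euler step and total duration are arbitrary choices, since the sketch only probes the steady state.

```python
import numpy as np

a, c = 10.5, 0.7                 # learning-window parameters of Section 5.1
t_rise = 4.0                     # constant EPSP rise time, so R'(D) = 0
gamma = (0.5 * 1.0) / (1.0 + 0.0 + 0.5)   # eq. 5.5 with B' = 1/2: gamma = 1/3
s = 0.2

def dpsi(t):
    """Derivative of the assumed LTP branch (t < 0) of the learning function
    psi(t) = -(c/a) * t * exp(-t^2 / (2 a^2)); its zero lies at t = -a."""
    return -(c / a) * np.exp(-t**2 / (2 * a**2)) * (1.0 - t**2 / a**2)

# Explicit Euler integration of eq. 5.5 (time units arbitrary in this sketch)
D_lat = 5.0                      # initial average dendritic latency (ms)
dt = 50.0
for _ in range(5000):
    B = 0.5 * (D_lat - t_rise)   # average backward delay, B(D) = D/2
    D_lat += dt * (-(s**2) * gamma * dpsi(-(D_lat + B)))

print(round(D_lat + 0.5 * (D_lat - t_rise), 2))  # D_lat + B converges to a = 10.5
```

The fixed point satisfies $\bar D^{\rm lat}_{\rm den} + \bar B = a$, which with B(D) = D/2 and $t_{\rm rise} = 4$ gives $\bar D^{\rm lat}_{\rm den} \approx 8.3$ ms, in line with the value of about 8 ms seen in Figure 4.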
The unique attracting steady state of equation 5.5 is again given by the zero of ψ′ with ψ″ < 0, that is, by $\overline{\Delta t}_{\rm syn} = -\bar D^{\rm lat}_{\rm den} - \bar B = -a$. Thus, assuming small, unbiased fluctuations in the synaptic positions (encoded by D), together with repeatedly suprathreshold input from the considered presynaptic neuron, the synaptic population slowly shifts toward an average position $\bar D$ with total average dendritic delay $\bar D^{\rm den}_{\rm tot} \equiv \bar D^{\rm lat}_{\rm den} + \bar B \approx a$. Observe that the derivative of B(D) enters as a factor in equation 5.5 and that therefore the postsynaptic delay is adapted only if the backward delay is a nonconstant function of the synaptic position. The reason is that we assume a simultaneous occurrence of the synaptic releases, and if the backward dendritic delay were the same for all synapses, each synapse would see the same local time difference between the forward and backward signal. For B(D) = const, therefore, no shift in the average dendritic latency is expected. Note further that the signs in equations 4.3 and 5.5 are different. This reflects the fact that in order to change from an initial $\Delta t_{\rm syn}$ with $-a < \Delta t_{\rm syn} < 0$ to the final $\Delta t_{\rm syn} = -a$, either the axonal delay has to decrease, as in equation 4.3 (at fixed spike times $t_A$, $t_B$), or the dendritic latency has to increase, as in equation 5.5 (thereby increasing $t_B$).

5.1 Simulation Results. To test the dynamics 5.5 qualitatively for the average dendritic latency $\bar D^{\rm lat}_{\rm den}$, we simulated again the discretized version of equation 3.10. We initially chose eight and seven delay lines with dendritic latencies $D^{\rm lat}_{\rm den}$ equally spaced within 4.2 to 5.6 ms and 10.4 to 11.6 ms, respectively (see Figure 4). The dendritic latency was composed of a constant EPSP rise time $R(D) = t_{\rm rise} = 4$ ms and a corresponding forward dendritic delay D, with $D^{\rm lat}_{\rm den} = D + t_{\rm rise}$. The backward dendritic delays were half the forward ones, $B(D) = \frac{1}{2}D$. We further set σ = 0 and $D_{\rm ax} = D_{\rm syn} = 0$. The presynaptic cell was stimulated with a frequency of 20 Hz, and whenever the sum of EPSPs crossed the threshold θ = 6.5 (cf. equation 5.2), a postsynaptic spike was elicited. The weight changes were calculated by evaluating ψ at $\Delta t_{\rm syn} = t_A - t_B - B(D)$, where $t_A$ and $t_B$ are the corresponding spike times of the pre- and postsynaptic neuron. The parameters in ψ were a = 10.5, b = 14, c = 0.7. To implement the other terms in equation 3.10, we reduced the weights by $-c_1 w(D,t)\langle w\rangle$ with $c_1 = 0.3$ and $c_2 = 0$, and implemented the stochastic delay fluctuations as described in the previous section. The simulation was run for 60 seconds of biological time. From both initial distributions, the weights w(D, t) converged to a gaussian distribution centered at $\bar D = 4.8$ ms (see Figure 4a). The time course of the spike time differences $\Delta t = t_A - t_B$ and of the average dendritic latency $\bar D^{\rm lat}_{\rm den}$ extracted from the simulation are shown in Figure 4b. Superimposed is the evolution of $\bar D^{\rm lat}_{\rm den}(t)$ obtained from the approximated dynamics 5.5. According to the steady state of equation 5.5, the average total dendritic
delay, $\bar D^{\rm den}_{\rm tot} \equiv \bar D^{\rm lat}_{\rm den} + B(\bar D) \approx \bar D + t_{\rm rise} + B(\bar D) \approx 11.2$ ms, is related to the first zero of ψ′, that is, to a = 10.5. The difference arises from the fact that the postsynaptic spike is typically triggered before the EPSP culmination, and the effective dendritic latency is therefore smaller than the one considered (see the inset of Figure 4b). Note that the time course of $\bar D^{\rm lat}_{\rm den}$ is roughly three times slower than that of $\bar D_{\rm ax}$ (see Figure 3b), as predicted by the factor $\gamma = \frac{1}{3}$ obtained from the formula in equation 5.5.

6 Coevolution of Axonal Delays and Dendritic Latencies
In the previous simulations, we fixed either the dendritic or the axonal delay, while only fluctuations in the other delay were considered. Under this restriction, we showed that each of the delays will separately evolve until the optimal timing imposed by the learning function is reached. For fixed dendritic delays and fixed spike correlations, the axonal delays adapt such that the induced EPSPs peak at the time the postsynaptic neuron is expected to fire. For fixed axonal delays, in turn, the dendritic delays are adapted such that they finally match the width of the learning window. Such a self-organization toward unique delays, however, is no longer possible if the fluctuations affect both the axonal and the dendritic delays at the same time. This is because the synaptic strength is changed based only on the local time difference measured at the synaptic site, and this may be the same for different pairs of axonal and dendritic delays. In fact, according to equation 2.1, the local time difference is the same for all axonal and backward dendritic delays with $\delta = D_{\rm ax} - D^{b}_{\rm den} = \text{const}$, and the stability condition $\Delta t_{\rm syn} = \Delta t + \delta = -a$ (see the discussion after equations 4.3 and 5.5) is formally met for all pairs of the form $(D_{\rm ax} + \varepsilon,\, D^{b}_{\rm den} + \varepsilon)$ with ε > 0. The same degeneracy is also present if $D_{\rm ax}$ confluently changes with the width a of the learning function. In either case, synapses are going to be strengthened that do not causally contribute to the postsynaptic spike, although they see the pre- and postsynaptic signal in the optimal timing (i.e., with a delay of a ms). It turns out that the unreliability of the synaptic transmission is the property that prevents an acausal development, by disadvantaging synapses that are only "blind passengers" and do not contribute to the postsynaptic spike.
To reveal this point, we first observe that if the synaptic modification is induced, for example, via activation of the postsynaptic NMDA receptors, a synaptic release, and not just a presynaptic spike, must occur to cause the synaptic modification. Let us now consider two unreliable delay lines with the same release probability $P_{\rm rel} < 1$ and the same relative delay δ, but different axonal and dendritic delays and thus different total forward delays. Let us assume that the EPSP of the first delay line falls preferentially together with a subthreshold depolarization caused by a third input onto the same postsynaptic neuron. In this case, the EPSP from the first delay line is expected to trigger a spike more often than the EPSP from the second. This
implies that compared to the second delay line, the synaptic release from the first line is more strongly correlated with a postsynaptic spike, and since the occurrence of a release is a prerequisite for inducing a synaptic change, the first delay line is more often upregulated. The second delay line, which shows the same number of releases with the same local time difference, is less upregulated, since these releases do not exhibit the strong correlation with an immediately following spike. Additional uncorrelated postsynaptic spikes may downregulate both weights such that only the first delay line survives. To formalize this reasoning in the general case, we reparameterize the synaptic weights according to $w(D_{\rm ax}, D)$, where $D \equiv D^{f}_{\rm den}$ again abbreviates the forward dendritic delay. We consider a postsynaptic priming scenario with additional synaptic input from other neurons onto B roughly Δt after the activity of the presynaptic neuron A. In addition, we now assume that the individual releases may also influence the timing of the postsynaptic spikes in a statistical sense. To simplify matters, we again assume a linear transfer function with threshold at 0 and, to reduce the number of constants, with gain 1. Instead of the correlation between the (instantaneous) pre- and postsynaptic firing rates, it is now the correlation between the presynaptic release rate and the postsynaptic firing rate given a release that determines the synaptic weight change. To describe the weight change of the synapse $(D_{\rm ax}, D)$ formally, we introduce the postsynaptic firing rate $f_{\rm post}(\tau, D_{\rm ax}, D; t)$ conditional on a presynaptic spike at time t + τ and a subsequent release at synapse $(D_{\rm ax}, D)$. In extension of equation 3.7, this conditional postsynaptic firing rate is

$$f_{\rm post}(\tau, D_{\rm ax}, D; t) = \bar f_{\rm pre}\, w(D_{\rm ax}, D; t)\, E_{R(D)}(\tau + D_{\rm ax} + D) + P_{\rm rel}\,\bar f_{\rm pre}\!\!\sum_{D'_{\rm ax} \ne D_{\rm ax},\; D' \ne D}\!\! w(D'_{\rm ax}, D'; t)\, E_{R(D')}(\tau + D'_{\rm ax} + D') + f^{o}_{\rm post}. \qquad (6.1)$$
Note that since we assume a release at synapse $(D_{\rm ax}, D)$, no factor $P_{\rm rel}$ occurs in the first term on the right-hand side. To lighten the notational load, we set $D_{\rm syn} = 0$. The last term $f^{o}_{\rm post}$ represents the contribution of uncorrelated afferents to the postsynaptic firing rate. By considering the effect of the individual spikes on the instantaneous postsynaptic rate, we now obtain an additional dependency of the spike correlation function 3.3 on the individual synapses and on the relative presynaptic spike times τ. Substituting $f_{\rm pre}(t)$ with $f_{\rm rel}(t)$ and $f_{\rm post}(t)$ with $f_{\rm post}(\tau, D_{\rm ax}, D; t)$ in equation 3.4, we obtain a conditional release–spike correlation function for synapse $(D_{\rm ax}, D)$ of the form

$$C_{(D_{\rm ax}, D)}(\tau; t) = c\, G(\Delta t, \sigma; \tau) + \bar f_{\rm rel}\, \bar f_{\rm post}(\tau, D_{\rm ax}, D; t). \qquad (6.2)$$
Recall that the first term represents the contribution to the correlation function caused by a postsynaptic spike induced by other input. Instead of equations 3.2 and 3.8 we now get

$$\frac{dw(D_{\rm ax}, D; t)}{dt} = w(D_{\rm ax}, D; t)\int_{-\infty}^{\infty} d\tau\; \psi(\tau + \delta)\, C_{(D_{\rm ax}, D)}(\tau; t) \approx w(D_{\rm ax}, D; t)\Big[c\,\psi_\sigma(\Delta t + \delta) + (1 - P_{\rm rel})\,\bar f_{\rm pre}\, w(D_{\rm ax}, D; t)\,\psi_{R(D)}(-D^{\rm den}_{\rm tot}) + \bar f_{\rm rel}\sum_{D'_{\rm ax}, D'} w(D'_{\rm ax}, D'; t)\,\psi_{R(D')}(-D'^{\,f}_{\rm tot} + \delta) - \bar\psi\,\bar f_{\rm rel}\, f^{o}_{\rm post}\Big], \qquad (6.3)$$

where $D^{\rm den}_{\rm tot} = D + R(D) + B(D)$ is the sum of the forward dendritic delay plus the EPSP rise time plus the backward dendritic delay, $\delta = D_{\rm ax} - B(D)$, and $D'^{\,f}_{\rm tot} = D'_{\rm ax} + D' + R(D')$ is the total forward delay of another delay line. The ≈ in equation 6.3 comes from the approximation of the EPSPs by gaussian functions of normalized area, with center $\tau + D'^{\,f}_{\rm tot}$ and standard deviation R(D′), so that the integrals reduce to the evaluation of $\psi_{R(D')}$ given by equation 3.6. For $P_{\rm rel} = 1$ we recover in equation 6.3 our problem that the learning rule cannot distinguish between delay lines $(D_{\rm ax}, D)$ with common relative delay δ (since then the bracket in equation 6.3 is independent of $D_{\rm ax}$ and D). If the synapses are unreliable, however, the second term in the brackets becomes positive, and the synaptic weights evolve differently for each delay line. Due to this positive feedback term, delay lines for which the value $\psi_{R(D)}(-D^{\rm den}_{\rm tot})$ is large have an evolutionary advantage over the other delay lines with the same δ. Provided that $\sqrt{a^2 + R(D)^2}$ does not vary too much, the value of $\psi_{R(D)}$ is largest for that delay line satisfying $D^{\rm den}_{\rm tot} \equiv D + R(D) + B(D) \approx \sqrt{a^2 + R(D)^2}$. Note that the analogous condition is satisfied for the attracting steady state of the dynamics 5.5, which we deduced under the assumption of fixed axonal delays. On top, the first term in the brackets favors delay lines for which $\psi_\sigma(\Delta t + \delta)$ is maximal and thus delays satisfying $\Delta t_{\rm syn} \equiv \Delta t + \delta = \Delta t + D_{\rm ax} - B(D) \approx -\sqrt{a^2 + \sigma^2}$. Together with the previous condition, this fixes the optimal axonal delay $D_{\rm ax}$, as well as the optimal forward dendritic delay D, characterizing the delay line. Recall that the latter condition is also satisfied in the attracting steady state of the dynamics 4.3. Note further that combining the two conditions while assuming R(D) ≈ σ yields $D^{f}_{\rm tot} \equiv D_{\rm ax} + D + R(D) \approx -\Delta t$, and that this is also compatible with the third term within the brackets in equation 6.3. Thus, the learning rule prefers a total dendritic delay corresponding to the width of the effective learning function and a total forward delay corresponding to the average spike time difference between the pre- and postsynaptic neurons (see Figure 5).
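The selection argument can be made concrete by evaluating the decisive second term in the brackets of equation 6.3 for two delay lines with identical relative delay δ but different total dendritic delays. The sketch below assumes the same parameterization of the learning window as above and the Section 6.1 values for a, c, $P_{\rm rel}$, and the rise time; the two delay-line configurations are hypothetical examples.

```python
import numpy as np

a, c = 14.0, 0.8          # learning-window parameters of Section 6.1
P_rel = 0.5               # common release probability
R = 3.0                   # EPSP rise time (ms)

def psi_width(t, width):
    """Assumed LTP branch (t < 0) of the effective learning function,
    eq. 3.6 with sigma replaced by the EPSP width."""
    return (c * a**2 / (a**2 + width**2)**1.5) * np.abs(t) * \
           np.exp(-t**2 / (2 * (a**2 + width**2)))

def growth_advantage(D_ax, D):
    """(1 - P_rel) term of eq. 6.3 for a delay line (D_ax, D) with
    B(D) = D/2; larger values mean an evolutionary advantage."""
    D_den_tot = D + R + 0.5 * D    # forward dendritic + rise + backward delay
    return (1 - P_rel) * psi_width(-D_den_tot, R)

# Two delay lines with the same relative delay delta = D_ax - B(D) = 6 ms:
line1 = growth_advantage(D_ax=10.0, D=8.0)   # D_den_tot = 15, near sqrt(a^2+R^2)
line2 = growth_advantage(D_ax=12.0, D=12.0)  # same delta, longer total delay
print(line1 > line2)  # the line with D_den_tot near sqrt(a^2+R^2) ≈ 14.3 wins
```

Both lines see the same local time difference, yet only the line whose total dendritic delay lies near $\sqrt{a^2 + R(D)^2}$ is preferentially upregulated, which is exactly the causal line in the priming scenario.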
Figure 5: Sketch of the joint adaptation of the axonal and dendritic delays in the postsynaptic priming protocol. (a) Configuration of the different delays before learning. (b) Repetitive stimulations of neuron A, together with an increased background firing rate ("priming") of neuron B during the subsequent depolarization |Δt| later, will lead to a drift of the axonal and dendritic delays such that eventually the average total dendritic delay matches the width of the learning function, $\bar D^{\rm den}_{\rm tot} \equiv \bar D^{f}_{\rm den} + t_{\rm rise} + \bar D^{b}_{\rm den} \approx a$, and the average total forward delay matches the time delay of the depolarization, $\bar D^{f}_{\rm tot} \equiv \bar D_{\rm ax} + \bar D^{f}_{\rm den} + t_{\rm rise} \approx |\Delta t|$.
By comparing equation 3.8 with 6.3, we see that the negative feedback loop preventing the synaptic weights from growing to infinity is lost when we consider the spike correlation induced by the presynaptic neuron. The stability property is regained if we consider synaptic projections from other neurons being subject to the same type of synaptic modifications of the weights. In fact, if the spike activity among the presynaptic neurons has an uncorrelated component, an overall increase in the synaptic weights leads to an enhanced component $f^{o}_{post}$, and this establishes a negative feedback loop through the last term in equation 6.3. Recall that a positive $\bar{\psi}_\pm$ is equivalent to a negative integral over the learning function $\psi_\pm$. Alternatively, we may postulate that with each postsynaptic spike, the synaptic weight slightly degrades, independent of any spike timing (Kempter et al., 1999). In this case, we obtain an additional term in the brackets of equation 6.3, which is proportional to $-\bigl(f^{o}_{post} + f_{rel}\sum_{D'_{ax},D'} w(D'_{ax}, D'; t)\bigr)$. Since this term is equal for all delay lines, it acts only as a normalization without distorting the weight distribution and therefore changes neither the average axonal nor the average dendritic delay. Finally, small, unbiased stochastic fluctuations in the axonal and dendritic delays will cause the average delays to move toward the steady state given by the above conditions, $D^{den}_{tot} \approx \sqrt{a^2 + R(\bar{D})^2}$ and $D^f_{tot} \approx -\Delta t$. Similarly to the last term in equation 3.9, these stochastic fluctuations will appear in equation 6.3 as a diffusion term of the form $\epsilon^2 \left( \frac{\partial^2 w}{\partial D_{ax}^2} + \frac{\partial^2 w}{\partial D^2} \right)$.
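The claim that a line-independent decay term acts as a pure normalization can be checked directly: rescaling all weights by a common factor leaves the weight-averaged axonal and dendritic delays untouched. A minimal sketch in Python (the grid ranges and the decay factor are illustrative choices, not values from the text):

```python
import numpy as np

# A decay term that is equal for all delay lines rescales every weight
# by the same factor, so it cannot shift the weight-averaged delays.
rng = np.random.default_rng(0)
d_ax = np.linspace(8.5, 11.5, 16)          # axonal delays (ms), illustrative
d_den = np.linspace(4.5, 7.5, 16)          # forward dendritic delays (ms)
w = rng.random((16, 16))                   # weights w(D_ax, D)

def mean_delays(w, d_ax, d_den):
    """Weight-averaged axonal and dendritic delays."""
    ax = np.sum(w * d_ax[:, None]) / w.sum()
    den = np.sum(w * d_den[None, :]) / w.sum()
    return ax, den

before = mean_delays(w, d_ax, d_den)
w_decayed = w * (1.0 - 0.05)               # uniform, line-independent decay
after = mean_delays(w_decayed, d_ax, d_den)
```

Since the decay multiplies every $w(D_{ax}, D)$ by the same factor, it cancels in any weight-normalized average, which is exactly why it distorts neither delay distribution.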
Activity-Dependent Development of Axonal and Dendritic Delays
6.1 Simulation Results. To test whether the unreliability of the synaptic transmission indeed helps to select a specific configuration among the two-dimensional parameterization of delay lines, we combined the "supervised" and "unsupervised" learning scenarios from the previous sections. We initially distributed 20 delay lines with axonal delays $D_{ax}$ between 8.5 and 11.5 ms and forward dendritic delays $D \equiv D^f_{den}$ between 4.5 and 7.5 ms, and set the corresponding synaptic weights to $w(D_{ax}, D) = 0.4$, while the remaining weights were zero. The backward dendritic delay of each delay line was half the forward dendritic delay, $B(D) \equiv D^b_{den}(D) = \frac{1}{2}D$, and the EPSP rise time was set to $R(D) \equiv t_{rise} = 3$ ms. The common release probability was $P_{rel} = 0.5$, and the threshold of the postsynaptic neuron was set to $\vartheta = 20$ mV. We stimulated the presynaptic neuron with a periodic train of $f_{pre} = 20$ Hz and primed the postsynaptic cell with a 10 mV subthreshold depolarization between 25 and 27 ms after each presynaptic spike (cf. Figure 5). This induces the additional spike correlation described in equation 6.2, with $|\Delta t| = 26$ ms and $\varsigma \approx 1$ ms. Whenever the postsynaptic potential, composed of this depolarization and the sum of the evoked EPSPs (cf. equations 5.1 and 5.2), crossed the threshold, a postsynaptic spike was elicited unless a spontaneous spike had already been triggered within the preceding 5 ms. The postsynaptic cell showed a spontaneous background firing rate $f^{o}_{post} = 3$ Hz, which was increased to an instantaneous Poisson rate of 30 Hz during the 2 ms period of the additional depolarization. For each pair of synaptic release and postsynaptic spike, we modified the corresponding synaptic weight according to $\Delta w = w\,\psi_\pm(\Delta t_{syn})$, with the learning function $\psi_\pm$ given by equation 2.2 and the local time difference $\Delta t_{syn}$ given by equation 2.1. The parameters of $\psi_\pm$ were $a = 14$ ms, $b = 20$ ms, and $c = 0.8$.
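The threshold-crossing step of this protocol can be sketched as follows. The EPSP kernel of equations 5.1 and 5.2 is not reproduced in this section, so an alpha function with the stated 3 ms rise time is assumed here; the 20 mV threshold and the 10 mV depolarization 25-27 ms after the presynaptic spike are taken from the text, while the 12 mV compound-EPSP amplitude and the converged delays (17 ms axonal, 7 ms dendritic; cf. Figure 7) are illustrative:

```python
import numpy as np

def alpha_epsp(t, onset, amplitude, t_rise=3.0):
    """Assumed alpha-function EPSP, peaking t_rise ms after onset."""
    s = np.maximum(t - onset, 0.0) / t_rise
    return amplitude * s * np.exp(1.0 - s)

def first_crossing(t, v, theta=20.0):
    """Time of the first threshold crossing, or None if subthreshold."""
    above = np.nonzero(v >= theta)[0]
    return t[above[0]] if above.size else None

t = np.arange(0.0, 60.0, 0.01)               # ms after a presynaptic spike
# converged delay line: 17 ms axonal + 7 ms forward dendritic -> onset 24 ms,
# so the EPSP peaks at 24 + 3 = 27 ms, inside the depolarization window
v = alpha_epsp(t, onset=24.0, amplitude=12.0)
v += np.where((t >= 25.0) & (t <= 27.0), 10.0, 0.0)  # priming depolarization
t_spike = first_crossing(t, v)               # postsynaptic spike time
```

With these numbers the EPSP alone stays subthreshold; only its coincidence with the priming depolarization drives the membrane across 20 mV, so the evoked spike falls inside the 25-27 ms window.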
To prevent infinite synaptic growth, we reduced each synaptic weight after every postsynaptic spike by 0.01 times $w$ times the sum of all synaptic weights, as discussed at the end of the last paragraph. To simulate the stochastic fluctuations, we assumed a two-dimensional mesh grid of axonal and dendritic delays $(D_{ax}, D)$ with a granularity of 0.2 ms. After each presynaptic spike, we replaced each synaptic weight, with probability 0.1, by 80% of its own weight plus 5% of the weight of each of its four neighbors. The simulation was run for 35 minutes of biological time. Although initially the presynaptic neuron could not contribute to the firing of the postsynaptic cell, it learned to do so after 28 minutes of correlated spike activity (see Figure 6a). After this period, the postsynaptic spikes were always evoked by the EPSPs from the presynaptic cell, which were strengthened by the rule. The connections adapted their total forward delay such that $D^f_{tot} \equiv D_{ax} + D^f_{den} + t_{rise} = 27.0\ \text{ms} \approx -\Delta t = 26$ ms, and this led to a peak of the EPSPs during the period of the additional postsynaptic depolarization (see Figure 6b and the dashed-dotted line in Figure 7a).
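The delay-fluctuation step just described (with probability 0.1, replace a weight by 80% of its own value plus 5% of each of its four neighbors' values on the $(D_{ax}, D)$ mesh) can be sketched as follows; the zero-flux boundary handling is an assumption, since the text does not specify how edges are treated:

```python
import numpy as np

def jitter_delays(w, p=0.1, rng=None):
    """One delay-fluctuation step on the (D_ax, D) weight grid.

    With probability p, each weight is replaced by 80% of its own (old)
    value plus 5% of each of its four neighbours' old values, as in the
    simulation protocol.  Edges are padded with zeros (an assumption).
    """
    if rng is None:
        rng = np.random.default_rng()
    nb = np.zeros_like(w)                 # sum of the four nearest neighbours
    nb[1:, :] += w[:-1, :]
    nb[:-1, :] += w[1:, :]
    nb[:, 1:] += w[:, :-1]
    nb[:, :-1] += w[:, 1:]
    proposal = 0.8 * w + 0.05 * nb
    mask = rng.random(w.shape) < p        # which synapses get redistributed
    return np.where(mask, proposal, w)
```

Because $0.8 + 4 \times 0.05 = 1$, the step conserves total weight in the interior of the grid; it merely diffuses weight to neighboring delay lines, which is the two-dimensional analogue of the $\epsilon^2$ diffusion term.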
Walter Senn, Martin Schneider, and Berthold Ruf
Figure 6: Postsynaptic activity during the priming protocol (cf. Figure 5). (a) Raster plot of the postsynaptic spikes. Only every tenth trial is shown. During the time interval 25-27 ms after a presynaptic spike, a postsynaptic depolarization increased the probability of a spontaneous postsynaptic spike. After 28 minutes of simulated time (corresponding to ≈34,000 pairings), the presynaptic neuron learned to trigger the postsynaptic spike during the period of the postsynaptic depolarization (darker points, top right). (b) Same as in (a), but superposed with the sum of EPSPs induced by the presynaptic neuron. The horizontal lines indicate the period of the additional postsynaptic depolarization. For parameter values, see the text.
The simulation shows that it is possible to learn the peak EPSP implicitly with a precision of almost 1 ms, although the width $a$ of the learning function itself is more than 10 times broader. To explain this fact, we inspect the different dendritic delays after reaching the steady state (see Figure 7a). Toward 30 minutes of simulation time, the total dendritic delay eventually fully covers the learning window, $D^{den}_{tot} \equiv D^f_{den} + t_{rise} + D^b_{den} = 13.5\ \text{ms} \approx \sqrt{a^2 + (t_{rise})^2} = 14.3$ ms, as predicted by the theoretical consideration above (see Figure 7a; cf. the sketch in Figure 5). Independent of the width of the LTP branch, the total forward delay $D^f_{tot}$ adapts toward the typical spike time difference $\Delta t$ and thus eventually supports existing temporal structures in the spike activity, as long as the width $a$ of the learning function can be absorbed by the total dendritic delay $D^{den}_{tot}$.

Finally, the stochastic synaptic transmission ($P_{rel} < 1$) led to a unique average axonal delay $D_{ax} \approx 17$ ms and a unique average forward dendritic delay $D^f_{den} \approx 7$ ms after the application of the stimulation protocol (see Figure 7b). Without transmission failures ($P_{rel} = 1$), an elongated ridge would evolve in the $(D_{ax}, D^f_{den})$-space, corresponding to those delays with fixed synaptic time difference $\Delta t_{syn}$ between the pre- and postsynaptic signals. This shows that unreliable synapses are in fact necessary for a unique development of the axonal and dendritic delays.
Figure 7: Implicit adaptation of the different delays during the postsynaptic priming protocol (cf. Figure 5). (a) The evolution of the mean axonal delay and mean dendritic latency during the simulation shows that the synaptic modification implicitly adjusts the different delays according to equations 4.3 and 5.5, such that eventually the average total forward delay matches the average spike time difference, $D^f_{tot} = D_{ax} + D^{lat}_{den} = |\Delta t| = 26$ ms, and such that the total average dendritic delay matches the width of the learning function, $D^{den}_{tot} = D^{lat}_{den} + D^b_{den} \approx a$ (with $D^{lat}_{den} = D^f_{den} + t_{rise}$). (Lower thick line: $D_{ax}$; dots: $D_{ax} + D^f_{den}$; dashed-dotted line: $D^f_{tot} = D_{ax} + D^f_{den} + t_{rise}$; upper thick line: $D^f_{tot} + D^b_{den}$; upper dots: $D_{ax} + a$; shaded band: time of the subthreshold input.) (b) Distribution of the axonal and forward dendritic delays at the initial configuration (left blur) and after 35 minutes of simulated time (right blur). The gray level codes the synaptic weight corresponding to the different delays. Without transmission failures, the final distribution of the delays would be smeared out between the two dashed lines.
7 Discussion
The work presented here extends the idea of delay selection by showing that slow stochastic fluctuations in the axonal and dendritic delays lead to delay drifts beyond the original delay distributions (cf. Figures 3 and 4). We showed that during correlated pre- and postsynaptic activity, the temporally asymmetric synaptic modification implicitly (1) adjusts the total forward delay, composed of the axonal delay, synaptic delay, forward dendritic delay, and EPSP rise time, until it matches the statistically dominant time difference between the pre- and postsynaptic spikes, and (2) tunes the total dendritic delay, composed of the forward dendritic delay, the EPSP rise time, and the backward dendritic delay, until it fits the width of the effective learning function (cf. Figures 5, 6, and 7). Although the evolution of these delays is perturbed by uncorrelated spontaneous activity, the precision of the delay adaptation may be an order of magnitude finer than the width of the learning function itself. Interestingly, the tight selection of the optimal delays is possible only if the synaptic transmission is unreliable. This unreliability gives the "successful" delay lines an evolutionary advantage over the others. To stabilize the self-organization of the different delays, one must assume that the integral over the learning function is negative. Thus, three properties are important for an activity-dependent development of axonal delays and dendritic latencies through asymmetric weight modifications: slow, unbiased delay fluctuations; unreliable synapses; and LTD dominating LTP.

7.1 Physiological Evidence for the Hypothesized Mechanisms. The shape of the temporally asymmetric learning function (see equation 2.2 and Figure 2) is motivated by several recent experiments (Markram et al., 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Feldman, 2000). These experiments do not allow discriminating among axonal, synaptic, and dendritic delays as defined in equation 2.1. Instead, they provide the change in the synaptic strength as a function of the spike time difference $t_A - t_B$ between the two neurons recorded from. In these experiments, the neurons lay nearby and, in the case of cortical recordings, mostly synapse on each other's proximal dendritic tree. As a consequence, the axonal and dendritic delays appear to be small, and the time difference at which maximal upregulation is observed corresponds roughly to our time difference $a$ defining the peak in the learning function 2.2. Depending on the experimental study, this peak lies between $-25$ and 0 ms. Different sources for the pre- and postsynaptic delays exist. Axonal propagation velocities for horizontal cortico-cortical connections are on the order of 0.2 m per second, corresponding to a propagation time of 50 ms for cells that are 1 cm apart, while the propagation velocities of vertical fibers are roughly 10 times faster (Bringuier et al., 1999).
Considerable delays may also be present postsynaptically. In passive dendritic structures, delays of up to 30 ms and more were calculated for the time differences between the centroids of an injected current, the locally induced EPSP, and the EPSP propagated into the soma (Agmon-Snir & Segev, 1993). The peak-to-peak time from a distal EPSP to the forward-propagated somatic EPSP, and from the somatic AP to the backpropagated dendritic AP, can both reach up to 8 ms (Segev, Rapp, Manor, & Yarom, 1992; Stuart & Sakmann, 1994). In addition to these peak-to-peak delays, the signal is delayed by the rise time of the somatic EPSP, which is on the order of 3 to 10 ms or more. Moreover, in active dendritic structures, transient potassium currents can delay the generation of a postsynaptic AP by tens up to hundreds of milliseconds (McCormick, 1991). The small delay fluctuations we assume may come along with the appearance of new dendritic spines and synapses observed during the induction of long-term modifications, or with their unspecific stochastic disappearance (Engert & Bonhoeffer, 1999; Toni, Buchs, Nikonenko, Bron, & Muller, 1999). Other sources of delay fluctuations may be new axonal collaterals bifurcating from an existing synaptic connection and targeting a different
dendritic subtree. The axonal and dendritic delays of such new connections may be slightly different owing to different channel densities or morphological properties of the corresponding axonal collateral or dendritic subtrees (Engert & Bonhoeffer, 1999; Maletic-Savatic, Malinow, & Svoboda, 1999). Variable delays are particularly present at the developmental stage of the central nervous system, during which neurites can grow at speeds of up to 50 μm per hour (Gomez & Spitzer, 1999). This speed is inversely proportional to the frequency of the endogenous Ca²⁺ transients, and it is tempting to relate the Ca²⁺ signals decelerating neurite growth to the Ca²⁺ signals accelerating synaptic long-term changes once the connections are formed (see Bliss & Collingridge, 1993). Evidence for the formation of new connections during development is given by the functional differences between the axonal arborizations of cortical V1 cells in young and adult cats. In the juvenile animal, horizontal axon collaterals often form connections only in a restricted proximal region, while connections are found on distal axon terminals in the adult animal (Katz & Shatz, 1996). In pyramidal cells, dendritic latencies may change during the gated growth of the apical tree toward the pial surface, where it integrates input from the superficial layers (Polleux, Morrow, & Ghosh, 2000). The hypothesis that the emergence and selection of interneuronal delays might be guided by a temporally asymmetric learning rule is further supported by the fact that this type of synaptic modification is particularly prominent in embryonic cultures (Bi & Poo, 1998, 1999) and in cortical slices of animals younger than 2 or 3 weeks (Markram et al., 1997; Feldman, 2000), while long-term plasticity cannot be found at synapses from the thalamic input after this stage (Crair & Malenka, 1995; Feldman, Nicoll, Malenka, & Isaac, 1998).
There are two possible concerns related to recent experimental findings that are worth discussing. First, we would like to emphasize that short-term synaptic plasticity, as found between cortico-cortical pyramidal cells (Markram, Pikus, Gupta, & Tsodyks, 1998), does not fundamentally change the existing picture of delay adaptation. Short-term depression, for instance, is believed to reduce the vesicle release probability as a function of the presynaptic activity and hence could easily be incorporated in our analysis, where the release probability explicitly enters. On a phenomenological level, synaptic depression accentuates changes in the presynaptic firing rates, which may even support the temporal structure between the pre- and postsynaptic cells (compare also Tsodyks & Markram, 1996; Abbott, Varela, Sen, & Nelson, 1997; and Senn, Segev, & Tsodyks, 1998). In this context, the long-term modification of synaptic depression (Markram & Tsodyks, 1996) has a similar effect on delay selection as the long-term modification of the absolute synaptic strength. Second, recent findings show that for CA1 pyramidal neurons in the hippocampus, the shape of the somatic EPSP is independent of the synaptic location on the dendritic tree (Magee & Cook, 2000). In our framework, this implies that there is no one-to-one correspondence between the synaptic location and the peak of the learning function.
It should be emphasized, however, that experimental support for the location independence of the synaptic response in mammals is present only in hippocampal neurons and, to some degree, in motoneurons (see Magee & Cook, 2000, for references). In contrast, neocortical layer 5 pyramidal cells, for instance, actively attenuate synaptic inputs from the apical tree to the soma and show a wide range of somatic EPSP rise times (Berger, Larkum, & Lüscher, 2001). These neurons can well act as integrator and coincidence detector in one (Larkum, Zhu, & Sakmann, 1999) and therefore require a mechanism that determines whether a synaptic input should supply contextual information (by a broad, distally generated EPSP) or timing information (by a sharp, proximally generated EPSP).

7.2 Some Predictions. Synaptic delay lines and their modification have recently been the focus of an experimental study in cultured networks of embryonic rat hippocampus (Bi & Poo, 1999). This work shows that the different mono- or polysynaptic delay lines between two neurons may indeed play a crucial role in the induction of either LTP or LTD. If paired-pulse stimulation of a presynaptic neuron triggers an AP in a postsynaptic neuron through a specific pathway, then the strengths of that pathway and of putatively shorter ones are upregulated, while pathways with putatively longer delays are downregulated. Our investigation suggests that simultaneous single-pulse stimulations at two different locations with interpulse intervals of 10, 30, and 50 ms would shift, after minutes of repetition, the (axonal) delays of the pathways toward 10, 30, and 50 ms, respectively. Besides studying network effects, it would be interesting to investigate in vitro synaptic modifications at neurons with delayed responses and to test whether the learning windows are in fact broadly tuned to match the dendritic latencies.
In the context of subthreshold receptive fields (Bringuier et al., 1999), our work predicts an asymmetric distribution of horizontal propagation velocities in the primary visual cortex of cats reared under unidirectional background motion (see Section 1). Another prediction is that, due to the self-organization of the synaptic locations, distal synapses with slow somatic EPSPs should have broader learning functions than proximal ones with fast somatic EPSPs. In fact, for certain connections to layer IV spiny stellate cells, the peak of the spike-based temporal modification function is measured to be at $t_{post} - t_{pre} = 60$ ms (Cowan & Stricker, 1999).

7.3 A Further Example: Learning Delays Between Cortical Areas. Our analysis also applies to the problem of delay adaptation between neuronal populations with correlated activities. As an example, we may consider the learning of finger sequences, say, for playing a musical instrument. The storage and recall of such sequences require a temporal control of the activity in the motor cortex, from which the appropriate spinal motor patterns are recruited. According to a current hypothesis, the automation of motor sequences is a form of habit learning, which involves the basal ganglia–
Figure 8: Learning time delays between neuronal populations. (a) Sequences of finger movements, for example, for playing a musical instrument, require the temporal control of activity in the motor cortex. For instance, when looking at the notes a-b, the visual stimulus activates, through cortico-cortical pathways, the populations A and B in the motor cortex in sequence, at times $t_A$ and $t_B = t_A + |\Delta t|$. When automatizing such finger sequences, the corresponding activity pattern is believed to be stored in the basal ganglia (S), from where it can be retrieved through the basal ganglia–thalamocortical feedback loop. (b) The temporal structure between the activities at the sites A, B, and S in the motor cortex and the basal ganglia, respectively, is similar to that considered for a single synaptic connection (see Figure 1a, where A, B, and S stand for the presynaptic soma, the postsynaptic soma, and the synaptic site, respectively), although it may extend over a larger timescale. A reinforcement signal from B to S (mediated, e.g., by dopamine projections), modifying the strength of the relay S according to a temporally asymmetric rule, may similarly select pathways through S that support a sequence of activities in the upper area. Assuming a variety of delay lines with stochastic transmission, our analysis predicts that after repeated presentation of the visual stimulus sequence, the total forward delay through S will eventually match the time difference between the cortical populations A and B, $D^f_{AS} + D^f_{SB} = |\Delta t|$. This is possible even in the presence of a delay $D^b_{SB}$ of the reinforcement signal. In this case, the forward delays are selected such that, in addition, $D^f_{SB} + D^b_{SB} = a$ holds, where $a$ is the peak of the learning function of the relay S (cf. also Figure 5).
thalamocortical feedback loop (Petri & Mishkin, 1994; Rolls & Treves, 1998). Such learning is driven by an external input and probably involves internally a delayed reinforcement signal. Let us assume that when looking in sequence at two notes, say a-b, the visual input eventually evokes through the parietal pathway well-timed activity at sites A and B, respectively, in the motor cortex (see Figure 8a; cf. also Ungerleider, 1995). During the many repetitions of the finger sequence, the basal ganglia are taught by the motor
cortex to store a mirror image of the cortical spatiotemporal activity pattern. The input from the basal ganglia to the motor cortex, which correctly predicts the ongoing visual input, may evoke a reinforcement signal back to the basal ganglia, for instance, through dopamine projections, which strengthens the appropriate pathways through a temporally asymmetric rule. After learning, the motor sequence can be retrieved in the habit mode by a sequence of look-ups from the basal ganglia, thereby freeing the motor cortex for other, higher-level tasks. The temporal structure between the two cortical populations and the corresponding memory trace in the basal ganglia is similar to that between a pre- and postsynaptic neuron and the connecting synapse (see Figure 8b). Since the reinforcement signal intrinsically has a delay beyond the temporal precision of the motion, an additional gating mechanism is required that takes this delay into account. Our analysis shows that a temporally asymmetric learning rule is enough to solve this timing problem. Put into a general framework, the problem is to adjust the delay from A to B, given that the tuning mechanism is located at some intermediate site S and hence itself receives only delayed signals from A and B. It is not evident how such a delay adaptation can work at all. Summarized, the constraints are:

Implicitness: No explicit mechanism exists for a directed adaptation of the individual delays. Rather, the average delay of different nearby delay lines can only implicitly be adapted by strengthening or weakening the connections in the presence of slow, unbiased delay fluctuations.

Locality: There is no global view that tells the time elapsed between the events at the two sites. Rather, only the time differences between a forward and a backward signal can be locally measured at the site S in between the two signal sources.
Directionality: In adapting the delay line based on the local time differences, one should take into account the backward delay from site B to site S and the fact that this backward delay may be different from the corresponding forward delay.

Recurrence: The change in the connection strengths may itself influence the activity at location B and thus change the statistics of the temporal relationship between the signals at the two locations.

As we showed, a temporally asymmetric learning rule may simultaneously cope with all these constraints.

7.4 Comparison with Recent Theoretical Work. The apparent importance of temporal relationships among cortical activities motivated different studies on delay adaptation. Hüning et al. (1998) suggested that synaptic delays should adapt proportionally to the negative temporal derivative of the EPSP, with its peak centered at the postsynaptic spike time. Such an explicit
delay adaptation, which assumes temporal modifications of intracellular messenger cascades, is, however, difficult to motivate, apart from the fact that the effective synaptic delays measured at central synapses are in the range of 0.4 ms (Lloyd, 1960), a number that was corrected downward in a later study to 0.17 ms (Munson & Sypert, 1979). Other works suggested that delay changes should have qualitatively the same form as the learning function for the synaptic weights (Eurich, Cowan, & Milton, 1997). If deduced from the modification of the synaptic strength, however, the rule for the delay adaptation is not given by the derivative of the EPSP and is not identical to the original weight modification. In contrast, as our analysis shows, the best motivated form for an explicit delay adaptation is given by the derivative of the original weight modification function (cf. Figure 2a). In a series of other works, delay modifications are suggested to result from two independent processes: delay selection through weight modification and delay shifts through explicit delay adaptation (Eurich, Ernst, & Pawelzik, 1998; Eurich et al., 1999). However, no physiological evidence for an explicit delay adaptation exists so far, and, as we argue, no such rule is in fact necessary. In the presence of weak delay fluctuations and stochastic synaptic transmission, a temporally asymmetric synaptic weight modification is enough to explain axonal and dendritic delay shifts. While the adaptation of the total forward delay occurs independently of the peak in the learning function, this peak implicitly determines the width of the somatic EPSP. Hence, during development, the rule controls axonal delays and dendritic latencies to support stimulus-driven temporal structures and to generate different types of causal relationships between neuronal activities.

Appendix

A.1 Proof of Equation 4.3.
Defining the normalized density $p(\delta) = \frac{w(\delta, t)}{\langle w \rangle_t}$ of connections from neuron B to neuron A with relative delay $\delta$, we may write

$$\bar{\delta} = \int_{-\infty}^{+\infty} \delta\, p(\delta)\, d\delta \equiv \langle \delta\, p(\delta) \rangle,$$

where by definition $\langle \cdot \rangle$ denotes the integral over $\delta$. Inserting the expansion

$$\psi(\Delta t + \delta) \approx \psi(\Delta t + \bar{\delta}) + (\delta - \bar{\delta})\, \psi'(\Delta t + \bar{\delta}) \tag{A.1}$$

into the rule 3.10 and averaging with respect to $\delta$, we get

$$\langle \dot{w} \rangle \approx \langle w \rangle\, \psi(\Delta t + \bar{\delta}) + \langle w (\delta - \bar{\delta}) \rangle\, \psi'(\Delta t + \bar{\delta}) - \langle w \rangle \bigl(c_1 \langle w \rangle + c_2\bigr) + \epsilon^2 \left\langle \frac{\partial^2 w}{\partial \delta^2} \right\rangle. \tag{A.2}$$
This approximation is good if we assume that the width $\sigma$ of the delay distribution is narrow compared to the width $2(a + \varsigma)$ of $\psi$. Evaluating the integral in the last term on the right-hand side, we find $\langle \frac{\partial^2 w}{\partial \delta^2} \rangle = 0$, since we may assume that the derivative $\frac{\partial w}{\partial \delta}$ vanishes at the boundaries $\delta = \pm\infty$. Since

$$\langle w (\delta - \bar{\delta}) \rangle = \langle w \delta \rangle - \langle w \rangle \bar{\delta} = \frac{\langle w \delta \rangle}{\langle w \rangle} \langle w \rangle - \langle w \rangle \bar{\delta} = 0,$$

the second term in equation A.2 cancels as well. Multiplying the remainder of equation A.2 by $\bar{\delta}$, we get

$$\bar{\delta} \langle \dot{w} \rangle \approx \bar{\delta} \langle w \rangle\, \psi(\Delta t + \bar{\delta}) - \bar{\delta} \langle w \rangle \bigl(c_1 \langle w \rangle + c_2\bigr). \tag{A.3}$$

By equation 3.10 and the Taylor expansion, equation A.1, we also approximate after averaging

$$\langle \delta \dot{w} \rangle \approx \langle \delta w \rangle\, \psi(\Delta t + \bar{\delta}) + \bigl( \langle \delta^2 w \rangle - \langle \delta w \rangle \bar{\delta} \bigr)\, \psi'(\Delta t + \bar{\delta}) - \langle \delta w \rangle \bigl(c_1 \langle w \rangle + c_2\bigr) + \epsilon^2 \left\langle \frac{\partial^2 (\delta w)}{\partial \delta^2} \right\rangle. \tag{A.4}$$

Again, the last term on the right-hand side vanishes by partial integration with boundary conditions $w|_{\pm\infty} = \frac{\partial w}{\partial \delta}\big|_{\pm\infty} = 0$. The factor in the second term reduces to

$$\langle \delta^2 w \rangle - \langle \delta w \rangle \bar{\delta} = \left( \frac{\langle \delta^2 w \rangle}{\langle w \rangle} - \frac{\langle \delta w \rangle}{\langle w \rangle}\, \bar{\delta} \right) \langle w \rangle = \bigl( \overline{\delta^2} - \bar{\delta}^2 \bigr) \langle w \rangle = \sigma^2 \langle w \rangle,$$

where $\sigma^2$ is the variance of the delay distribution given by the density $p(\delta)$. Using $\langle \delta w \rangle = \bar{\delta} \langle w \rangle$, equation A.4 simplifies to

$$\langle \delta \dot{w} \rangle \approx \bar{\delta} \langle w \rangle\, \psi(\Delta t + \bar{\delta}) + \sigma^2 \langle w \rangle\, \psi'(\Delta t + \bar{\delta}) - \bar{\delta} \langle w \rangle \bigl(c_1 \langle w \rangle + c_2\bigr). \tag{A.5}$$

Subtracting equation A.3 from A.5, we obtain

$$\langle \delta \dot{w} \rangle - \bar{\delta} \langle \dot{w} \rangle \approx \langle w \rangle\, \sigma^2\, \psi'(\Delta t + \bar{\delta}),$$

and combining this with equation 4.2, we get the desired rule, equation 4.3, governing the adaptation of the average relative delay.

A.2 Proof of Equation 4.4. We first show that the weight distribution is nonsingular. To this end, we derive the differential equation for $\sigma^2$ and show that for small $\sigma^2$, we always have $\frac{d\sigma^2}{dt} > 0$. Substituting equation 3.10 for $\frac{dw}{dt}$, one calculates

$$\frac{d}{dt} \sigma^2 = \frac{d}{dt} \bigl( \overline{\delta^2} - \bar{\delta}^2 \bigr) = \frac{d}{dt} \left( \frac{\langle \delta^2 w \rangle}{\langle w \rangle} - \frac{\langle \delta w \rangle^2}{\langle w \rangle^2} \right) = \cdots = 2\epsilon^2 + \bigl( \overline{\delta^2 \psi} - \overline{\delta^2}\, \overline{\psi} \bigr) - 2 \bar{\delta} \bigl( \overline{\delta \psi} - \bar{\delta}\, \overline{\psi} \bigr),$$
where the overline denotes the average with respect to the normalized density $p(\delta)$. In the course of this calculation, we used that by partial integration with boundary conditions $w|_{\pm\infty} = \frac{\partial w}{\partial \delta}\big|_{\pm\infty} = 0$, one finds $\langle \frac{\partial^2 w}{\partial \delta^2} \rangle = \langle \delta \frac{\partial^2 w}{\partial \delta^2} \rangle = 0$ and $\langle \delta^2 \frac{\partial^2 w}{\partial \delta^2} \rangle = 2 \langle w \rangle$. Assuming that $\sigma^2 \to 0$, the density $p$ would become a delta function, and the average would commute with the product. Hence we would also have $\overline{\delta^2 \psi} - \overline{\delta^2}\, \overline{\psi} \to 0$ and $\overline{\delta \psi} - \bar{\delta}\, \overline{\psi} \to 0$, and therefore $\frac{d}{dt} \sigma^2 \to 2\epsilon^2$. But this contradicts the above assumption, and we conclude that $\sigma^2$ cannot, in fact, shrink to zero.

We can now fit the nonsingular steady-state distribution with the gaussian $\tilde{w}(\delta) = w(\bar{\delta}) \exp\bigl( -\frac{(\delta - \bar{\delta})^2}{2\sigma^2} \bigr)$. Inserting $\tilde{w}$ for $w$ in equation 3.10 and evaluating at $\bar{\delta}$ and $\bar{\delta} \pm \sigma$ yields $0 = \psi(-\sqrt{a^2 + \varsigma^2}) - c_1 \langle \tilde{w} \rangle - c_2 - \epsilon^2/\sigma^2$ and $0 = \psi(-\sqrt{a^2 + \varsigma^2} \pm \sigma) - c_1 \langle \tilde{w} \rangle - c_2$. Eliminating $c_1 \langle \tilde{w} \rangle + c_2$ and using the approximation $\psi(-\sqrt{a^2 + \varsigma^2}) - \psi(-\sqrt{a^2 + \varsigma^2} \pm \sigma) \approx -\psi''(-\sqrt{a^2 + \varsigma^2})\, \sigma^2/2$ gives $\psi''(-\sqrt{a^2 + \varsigma^2})\, \sigma^2/2 + \epsilon^2/\sigma^2 = 0$, from which equation 4.4 follows. Finally, the total synaptic weight at the steady state can be qualitatively approximated by $\langle \tilde{w} \rangle \approx \bigl( \psi(-\sqrt{a^2 + \varsigma^2}) - c_2 - \epsilon^2/\sigma^2 \bigr)/c_1$, with $\sigma$ given by equation 4.4. Since $\langle \tilde{w} \rangle = \sigma \sqrt{2\pi}\, w(\bar{\delta})$, the maximal weight at steady state is roughly

$$w(\bar{\delta}) \approx \frac{\psi(-\sqrt{a^2 + \varsigma^2}) - c_2 - \epsilon^2/\sigma^2}{c_1\, \sigma \sqrt{2\pi}}.$$
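The central step of appendix A.1, namely that for a narrow delay density the combination $\langle \delta \dot{w} \rangle - \bar{\delta} \langle \dot{w} \rangle$ reduces to $\sigma^2 \langle w \rangle\, \psi'(\Delta t + \bar{\delta})$, can be verified numerically. The paper's learning function (equation 2.2) is not reproduced in this section, so a hypothetical smooth $\psi$ is used below; the identity relies only on the Taylor expansion A.1, and the decay and diffusion terms of rule 3.10 cancel in this combination, so only the drift part is needed:

```python
import numpy as np

# Hypothetical smooth learning function (stand-in for eq. 2.2) and its
# exact derivative; any smooth psi works for this check.
def psi(s):
    return np.exp(-0.5 * ((s + 14.0) / 10.0) ** 2) - 0.3

def dpsi(s):
    return -((s + 14.0) / 100.0) * np.exp(-0.5 * ((s + 14.0) / 10.0) ** 2)

delta = np.linspace(-40.0, 40.0, 20001)            # delay axis (ms)
dd = delta[1] - delta[0]
dbar, sigma, dt_pre = -5.0, 0.5, -3.0              # sigma << width of psi
w = np.exp(-0.5 * ((delta - dbar) / sigma) ** 2)   # narrow weight density

avg = lambda f: np.sum(f) * dd                     # <.> = integral over delta
wdot = w * psi(dt_pre + delta)                     # drift part of dw/dt
lhs = avg(delta * wdot) - dbar * avg(wdot)         # <delta wdot> - dbar <wdot>
rhs = sigma ** 2 * avg(w) * dpsi(dt_pre + dbar)    # sigma^2 <w> psi'(Dt+dbar)
```

With $\sigma$ much smaller than the width of $\psi$, the two sides agree to well under 2%, and the residual shrinks as $\sigma^2$, as expected from the neglected higher-order term of the expansion.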
A.3 Proof of Equation 5.5. For fixed presynaptic delays, we have, according to equation 5.3,

$$\frac{dt_B}{dt} \approx \bigl(1 + R'(\bar{D})\bigr)\, \frac{d\bar{D}}{dt} = \frac{dD_{post}}{dD}\, \frac{d\bar{D}}{dt}, \tag{A.6}$$

where $\bar{D}$ is the average forward dendritic delay and $\bar{R}(D) \approx R(\bar{D})$ the average rise time. Since we are interested in replacing the factor $d\bar{D}/dt$ in equation A.6 by a term containing the learning function $\psi_\pm$, we first replace the change in $\bar{D}$ by a change in $\bar{B}$, from where we bridge to $\psi_\pm$. Thus, we first write

$$\frac{d\bar{D}}{dt} = \left( \frac{dB}{dD} \right)^{-1} \frac{dB(\bar{D})}{dt}. \tag{A.7}$$

In the same way as we deduced equation 4.3 in section A.1 (note that there we have $\frac{d\overline{\Delta t}_{syn}}{dt} = \frac{d(\Delta t + \bar{\delta})}{dt} = \frac{d\bar{\delta}}{dt}$), we may approximate the change of the average time difference $\overline{\Delta t}_{syn} = t_A + D_{pre} - (t_B + \bar{B})$ at the synaptic site by

$$\frac{d\overline{\Delta t}_{syn}}{dt} = -\frac{dt_B}{dt} - \frac{d\bar{B}}{dt} \approx \sigma_B^2\, \psi_\pm'(\overline{\Delta t}_{syn}), \tag{A.8}$$
where $\sigma_B$ is the standard deviation of the distribution of $B(D)$. Solving equation A.8 for $\frac{d\bar{B}}{dt}$, identifying $\frac{d\bar{B}}{dt} \approx \frac{dB(\bar{D})}{dt}$, and inserting into equation A.7 yields

$$\frac{d\bar{D}}{dt} \approx \left( -\frac{dt_B}{dt} - \sigma_B^2\, \psi_\pm' \right) \left( \frac{dB(\bar{D})}{dD} \right)^{-1}. \tag{A.9}$$

Inserting equation A.9 into A.6 and solving for $\frac{dt_B}{dt}$, we obtain

$$\frac{dt_B}{dt} \approx -\sigma_B^2\, \psi_\pm'\, \frac{B'^{-1}(1 + R')}{1 + B'^{-1}(1 + R')}.$$

Estimating the standard deviation of the backward delay, $\sigma_B$, by the one for the forward dendritic delay, $\sigma_B \approx B'(\bar{D})\, \sigma$, yields the desired formula, equation 5.5.

Acknowledgments
This study was supported by the Swiss National Science Foundation (grant 21-57076.99) and the Silva Casa Foundation (W. S.). W. S. thanks Stefano Fusi for helpful discussions.

References

Abbott, L., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neuronal response variability. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Agmon-Snir, H., & Segev, I. (1993). Signal delay and input synchronization in passive dendritic structures. J. Neurophysiol., 70(5), 2066–2085.
Berger, T., Larkum, M., & Lüscher, H.-R. (2001). A high Ih channel density in the distal apical dendrite of layer V neocortical pyramidal cells increases bidirectional attenuation of EPSPs. J. Neurophysiol., 85, 855–868.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neuroscience, 18(24), 10464–10472.
Bi, G., & Poo, M. (1999). Distributed synaptic modifications in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Bliss, T., & Collingridge, G. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Bringuier, V., Chavane, F., Glaeser, L., & Frégnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699.
Carr, C., & Friedman, M. (1999). Evolution of time coding systems. Neural Computation, 11, 1–20.
Cowan, A., & Stricker, C. (1999). Long-term synaptic plasticity in the granular layer of rat barrel cortex. Society for Neuroscience, Abstracts, no. 689.7.
Crair, M., & Malenka, R. (1995). A critical period for long-term potentiation at thalamocortical synapses. Nature, 375, 325–328.
Engert, F., & Bonhoeffer, T. (1999). Dendritic spine changes associated with hippocampal long-term synaptic plasticity. Nature, 399, 66–70.
Eurich, C., Cowan, J., & Milton, J. (1997). Hebbian delay adaptation in a network of integrate-and-fire neurons. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial Neural Networks—ICANN'97 (pp. 157–162). Berlin: Springer-Verlag.
Eurich, C., Ernst, U., & Pawelzik, K. (1998). Continuous dynamics of neuronal delay adaptation. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), Artificial Neural Networks—ICANN'98 (pp. 355–360). Berlin: Springer-Verlag.
Eurich, C., Pawelzik, K., Ernst, U., Cowan, J., & Milton, J. (1999). Dynamics of self-organized delay adaptation. Phys. Rev. Lett., 82(7), 1594–1597.
Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Feldman, D., Nicoll, R., Malenka, R., & Isaac, J. (1998). Long-term depression at thalamocortical synapses in developing rat somatosensory cortex. Neuron, 21, 347–357.
Gerstner, W. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybernetics, 69, 503–515.
Gerstner, W., Kempter, R., van Hemmen, J., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gomez, T., & Spitzer, N. (1999). In vivo regulation of axon extension and pathfinding by growth-cone calcium transients. Nature, 397, 350–355.
Hopfield, J., & Brody, C. (2000). What is a moment? Cortical sensory integration over brief intervals. PNAS, 97, 13919–13924.
Hopfield, J., & Brody, C. (2001). What is a moment? Transient synchrony as a collective mechanism for spatiotemporal integration. PNAS, 98, 1282–1287.
Hüning, H., Glünder, H., & Palm, G. (1998). Synaptic delay learning in pulse-coupled neurons. Neural Computation, 10, 555–565.
Katz, L., & Shatz, C. (1996). Synaptic activity and the construction of cortical circuits. Science, 274, 1133–1138.
Kempter, R., Gerstner, W., & van Hemmen, J. (1999). Spike-based compared to rate-based Hebbian learning. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Kühn, R., & van Hemmen, J. (1991). Temporal association. In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks (pp. 213–280). Berlin: Springer-Verlag.
Larkum, M., Zhu, J., & Sakmann, B. (1999). A novel cellular mechanism for coupling inputs arriving at different cortical layers. Nature, 398, 338–341.
Lloyd, D. (1960). Section I: Neurophysiology. In H. Magoun (Ed.), Handbook of physiology (Vol. 2, pp. 929–949). Washington, DC: American Physiology Society.
618
Walter Senn, Martin Schneider, and Berthold Ruf
Magee, J., & Cook, E. (2000). Somatic EPSP amplitude is independent of synapse location in hippocampal pyramidal neurons. Nature Neuroscience, 3(9), 895– 903. Maletic-Savatic,M., Malinow, R., & Svoboda, K. (1999).Rapid dendritic morphogenesis in CA1 hippocampal dendrites induced by synaptic activity. Science, 283, 1923–1927. Markram, H., Lubke, ¨ J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efcacy by concidence of postsynaptic APs and EPSPs. Science, 275, 213–215. Markram, H., Pikus, D., Gupta, A., & Tsodyks, M. (1998). Potential for multiple mechanisms, phenomena and algorithms for synaptic plasticity at single synapses. Neuropharmacology, 37, 489–500. Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efcacy between neocortical pyramidal neurons. Nature, 382, 807–810. McCormick, D. (1991). Functional properties of slowly inactivating potassium current in guinea pig dorsal lateral geniculate relay neurons. J. Neuroscience, 66, 1176–1189. Munson, J., & Sypert, G. (1979). Properties of single bre excitatory postsynaptic potentials in triceps surae motoneurones. J. Physiol. (London), 296, 329–342. Natschl¨ager, T., & Ruf, B. (1998). Spatial and temporal pattern analysis via spiking neurons. Network: Computation in Neural Systems, 9, 319–332. Nieuwenhuys, R. (1994). The neocortex: An overview of its evolutionary development, structural organization and synaptology. Anat Embryol., 190, 307– 337. Petri, H., & Mishkin, M. (1994). Behaviorism, cognitivism, and the neurophysiology of memory. American Scientist, 82, 30–37. Polleux, F., Morrow, T., & Glosh, A. (2000). Semaphorin 3A is a chemoattractant for cortical apical dendrites. Nature, 404, 567–573. Roberts, P. (1999).Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Computational Neuroscience, 7, 235–246. Rolls, E., & Treves, A. (1998). Neural networks and brain functions. Oxford: Oxford University Press. 
Segev, I., Rapp, M., Manor, Y., & Yarom, Y. (1992). Analog and digital processing in single nerve cells: Dendritic integration and axonal propagation. In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation (pp. 173– 198). Orlando, FL: Academic Press. Senn, W., Segev, I., & Tsodyks, M. (1998). Reading neural synchrony with depressing synapses. Neural Computation, 10, 815–819. Senn, W., Tsodyks, M., & Markram, H. (2001).An algorithm for modifying neurotransmitter release probability based on pre-and post-synaptic spike timing. Neural Computation, 13(1), 35–68. Sompolinsky, H., & Kanter, I. (1986). Temporal association in asymmetric neural networks, I. Phys. Rev. Lett., 57, 2861–2864. Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-timing dependent synaptic plasticity. Nature Neuroscience, 3, 919– 926.
Activity-Dependent Development of Axonal and Dendritic Delays
619
Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials in cerebellar Purkinje cells. Nature, 367, 69–72. Toni, N., Buchs, P.-A., Nikonenko, I., Bron, C., & Muller, D. (1999). LTP promotes formation of multiple spine synapses between a single axon terminal and a dendrite. Nature, 402, 421–425. Tsodyks, M., & Markram, H. (1996). Plasticity of neocortical synapses enables transitions between rate and temporal coding. In C. von der Malsburg (Ed.), Proceedings of the ICANN’96 (pp. 445–450). Berlin: Springer-Verlag. Ungerleider, L. (1995). Functional brain image studies of cortical mechanisms for memory. Science, 270, 769–775. Zhang, L., Tao, H., Holt, C., Harris, W., & Poo, M. (1998). A critical window in the cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44. Received May 31, 2000; accepted May 24, 2001.
LETTER
Communicated by Paul Tiesinga
Impact of Geometrical Structures on the Output of Neuronal Models: A Theoretical and Numerical Analysis

Jianfeng Feng
[email protected]
Computational Neuroscience Laboratory, Babraham Institute, Cambridge CB2 4AT, U.K., and COGS, University of Sussex at Brighton, BN1 9QH, U.K.

Guibin Li
[email protected]
Computational Neuroscience Laboratory, Babraham Institute, Cambridge CB2 4AT, U.K.

What is the difference between the efferent spike train of a neuron with a large soma and that of a neuron with a small soma? We propose an analytical method, called the decoupling approach, to tackle the problem. Two limiting cases, in which the soma is much smaller than the dendrite or vice versa, are investigated theoretically. For both the two-compartment integrate-and-fire model and the Pinsky-Rinzel model, we show, both theoretically and numerically, that the smaller the soma is, the faster and the more irregularly the neuron fires. We further conclude, in terms of numerical simulations, that cells falling in between the two limiting cases form a continuum with respect to their firing properties (mean firing time and coefficient of variation of interspike intervals).

1 Introduction
It is well documented in the literature that the geometrical structure of a neuron contributes considerably to its information processing capacity (Koch, 1999). However, a fully detailed model is usually hard to study theoretically, and the nonhomogeneous distribution of ionic channels along dendritic trees prevents such an investigation. Hence, devising methods for approximating detailed biophysical models by simple models, which preserve the essential complexity of the detailed biophysical mechanism yet are simultaneously concise and transparent, is an important continuing task in computational neuroscience. The advantages are obvious: detailed biophysical models are usually difficult to understand and can be extremely time-consuming to simulate. Thus, two-compartment models, which mimic detailed biophysical neuron models of hundreds of compartments, were proposed and studied in Mainen and Sejnowski (1996) and Pinsky and Rinzel (1994). Their results, however, are exclusively numerical
Neural Computation 14, 621–640 (2002)
© 2002 Massachusetts Institute of Technology
and are primarily based on models with deterministic inputs. The main question we address in this article is: Can we, at least in some extreme cases, theoretically understand the behavior of two-compartment models? Such a study will provide new insight into the problem of understanding the function of neuronal morphology. Two-compartment models with deterministic inputs have been extensively studied in Mainen and Sejnowski (1996) and Pinsky and Rinzel (1994). A more interesting and theoretically more challenging problem is to investigate the model behavior with stochastic inputs, since the stochastic part of an input signal might play a functional role in processing information (see Collins, Chow, & Imhoff, 1995; Feng, 1997; Harris & Wolpert, 1998; and Sejnowski, 1998). We propose a novel theoretical approach, which we call the decoupling approach, to investigate the properties of two-compartment models, in particular, how somatic size affects the output when the neuron receives stochastic inputs through the dendritic compartment. The theory is general enough to be applied both to abstract models, such as the two-compartment integrate-and-fire (IF) model, and to biophysical models, such as the Pinsky-Rinzel model. When the somatic compartment is small, the model tends to burst, and the burst length is totally determined by the activity of the dendritic compartment. In other words, the somatic compartment is only a reporter of the dendritic compartment's activity. When the somatic compartment is large, the model can also be reduced to a single compartment. In both cases, the problem of studying a complicated dynamical system of two coupled variables is reduced to a much simpler dynamical system of a single variable; hence the name decoupling approach. As a consequence, all theoretical results developed in the literature for single-compartment models can be applied.
For example, we could theoretically estimate the burst length and firing rates of the two-compartment IF model. Our conclusions are obtained for generic two-compartment models. The decoupling approach serves as a bridge between one-compartment and two-compartment models. For both the IF and the Pinsky-Rinzel models, in terms of numerical simulations, we further conclude that cells falling in between the two limiting cases of large soma and small soma form a continuum with respect to their firing properties (mean firing time and coefficient of variation of interspike intervals). Hence, we grasp the firing properties of the models in the general case once we know that the two limiting cases can be tackled by a theoretical approach. The paper is organized as follows. In section 2, the two-compartment IF model and the Pinsky-Rinzel model are defined. Section 3 is devoted to theoretically analyzing generic two-compartment models. Section 4 presents
numerical results. As an application of the results of the previous sections, in Feng and Li (2001) we consider two-compartment models with signal inputs (nonhomogeneous inputs). Analytically, we prove that when the somatic compartment is small enough, the two-compartment IF model can naturally act as a slope detector via its bursting activity. Numerically, we show that the conclusion is also valid for the Pinsky-Rinzel model.

2 Models
Let us assume that a neuron is composed of two compartments: a somatic compartment and a dendritic compartment. Suppose that the cell receives excitatory postsynaptic potentials (EPSPs) at q_E excitatory synapses and inhibitory postsynaptic potentials (IPSPs) at q_I inhibitory synapses, and that V_s(t) and V_d(t) are the membrane potentials of the somatic and dendritic compartments at time t, respectively. The following two kinds of two-compartment models are considered in this article.

2.1 Two-Compartment IF Model. When the somatic membrane potential V_s(t) is between the resting potential V_rest and the threshold V_thre,
$$
\begin{cases}
dV_s(t) = -\dfrac{1}{\gamma}\,(V_s(t)-V_{rest})\,dt + g_c\,\dfrac{V_d(t)-V_s(t)}{p}\,dt \\[6pt]
dV_d(t) = -\dfrac{1}{\gamma}\,(V_d(t)-V_{rest})\,dt + g_c\,\dfrac{V_s(t)-V_d(t)}{1-p}\,dt + \dfrac{i_{syn}(t)\,dt}{1-p},
\end{cases}
\tag{2.1}
$$

where 1/γ is the decay rate and p is the ratio between the membrane area of the somatic compartment and that of the whole cell. g_c > 0 is a constant, and the synaptic input is

$$
i_{syn}(t)\,dt = a \sum_{i=1}^{q_E} dE_i(t) - b \sum_{j=1}^{q_I} dI_j(t),
$$

where E_i(t), I_j(t) are Poisson processes with rates λ_E and λ_I, respectively, and a, b are the magnitudes of each EPSP and IPSP. After V_s(t) crosses V_thre from below, a spike is generated, and V_s(t) is reset to V_rest. This model is called the two-compartment IF model. The interspike interval of efferent spikes is T(p) = inf{t: V_s(t) ≥ V_thre} for 1 > p > 0. It is well known that Poisson input can be approximated by

$$
I_{syn}(t) = \mu t + \sigma B_t,
$$
where B_t is the standard Brownian motion, μ = a q_E λ_E − b q_I λ_I, and σ = √(a² q_E λ_E + b² q_I λ_I). Thus, equation 2.1 can be approximated by

$$
\begin{cases}
dV_s(t) = -\dfrac{1}{\gamma}\,(V_s(t)-V_{rest})\,dt + g_c\,\dfrac{V_d(t)-V_s(t)}{p}\,dt \\[6pt]
dV_d(t) = -\dfrac{1}{\gamma}\,(V_d(t)-V_{rest})\,dt + g_c\,\dfrac{V_s(t)-V_d(t)}{1-p}\,dt + \dfrac{dI_{syn}(t)}{1-p}.
\end{cases}
\tag{2.2}
$$
In the following, we take the two-compartment IF model to be the model defined by equation 2.2.

2.2 Pinsky-Rinzel Model. We also consider a simplified, two-compartment biophysical model proposed by Pinsky and Rinzel (1994). They demonstrated that this model mimics a full, very detailed Traub model quite well. The Pinsky-Rinzel model is defined by
$$
\begin{cases}
C_m\,dV_s(t) = -I_{Leak}(V_s)\,dt - I_{Na}(V_s,h)\,dt - I_{K\text{-}DR}(V_s,n)\,dt + g_c\,\dfrac{V_d(t)-V_s(t)}{p}\,dt \\[6pt]
C_m\,dV_d(t) = -I_{Leak}(V_d)\,dt - I_{Ca}(V_d,s)\,dt - I_{K\text{-}AHP}(V_d,q)\,dt - I_{K\text{-}C}(V_d,Ca,c)\,dt \\[2pt]
\qquad\qquad\quad + g_c\,\dfrac{V_s(t)-V_d(t)}{1-p}\,dt + \dfrac{dI_{Syn}(t)}{1-p} \\[6pt]
Ca' = -0.002\,I_{Ca} - 0.0125\,Ca.
\end{cases}
\tag{2.3}
$$
In our calculations, all parameters and equations are identical to those used in Pinsky and Rinzel (1994) except for the parameters of the calcium equation, which are taken from Wang (1998).

3 Analytical Results

In this section, we consider two limiting cases: the limit of an infinitely small soma and the limit of a small dendrite. For these two cases, analytical results are derived that give the mean interspike interval as well as the variance and the coefficient of variation (CV). The analysis implies that neurons with a small soma fire more irregularly than neurons with a large soma. Small-soma models also show intermittent bursting; the interburst intervals and burst lengths are calculated. Without loss of generality, we assume that g_c = 1 in the following discussion.
3.1 Decoupling Approach. Let us now consider a generic two-compartment model defined by

$$
\begin{cases}
dV_s(t) = f(V_s,V_d)\,dt + \dfrac{V_d(t)-V_s(t)}{p}\,dt \\[6pt]
dV_d(t) = g(V_s,V_d)\,dt + \dfrac{V_s(t)-V_d(t)}{1-p}\,dt + \dfrac{dI_{syn}(t)}{1-p},
\end{cases}
\tag{3.1}
$$

where f, g are two functions. Equation 3.1 certainly includes the Pinsky-Rinzel model as a special case.

Case 1. p → 0. Multiplying both sides of the equation for the somatic compartment by p and taking p → 0, we see that V_d = V_s (for the existence of the limit, see appendix B). At the same time, the equation for the dendritic compartment becomes
$$
dV_d = g(V_d, V_d)\,dt + dI_{syn}(t),
$$
(3.2)
since p → 0 and V_d = V_s.

Case 2. p → 1. Multiplying both sides of the equation for the dendritic compartment by 1 − p and taking p → 1, we see that V_d − V_s = I′_syn. At the same time, the equation for the somatic compartment becomes

$$
dV_s = f(V_s,\, V_s + I'_{syn})\,dt + dI_{syn}(t).
$$
(3.3)
We emphasize that the derivation of equations 3.2 and 3.3 is independent of the concrete form of the synaptic input. Hence, the conclusions hold true for more biologically based inputs, such as AMPA, NMDA, GABA_A, and GABA_B (Destexhe, Mainen, & Sejnowski, 1998). Summarizing the results above, we arrive at the following theorem:

Theorem 1 (Decoupling Theorem). For a two-compartment model defined by equation 3.1, we have the following conclusions. When p → 0, we have V_s = V_d and dV_d = g(V_d, V_d) dt + dI_syn(t). When p → 1, the model is reduced to

$$
dV_s = f(V_s,\, V_s + I'_{syn})\,dt + dI_{syn}(t).
$$
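The p → 0 statement of the theorem is easy to check numerically. The sketch below is our illustration, not the authors' code: it integrates the linear two-compartment system of equation 2.2 (without the somatic reset) with an Euler-Maruyama scheme and confirms that the gap between V_s and V_d shrinks as p decreases. All parameter values are arbitrary.

```python
import math
import random

def max_gap(p, gamma=20.2, gc=1.0, mu=0.5, sigma=1.0, dt=0.01, T=50.0, seed=1):
    """Euler-Maruyama integration of the linear two-compartment system
    (equation 2.2) without reset; returns the largest observed |V_s - V_d|."""
    rng = random.Random(seed)
    vs = vd = 0.0
    gap = 0.0
    for _ in range(int(T / dt)):
        db = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment
        dvs = (-vs / gamma + gc * (vd - vs) / p) * dt
        dvd = (-vd / gamma + gc * (vs - vd) / (1 - p)) * dt \
              + (mu * dt + sigma * db) / (1 - p)
        vs, vd = vs + dvs, vd + dvd
        gap = max(gap, abs(vs - vd))
    return gap

gap_small_p = max_gap(p=0.02)  # soma much smaller than dendrite
gap_mid_p = max_gap(p=0.5)
print(gap_small_p, gap_mid_p)
```

For small p, the coupling term (V_d − V_s)/p dominates and forces the two potentials together, exactly as the theorem asserts.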
3.2 Firing Patterns. We can now apply theorem 1 to the two-compartment IF and Pinsky-Rinzel models. Let us consider the two-compartment IF model first.
Case 1. p → 0. The somatic membrane potential is identical to that of the dendritic compartment. Suppose that V_d(t) > V_thre for t ∈ [t0, t1]. Then V_s(t) will fire very fast during [t0, t1]; in theory, it fires with an infinite frequency. This cannot happen in a real neuron, which has a refractory period. On the other hand, if V_d(t) < V_thre for t ∈ [t0, t1], then V_s(t) = V_d < V_thre, and the neuron is silent. The analysis above for p = 0 gives rise to bursting behavior. During an interval [t0, t1] in which V_d > V_thre, the cell fires with a very high frequency; during an interval [t0, t1] in which V_d < V_thre, the cell is completely silent. Figure 1 provides an intuitive picture of the firing pattern. The somatic compartment reports the activity of the dendritic compartment: it bursts whenever the membrane potential of the dendritic compartment is above the threshold and is completely silent whenever the membrane potential of the dendritic compartment is below the threshold.
Case 2. p → 1. The behavior of the two-compartment model is the same as that of the conventional IF model, which is well studied in the literature (Brown, Feng, & Feerick, 1999; Feng, 1997, in press). When γμ > V_thre, the output spike trains are generally regular, with a CV smaller than 0.5; in this case, threshold crossings are mainly due to deterministic forces. When γμ < V_thre, the output spike trains are irregular, with a CV greater than 0.5, and threshold crossings are primarily due to random fluctuations (refer to Figure 2 for an intuitive illustration). The issue of how to generate spike trains with a large CV has been extensively discussed in the literature (Bernander, Douglas, & Koch, 1992; Destexhe & Pare, 2000; Shadlen & Newsome, 1998; Schneidman, Freedman, & Segev, 1998; Troyer & Miller, 1997). Nevertheless, it seems that all computational models concentrate on synaptic inputs or biophysical neurons. Our results here provide a transparent, geometrical picture of how to generate spike trains with a large CV.
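The two CV regimes of the p → 1 limit can be reproduced with a one-compartment IF simulation. The following sketch is ours, not taken from the paper, and the parameter values are merely illustrative: it estimates the CV of the interspike intervals for a suprathreshold drift (γμ > V_thre) and a subthreshold drift (γμ < V_thre).

```python
import math
import random

def isi_cv(mu, sigma, gamma=20.2, v_thre=20.0, v_rest=0.0,
           dt=0.05, n_spikes=150, seed=7):
    """Euler simulation of the one-compartment IF limit (p -> 1):
    dV = -(V - V_rest)/gamma dt + mu dt + sigma dB, with reset at V_thre.
    Returns the coefficient of variation of the interspike intervals."""
    rng = random.Random(seed)
    v, t, last = v_rest, 0.0, 0.0
    isis = []
    while len(isis) < n_spikes:
        v += (-(v - v_rest) / gamma + mu) * dt + sigma * rng.gauss(0.0, math.sqrt(dt))
        t += dt
        if v >= v_thre:           # spike: record ISI and reset
            isis.append(t - last)
            last, v = t, v_rest
    mean = sum(isis) / len(isis)
    var = sum((x - mean) ** 2 for x in isis) / len(isis)
    return math.sqrt(var) / mean

cv_supra = isi_cv(mu=2.0, sigma=1.0)  # gamma*mu = 40.4 > V_thre: regular
cv_sub = isi_cv(mu=0.8, sigma=1.0)    # gamma*mu = 16.2 < V_thre: irregular
print(cv_supra, cv_sub)
```

With the drift above threshold, crossings are deterministic and the CV stays well below 0.5; with the drift below threshold, crossings are fluctuation-driven and the CV is large.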
Similar arguments can be applied to the Pinsky-Rinzel model. When p → 1, the model is reduced to a somatic compartment. When p → 0, if the neuron fires, it will fire with its highest possible frequency, provided that V_d(t) is greater than the threshold of the somatic membrane potential.

3.3 Statistical Characteristics. For the case of small p, it is not very informative to calculate the interspike intervals. It is more interesting to consider the interburst interval T_i and the burst length T_l, which are defined as follows:
$$
\begin{cases}
T_i = \sup\{t: V_d(0) = V_{thre},\; V_d(s) \le V_{thre} \text{ for } 0 \le s \le t\} \\[4pt]
T_l = \sup\{t: V_d(0) = V_{thre},\; V_d(s) \ge V_{thre} \text{ for } 0 \le s \le t\}.
\end{cases}
$$
(3.4)
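Empirically, interburst intervals and burst lengths can be read off a sampled trace by thresholding. A small helper in the spirit of equation 3.4 (our illustration; the function name and the toy trace are ours):

```python
def burst_statistics(v, v_thre, dt):
    """Split a sampled trace v into burst lengths (samples with v >= v_thre)
    and interburst intervals (samples with v < v_thre), each in time units."""
    lengths, intervals = [], []
    run, above = 0, v[0] >= v_thre
    for x in v:
        now = x >= v_thre
        if now == above:
            run += 1
        else:
            # a run just ended: store it in the appropriate list
            (lengths if above else intervals).append(run * dt)
            run, above = 1, now
    (lengths if above else intervals).append(run * dt)  # final run
    return lengths, intervals

# toy trace: 3 samples below, 4 above, 2 below threshold 1.0, dt = 0.5
lengths, intervals = burst_statistics([0, 0, 0, 2, 2, 2, 2, 0, 0], 1.0, 0.5)
print(lengths, intervals)  # -> [2.0] [1.5, 1.0]
```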
Figure 1: Membrane potential of the two-compartment IF model with p = 0.05. Note that the membrane potentials of the soma and the dendrite are almost identical, except for the reset of the somatic compartment. Compare with Figure 2.
Note that V_s for p = 1 and V_d for p = 0 are very different when t is small, although they appear to be identical: V_s for p = 1 is not a stationary process, but V_d for p = 0 is a stationary process. Combining the arguments above and defining

$$
CV(p) = \sqrt{\langle T(p)^2\rangle - \langle T(p)\rangle^2}\;\big/\;\langle T(p)\rangle,
$$

we arrive at the following conclusions. For the two-compartment IF model, we have

$$
P(B_i)\,\langle T_i\rangle \sim \langle T(0)\rangle < \langle T(1)\rangle, \qquad CV(0) > CV(1),
$$
(3.5)
where a rigorous, analytical expression for ⟨T(1)⟩ is presented in the following and B_i = {V_d(0) = V_thre, dV_d/dt|_{t=0} < 0}. Numerical examples for 0 < p < 1 are included in the next section. In the following, we discuss the cases p = 1 and p = 0 separately. When p = 1, we consider the conventional one-compartment IF model. An analytical expression for the mean interspike interval of the IF model has been obtained in terms of the solution of partial differential equations, as shown in Musila and Lánský (1994), Ricciardi and Sato (1990), and Tuckwell
Figure 2: Membrane potential of the two-compartment IF model with p = 0.95. The sudden jumping up and down of the membrane potential between 0 mV (resting potential) and 20 mV (threshold) is due to the reset of the IF model.
(1988). However, such a formula usually involves special functions and is in general not very informative; various approximations have to be found. In appendix C, we present a rigorous, analytical approach for calculating the mean firing time, which can be applied to any one-dimensional neuron model; examples are the IF model, the θ-neuron (Ermentrout, 1996), and the IF–FitzHugh-Nagumo model, although here we confine ourselves to the IF model. Applying theorem 2 in appendix C to the IF model, we obtain

$$
\langle T(1)\rangle = \frac{2}{\sigma^2}\int_{V_{rest}}^{V_{thre}} \exp\!\left(\frac{(y-\gamma\mu)^2}{\sigma^2\gamma}\right) \left[\int_{-\infty}^{y} \exp\!\left(-\frac{(x-\gamma\mu)^2}{\sigma^2\gamma}\right) dx\right] dy.
\tag{3.6}
$$
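Equation 3.6 is straightforward to evaluate by elementary quadrature. The sketch below (our illustration; the parameter values are arbitrary) approximates the double integral with midpoint Riemann sums, truncating the inner integral at an effective lower limit, and reproduces the limiting behavior discussed in the text: when V_thre > γμ, the mean firing time grows as σ decreases.

```python
import math

def mean_fpt(mu, sigma, gamma, v_rest, v_thre, n=400):
    """Midpoint Riemann-sum evaluation of equation 3.6,
    with phi(x) = (x - gamma*mu)^2 / (sigma^2 * gamma)."""
    phi = lambda x: (x - gamma * mu) ** 2 / (sigma ** 2 * gamma)
    lo = gamma * mu - 10.0 * sigma * math.sqrt(gamma)  # effective -infinity
    total = 0.0
    hy = (v_thre - v_rest) / n
    for i in range(n):
        y = v_rest + (i + 0.5) * hy
        hx = (y - lo) / n
        inner = sum(math.exp(-phi(lo + (j + 0.5) * hx)) for j in range(n)) * hx
        total += math.exp(phi(y)) * inner * hy
    return 2.0 / sigma ** 2 * total

# V_thre = 20 > gamma*mu = 10.1: smaller noise means a longer mean firing time
t1 = mean_fpt(mu=0.5, sigma=2.0, gamma=20.2, v_rest=0.0, v_thre=20.0)
t2 = mean_fpt(mu=0.5, sigma=3.0, gamma=20.2, v_rest=0.0, v_thre=20.0)
print(t1, t2)
```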
Note that it is easier numerically to calculate the mean firing time using
equation 3.6 than using the method in Tuckwell (1988). From equation 3.6, we see that when V_thre < γμ, ⟨T(1)⟩ < ∞ as σ → 0; when V_thre > γμ, ⟨T(1)⟩ → ∞ as σ → 0. In terms of equation 3.6, we could further calculate, for example, the Fisher information; we will discuss this in subsequent publications.

Now we turn to the case p = 0. Solving the equation for the dendritic compartment, we obtain

$$
V_d(t) - \gamma\mu\,(1 - e^{-t/\gamma}) = V_d(0)\,e^{-t/\gamma} + \sigma \int_0^t e^{-(t-s)/\gamma}\,dB_s.
$$

Its covariance function is given by

$$
r(t) = \frac{\sigma^2\gamma}{2}\, e^{-|t|/\gamma}, \qquad t \in \mathbf{R}.
$$

The burst frequency is identical to the rate of upcrossings of V_thre by the stationary process V_d. By a direct application of Rice's formula (see Leadbetter, Lindgren, & Rootzén, 1983), we obtain the following conclusion. The bursting frequency is given by

$$
\frac{1}{2\pi}\sqrt{\frac{|r''(0)|}{r(0)}}\, \exp\!\left(-\frac{(V_{thre}-\gamma\mu)^2}{2r(0)}\right) = \frac{1}{2\pi\gamma}\, \exp\!\left(-\frac{(V_{thre}-\gamma\mu)^2}{\sigma^2\gamma}\right).
$$

Furthermore, using large deviation theory (Albeverio, Feng, & Qian, 1995; Feng, in press), we can estimate the mean interburst interval T_i when V_thre > γμ and the mean burst length T_l when V_thre < γμ. The results obtained are approximations, so we do not present them here; a rigorous and exact estimate would be of interest. Finally, we emphasize that in practice neither p = 0 nor p = 1 occurs exactly; all of our conclusions above hold when p is sufficiently close to zero or sufficiently close to one. As we have demonstrated, the decoupling approach reduces the problem of studying a complicated dynamical system of two coupled variables to a much simpler dynamical system of a single variable, so that all results developed in the literature for single-compartment models can be applied. Moreover, the approach opens up many new theoretical problems for further study. In the next section, we consider models and behavior between these two extreme cases.

4 Simulation Studies
In this section, for both the IF model and the Pinsky-Rinzel model, we show that cells falling in between the two limiting cases (large soma or large dendrite) form a continuum with respect to their firing properties (mean and CV); that is, for the CV and the mean firing time, the gap between p = 0 and p = 1 is filled in a monotone manner with respect to p.
4.1 Parameters and Computation Methods. We simulate the two-compartment IF model and the Pinsky-Rinzel model with the following synaptic parameters: a = b = 0.5 mV, λ_E = 100 Hz, λ_I = 0, 10, ..., 100 Hz, and q_E = q_I = 100. The threshold for detecting a spike for both the soma and the dendrite of the Pinsky-Rinzel model is 30 mV, and g_c = 2.1. In the two-compartment IF model, we let V_rest = 0, V_thre = 20 mV, γ = 20.2, and g_c = 4. All numerical simulations for the IF model are carried out with a step size of 0.01, the Euler scheme, and zero initial values for both V_s and V_d. For the Pinsky-Rinzel model, an algorithm for solving stiff equations from the Numerical Algorithms Group (NAG) library (D02NBF) is used with a step size of 0.01; the initial values for V_s, V_d, Ca, n, h, s, c, q are −4.6, −4.5, 0.2, 0.001, 0.999, 0.009, 0.007, 0.01, respectively. The normal random numbers used in the simulation are generated by the NAG subroutine G05DDF. Smaller step sizes were also tried; no significant differences were observed.

4.2 Numerical Results. For the two-compartment IF model, shown in Figure 3, we see that the mean firing time and the CV are monotone functions of p. Therefore, the gap between p = 0 and p = 1 described in the previous section is filled in a monotone manner with respect to p. The mean firing time for p = 0.5 does not attain the maximum (see lemma 1 in appendix A). We also note that when p = 0.1, efferent spike trains are very irregular, even when inputs are exclusively excitatory. As we pointed out in the previous section, when p is small, the CV of efferent spike trains is quite high. For the Pinsky-Rinzel model, behavior similar to that of the IF model is observed: the gap between p = 0 and p = 1 is filled in a monotone way with respect to p (see Figure 4). As we adjust the geometrical parameter p, the model exhibits a variety of behaviors. In Figures 3 and 4, we also plot the mean firing time and the CV versus p for 0 < p < 1.
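The Euler scheme for the IF model can be sketched in a few lines. The following Python version is our illustration (the paper's simulations used NAG library routines); it simulates the diffusion approximation of equation 2.2 with reset of the soma only. With the parameters above and r = λ_I/λ_E = 0.5 (so that μ = 2.5 mV/ms and σ ≈ 1.94), it exhibits the faster firing of small-soma cells.

```python
import math
import random

def simulate_if(p, mu, sigma, gamma=20.2, gc=4.0, v_rest=0.0, v_thre=20.0,
                dt=0.01, n_spikes=300, seed=11):
    """Euler-Maruyama simulation of the two-compartment IF model (eq. 2.2);
    only the soma is reset after a spike. Returns (mean ISI, CV)."""
    rng = random.Random(seed)
    vs = vd = v_rest
    t, last, isis = 0.0, 0.0, []
    while len(isis) < n_spikes:
        db = rng.gauss(0.0, math.sqrt(dt))
        dvs = (-(vs - v_rest) / gamma + gc * (vd - vs) / p) * dt
        dvd = (-(vd - v_rest) / gamma + gc * (vs - vd) / (1 - p)) * dt \
              + (mu * dt + sigma * db) / (1 - p)
        vs, vd = vs + dvs, vd + dvd
        t += dt
        if vs >= v_thre:
            isis.append(t - last)
            last, vs = t, v_rest
    mean = sum(isis) / len(isis)
    cv = math.sqrt(sum((x - mean) ** 2 for x in isis) / len(isis)) / mean
    return mean, cv

small = simulate_if(p=0.1, mu=2.5, sigma=1.94)  # small soma
large = simulate_if(p=0.8, mu=2.5, sigma=1.94)  # large soma
print(small, large)
```

The reset removes a fraction p·V_thre of the total charge per spike, which is why the small-soma cell fires much faster here.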
From the results in the previous section, we know that when p = 0 and p = 1, the two-compartment models are reduced to single-compartment models, whose behavior is well studied in the literature (Feng, in press). In conclusion, we have obtained a complete picture of the behavior of these two-compartment models: both the mean firing time and the CV are monotone functions of p, where p is the ratio of the somatic compartment area with respect to the whole cell.

5 Discussion
We have considered the behavior of two-compartment models as a first step toward understanding the impact of the geometrical structure of neurons on their input-output relationship. We find that when the somatic compartment is large, a two-compartment model can be reduced to a single
Figure 3: Mean firing time and CV versus λ_I for p = 0.1, 0.3, 0.5, 0.8 (upper panel) and mean firing time and CV versus p for r = λ_I/λ_E = 0.2, 0.5, 0.8 (bottom panel) for the two-compartment IF model. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text.
Figure 4: Mean firing time and CV versus λ_I with p = 0.3, 0.5, 0.8 (upper panel) and mean firing time and CV versus p with r = λ_I/λ_E = 0.2, 0.5, 0.8 (bottom panel) for the Pinsky-Rinzel model. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text.
compartment model. When the somatic compartment is small, the model exhibits bursting behavior, and its behavior is determined by the activity of the dendritic compartment alone. Therefore, in each case, the original coupled two-compartment model is reduced to a single-compartment model, which is why we call the method the decoupling approach. By a combination of theoretical analysis and numerical simulations, we conclude that both the first- and second-order statistics (mean and CV) of the interspike intervals are monotone functions of the somatic size. We also present theoretical results for calculating the mean interspike intervals and burst frequencies.

We briefly discuss the biological implications of changing p, which we call global plasticity. Subcellular morphological changes have been reported recently in a few types of neurons (Andersen, 1999; Engert & Bonhoeffer, 1999). The most striking and direct example of global plasticity comes from vasopressin and oxytocin cells in the supraoptic and paraventricular nuclei. For both types of neurons, it is found that p changes during lactation (Stern & Armstrong, 1998). For example, in oxytocin cells, a shrinkage of the dendritic tree and an enlargement of the soma are observed, accompanied by a considerable change in the electrical properties of the neuron. In the classical approach of learning theory, the Hebb learning rule plays an essential role, and many models have been developed based on it (Feng, in press). The Hebb learning rule is a local plasticity principle whereby increased presynaptic and postsynaptic activity induces an increase in the strength of the synapse. Due to the large number of synapses on each neuron, the functional consequence of modifying the strength of a single synapse is bound to be limited. In contrast, a global change of neuronal morphology¹ would be a much more efficient way to modify the input-output relationship.
As we demonstrate here, theoretically and numerically, an increase or a decrease in p alone causes a neuron to behave completely differently: both the firing rate and the firing pattern (CV) vary. Furthermore, we have not taken into account the effect of input modifications due to the pruning or elaboration of dendritic trees, as reported in experiments (Stern & Armstrong, 1998), which would modify the input-output relationship even more dramatically. Several challenging problems remain. We must determine whether global plasticity can be observed experimentally in other neurons and, if so, for which stimuli it occurs. We may also be able to apply global plasticity to the design of artificial neural networks that achieve more powerful and efficient learning. It would also be interesting to consider the effect of changing p on the synchronization properties of a group of neurons. We want to emphasize that here we focus exclusively on the impact of a minimal change of geometrical structure on the input-output relationship of a neuronal model. There are many other factors that might interfere with
1 We do not exclude the possibility that modifying a cell's biophysical properties can produce effects similar to those of morphological changes.
the global plasticity. For example, a neuromodulator such as acetylcholine might play a role similar to that of changing p, but on a shorter timescale than changes of geometrical structure. Finally, we examine the scaling of synaptic inputs in the models considered in this article. All synaptic inputs are scaled by a factor of 1/(1 − p) (see equations 2.1 and 2.2). Therefore, when p → 1, according to the results in section 3, the two-compartment model is reduced to a single-compartment model with input I_syn. The situation is totally different if the synaptic input is not scaled. By this, we mean that the two-compartment IF and Pinsky-Rinzel models are
$$
\begin{cases}
dV_s(t) = -\dfrac{1}{\gamma}\,(V_s(t)-V_{rest})\,dt + g_c\,\dfrac{V_d(t)-V_s(t)}{p}\,dt \\[6pt]
dV_d(t) = -\dfrac{1}{\gamma}\,(V_d(t)-V_{rest})\,dt + g_c\,\dfrac{V_s(t)-V_d(t)}{1-p}\,dt + dI_{Syn}(t)
\end{cases}
\tag{5.1}
$$

and

$$
\begin{cases}
C_m\,dV_s(t) = -I_{Leak}(V_s)\,dt - I_{Na}(V_s,h)\,dt - I_{K\text{-}DR}(V_s,n)\,dt + g_c\,\dfrac{V_d(t)-V_s(t)}{p}\,dt \\[6pt]
C_m\,dV_d(t) = -I_{Leak}(V_d)\,dt - I_{Ca}(V_d,s)\,dt - I_{K\text{-}AHP}(V_d,q)\,dt - I_{K\text{-}C}(V_d,Ca,c)\,dt \\[2pt]
\qquad\qquad\quad + g_c\,\dfrac{V_s(t)-V_d(t)}{1-p}\,dt + dI_{Syn}(t) \\[6pt]
Ca' = -0.002\,I_{Ca} - 0.0125\,Ca.
\end{cases}
\tag{5.2}
$$
By applying the decoupling method of section 3 to a two-compartment model without scaled synaptic inputs, we conclude that when p → 1, the two-compartment model is again reduced to a single-compartment model, but with an input identically equal to zero; in other words, the model is completely silent. When p → 0, the conclusion is the same as in section 3. In Figure 5, we plot numerical simulations for both the two-compartment IF model and the Pinsky-Rinzel model without scaled synaptic inputs. We see that, qualitatively, all conclusions in this article remain true. In general, both the IF and the Pinsky-Rinzel models without scaled synaptic inputs fire more slowly than the models with scaled synaptic inputs. This is easily understandable, since the synaptic input I_Syn(t)/(1 − p) (with scaling) is stronger than I_Syn(t) (without scaling).
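The effect of the scaling can be illustrated directly. The sketch below is our code with arbitrary parameters, not the authors' simulation: it counts spikes of the two-compartment IF model over a fixed window with and without the 1/(1 − p) scaling at p = 0.8. With the unscaled input, the effective drive (roughly γμ(1 − p)) stays below threshold and the model is nearly silent.

```python
import math
import random

def count_spikes(scaled, p=0.8, mu=2.5, sigma=1.94, gamma=20.2, gc=4.0,
                 v_thre=20.0, dt=0.01, duration=2000.0, seed=3):
    """Spike count of the two-compartment IF model over `duration` ms, with
    the synaptic input either scaled by 1/(1-p) (eq. 2.2) or not (eq. 5.1)."""
    rng = random.Random(seed)
    vs = vd = 0.0
    n = 0
    scale = 1.0 / (1.0 - p) if scaled else 1.0
    for _ in range(int(duration / dt)):
        db = rng.gauss(0.0, math.sqrt(dt))
        vs += (-vs / gamma + gc * (vd - vs) / p) * dt
        vd += (-vd / gamma + gc * (vs - vd) / (1 - p)) * dt \
              + scale * (mu * dt + sigma * db)
        if vs >= v_thre:  # spike and somatic reset
            n += 1
            vs = 0.0
    return n

n_scaled = count_spikes(scaled=True)
n_unscaled = count_spikes(scaled=False)
print(n_scaled, n_unscaled)
```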
Figure 5: Mean firing time and CV versus λ_I for p = 0.3, 0.5, 0.8 for the Pinsky-Rinzel model (upper panel) and mean firing time and CV versus λ_I for p = 0.3, 0.5, 0.8 (bottom panel) for the two-compartment IF model. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text.
Appendix A
Assuming V_s|_{t=0} = V_d|_{t=0} = 0, we solve the linear equation for the two-compartment IF model:

V_s - V_d = -\frac{\mu c p}{p(1-p)+c}\Bigl(1 - e^{-(\frac{1}{c}+\frac{1}{p(1-p)})t}\Bigr) - \frac{\sigma}{1-p}\int_0^t e^{-(\frac{1}{c}+\frac{1}{p(1-p)})(t-x)}\,dB_x

V_s + V_d = \frac{c\mu p + 2c^2\mu}{p(1-p)+c} - 2\mu c\,e^{-\frac{t}{c}} + \frac{p(1-2p)c\mu}{p(1-p)+c}\,e^{-(\frac{1}{c}+\frac{1}{p(1-p)})t} + \frac{(1-2p)\sigma}{p(1-p)^2}\int_0^t e^{-\frac{t-x}{c}}\int_0^x e^{-(\frac{1}{c}+\frac{1}{p(1-p)})(x-y)}\,dB_y\,dx + \frac{\sigma}{1-p}\int_0^t e^{-\frac{t-x}{c}}\,dB_x.
(A.1)
Therefore,

V_s = \frac{\mu c^2}{p(1-p)+c} - \mu c\,e^{-\frac{t}{c}} + \frac{\mu c\,p(1-p)}{p(1-p)+c}\,e^{-\frac{t}{c}-\frac{t}{p(1-p)}} + \frac{(1-2p)\sigma}{2p(1-p)^2}\int_0^t e^{-\frac{t-x}{c}}\int_0^x e^{-(\frac{1}{c}+\frac{1}{p(1-p)})(x-y)}\,dB_y\,dx + \frac{\sigma}{2(1-p)}\int_0^t e^{-\frac{t-x}{c}}\Bigl(1 - e^{-\frac{t-x}{p(1-p)}}\Bigr)\,dB_x

V_d = \frac{\mu c p + \mu c^2}{p(1-p)+c} - \mu c\,e^{-\frac{t}{c}} - \frac{\mu c\,p^2}{p(1-p)+c}\,e^{-\frac{t}{c}-\frac{t}{p(1-p)}} + \frac{(1-2p)\sigma}{2p(1-p)^2}\int_0^t e^{-\frac{t-x}{c}}\int_0^x e^{-(\frac{1}{c}+\frac{1}{p(1-p)})(x-y)}\,dB_y\,dx + \frac{\sigma}{2(1-p)}\int_0^t e^{-\frac{t-x}{c}}\Bigl(1 + e^{-\frac{t-x}{p(1-p)}}\Bigr)\,dB_x.
(A.2)
By applying the martingale property of the integral \int f\,dB_t for any measurable function f, we have the following identities:

Lemma 1.

\langle V_s\rangle = \frac{\mu c^2}{p(1-p)+c} - \mu c\,e^{-\frac{t}{c}} + \frac{\mu c\,p(1-p)}{p(1-p)+c}\,e^{-\frac{t}{c}-\frac{t}{p(1-p)}}

\langle V_d\rangle = \frac{\mu c p + \mu c^2}{p(1-p)+c} - \mu c\,e^{-\frac{t}{c}} - \frac{\mu c\,p^2}{p(1-p)+c}\,e^{-\frac{t}{c}-\frac{t}{p(1-p)}}.
(A.3)
and

\langle(V_s - \langle V_s\rangle)^2\rangle = \frac{\sigma^2 c^3}{2[p(1-p)+c][2p(1-p)+c]} - \frac{\sigma^2 c}{2}\,e^{-\frac{2t}{c}} + \frac{2\sigma^2 c\,p(1-p)}{2p(1-p)+c}\,e^{-\frac{2t}{c}-\frac{t}{p(1-p)}} - \frac{\sigma^2 c\,p(1-p)}{2[p(1-p)+c]}\,e^{-\frac{2t}{c}-\frac{2t}{p(1-p)}}

\langle(V_d - \langle V_d\rangle)^2\rangle = \frac{4\sigma^2 c^3(1-p)^2 + 4\sigma^2 c^2 p(1-p)(3-2p) + 8\sigma^2 c\,p^2(1-p)^3(1+p)}{8(1-p)^2[p(1-p)+c][2p(1-p)+c]} - \frac{\sigma^2 c}{2}\,e^{-\frac{2t}{c}} - \frac{2\sigma^2 c\,p^2}{2p(1-p)+c}\,e^{-\frac{2t}{c}-\frac{t}{p(1-p)}} - \frac{\sigma^2 c\,p^3}{2(1-p)[p(1-p)+c]}\,e^{-\frac{2t}{c}-\frac{2t}{p(1-p)}}.
(A.4)

Appendix B
Denote by V_s(t, p) and V_d(t, p) the solution of equation 3.1. We consider only the case p → 0. For any t > 0, the space C[0, t] of all continuous functions on the time interval [0, t] is compact, with the metric ‖x‖ = max_{[0,t]} |x(t)|. Hence, it suffices to prove that for any subsequence V_s(t, p_n) and V_d(t, p_n) with lim_{n→∞} p_n → 0, its limit is unique. Multiplying both sides of equation 3.1 for the somatic compartment by p_n and taking p_n → 0, we see that lim_{n→∞} V_d(t, p_n) = lim_{n→∞} V_s(t, p_n), provided that V_d and V_s are bounded. Now the equation for the dendritic compartment becomes

dV_d = g(V_s, V_d)\,dt + dI_{syn}(t) = g(V_d, V_d)\,dt + dI_{syn}(t),
(B.1)
and its solution is unique under some reasonable conditions on g and I_syn. Hence, for any sequence p_n with p_n → 0, the limit lim_{n→∞} V_d(t, p_n) = lim_{n→∞} V_s(t, p_n) is unique, satisfying equation B.1. This completes our proof.

Appendix C
We first consider a diffusion process defined by dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dB_t. Let us introduce the following quantities:

s(x) = \exp\left(-\int_0^x \frac{2\mu(y)}{\sigma^2(y)}\,dy\right)
(C.1)

m(x) = \frac{1}{\sigma^2(x)\,s(x)} = \frac{\exp\left(\int_0^x \frac{2\mu(z)}{\sigma^2(z)}\,dz\right)}{\sigma^2(x)},
(C.2)
where m is the speed density and s is the scale function. We say that a diffusion process is positive recurrent if \int_{-\infty}^{\infty} m(x)\,dx < \infty, which is equivalent to \langle T\rangle < \infty, where T is the first exit time through V_{thre}. For a positive-recurrent process, the stationary distribution density is given by p(x) \propto m(x).

Theorem 2. For a positive-recurrent diffusion process X_t,

\langle T\rangle = 2\int_{V_{rest}}^{V_{thre}} s(u)\,du \cdot \int_{-\infty}^{V_{rest}} m(u)\,du + 2\int_{V_{rest}}^{V_{thre}} \left(\int_{y}^{V_{thre}} s(u)\,du\right) m(y)\,dy.
(C.3)
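As a numerical sanity check of equation C.3 (a sketch, not part of the original analysis), the formula can be evaluated by quadrature for an Ornstein-Uhlenbeck diffusion dX = -X dt + σ dB, a standard diffusion approximation in this setting. For μ(x) = -x the scale density is s(x) = e^{x²/σ²} and the speed density is m(x) = e^{-x²/σ²}/σ². The resting and threshold values below are invented.

```python
import numpy as np

def trap(y, x):
    """Trapezoidal quadrature of samples y on grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def mean_fpt_ou(v_rest=0.0, v_thre=1.0, sigma=1.0, lower=-8.0, n=20001):
    """Mean first-passage time of dX = -X dt + sigma dB from v_rest to
    v_thre, evaluated via theorem 2 / equation C.3."""
    s = lambda x: np.exp(x ** 2 / sigma ** 2)          # scale density
    m = lambda x: np.exp(-x ** 2 / sigma ** 2) / sigma ** 2  # speed density
    u = np.linspace(v_rest, v_thre, n)
    w = np.linspace(lower, v_rest, n)
    term1 = 2.0 * trap(s(u), u) * trap(m(w), w)
    # S_tail[i] = int_{u[i]}^{v_thre} s, via a reversed cumulative trapezoid
    su = s(u)
    cum = np.concatenate(([0.0],
                          np.cumsum(0.5 * (su[1:] + su[:-1]) * np.diff(u))))
    s_tail = cum[-1] - cum
    term2 = 2.0 * trap(s_tail * m(u), u)
    return term1 + term2
```

For the default values the quadrature agrees with a direct Euler-Maruyama estimate of the first-passage time, which is a useful consistency check on the sign conventions in C.1 through C.3.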
For a ∈ ℝ, let G_{a,V_{thre}}(V_{rest}, y)\,\Delta y be the mean time spent in (y, y + \Delta y) by the process X_t started at V_{rest} and run until it exits (a, V_{thre}). The expression for G_{a,V_{thre}}(V_{rest}, y) is given in Karlin and Taylor (1982, p. 198). Therefore, we have

\langle T\rangle = \lim_{a\to-\infty} \int_a^{V_{thre}} G_{a,V_{thre}}(V_{rest}, y)\,dy,

and theorem 2 follows.

Acknowledgments
We are grateful to the anonymous referees for their valuable comments on an earlier version of this article. The work was partially supported by BBSRC and an ESEP grant of the Royal Society.

References

Albeverio, S., Feng, J., & Qian, M. (1995). Role of noises in neural networks. Phys. Rev. E, 52, 6593–6606.
Andersen, P. (1999). A spine to remember. Nature, 399, 19–21.
Bernander, O., Douglas, R. J., Martin, K. A. C., & Koch, C. (1992). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. PNAS, 88, 11569–11573.
Brown, D., Feng, J., & Feerick, S. (1999). Variability of firing of Hodgkin-Huxley and FitzHugh-Nagumo neurons with stochastic synaptic input. Phys. Rev. Lett., 82, 4731–4734.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995). Stochastic resonance without tuning. Nature, 376, 236–238.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Destexhe, A., & Pare, D. (2000). A combined computational and intracellular study of correlated synaptic bombardment in neocortical pyramidal neurons in vivo. Neurocomputing, 32, 113–119.
Engert, F., & Bonhoeffer, T. (1999). Dendritic spine changes associated with hippocampal long-term synaptic plasticity. Nature, 399, 66–70.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.
Feng, J. (1997). Behaviours of spike output jitter in the integrate-and-fire model. Phys. Rev. Lett., 79, 4505–4508.
Feng, J. (in press). Is the integrate-and-fire model good enough?—A review. Neural Networks.
Feng, J., & Li, G. B. (2001). Two compartment model as slope detector. Manuscript in preparation.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–783.
Karlin, S., & Taylor, H. M. (1982). A second course in stochastic processes. New York: Academic Press.
Koch, C. (1999). Biophysics of computation. Oxford: Oxford University Press.
Leadbetter, M. R., Lindgren, G., & Rootzén, H. (1983). Extremes and related properties of random sequences and processes. New York: Springer-Verlag.
Mainen, Z. F., & Sejnowski, T. J. (1996). Influence of dendritic structure on firing pattern in model neocortical neurons. Nature, 382, 363–366.
Musila, M., & Lánský, P. (1994). On the interspike intervals calculated from diffusion approximations for Stein's neuronal model with reversal potentials. J. Theor. Biol., 171, 225–232.
Pinsky, P. F., & Rinzel, J. (1994). Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. J. Computational Neuroscience, 1, 39–60.
Ricciardi, L. M., & Sato, S. (1990). Diffusion processes and first-passage-time problems. In L. M. Ricciardi (Ed.), Lectures in applied mathematics and informatics. Manchester: Manchester University Press.
Sejnowski, T. J. (1998). Making smooth moves. Nature, 394, 725–726.
Schneidman, E., Freedman, B., & Segev, I. (1998). Ion channel stochasticity may be critical in determining the reliability and precision of spike timing. Neural Computation, 10, 1679–1703.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Stern, J. E., & Armstrong, W. E. (1998). Reorganization of the dendritic trees of oxytocin and vasopressin neurons of the rat supraoptic nucleus during lactation. J. Neuroscience, 18, 841–853.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983.
Tuckwell, H. C. (1988). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics. Wang, X. J. (1998). Calcium coding and adaptive temporal computation in cortical pyramidal neurons. J. Neurophysiol., 79, 1549–1566.
Received September 18, 2000; accepted May 22, 2001.
LETTER
Communicated by Chris Williams
Sparse On-Line Gaussian Processes

Lehel Csató
[email protected]
Manfred Opper
[email protected]
Neural Computing Research Group, Department of Information Engineering, Aston University, B4 7ET Birmingham, U.K.

We develop an approach for sparse representations of gaussian process (GP) models (which are Bayesian types of kernel machines) in order to overcome their limitations for large data sets. The method is based on a combination of a Bayesian on-line algorithm, together with a sequential construction of a relevant subsample of the data that fully specifies the prediction of the GP model. By using an appealing parameterization and projection techniques in a reproducing kernel Hilbert space, recursions for the effective parameters and a sparse gaussian approximation of the posterior process are obtained. This allows for both a propagation of predictions and Bayesian error measures. The significance and robustness of our approach are demonstrated on a variety of experiments.
1 Introduction
Gaussian processes (GP) (Bernardo & Smith, 1994; Williams & Rasmussen, 1996) provide promising Bayesian tools for modeling real-world statistical problems. Like other kernel-based methods, such as support vector machines (SVMs) (Vapnik, 1995), they combine a high flexibility of the model, by working in often infinite-dimensional feature spaces, with the simplicity that all operations are "kernelized"—performed in the lower-dimensional input space using positive definite kernels. An important advantage of GPs over other non-Bayesian models is the explicit probabilistic formulation of the model. This allows the modeler to assess the uncertainty of the predictions by providing Bayesian confidence intervals (for regression) or posterior class probabilities (for classification). It also opens the possibility of treating a variety of other nonstandard data models (e.g., quantum inverse statistics, Lemm, Uhlig, & Weiguny, 2000; wind-fields, Evans, Cornford, & Nabney, 2000; Berliner, Wikle, & Cressie, 2000) using a kernel method.

Neural Computation 14, 641–668 (2002)
© 2002 Massachusetts Institute of Technology
Lehel Csató and Manfred Opper
GPs are nonparametric in the sense that the "parameters" to be learned are functions f_x of a usually continuous input variable x ∈ ℝ^d. The value f_x is used as a latent variable in a likelihood P(y | f_x, x) that denotes the probability of an observable output variable y given the input x. The a priori assumption on the statistics of f is that of a gaussian process: any finite collection of random variables f_i is jointly gaussian. Hence, one must specify the prior means and the prior covariance function of the variables f_x. The latter is called the kernel K_0(x, x') = Cov(y, y') (Vapnik, 1995; Kimeldorf & Wahba, 1971). Thus, if a zero-mean GP is assumed, the kernel K_0 fully specifies the entire prior information about the model. Based on a set of input-output observations (x_n, y_n) (n = 1, ..., N), the Bayesian approach computes the posterior distribution of the process f_x using the prior and the likelihood (Williams, 1999; Williams & Rasmussen, 1996; Gibbs & MacKay, 1999). A straightforward application of this simple appealing idea is impeded by two major obstacles: nongaussianity of the posterior process and the size of the kernel matrix K_0(x_i, x_j). A first obvious problem stems from the fact that the posterior process is usually nongaussian (except when the likelihood itself is gaussian in the f_x). Hence, in many important cases, its analytical form precludes an exact evaluation of the multidimensional integrals that occur in posterior averages. Nevertheless, various methods have been introduced to approximate these averages. A variety of such methods may be understood as approximations of the nongaussian posterior process by a gaussian one (Jaakkola & Haussler, 1999; Seeger, 2000); for instance, in Williams and Barber (1998), the posterior mean is replaced by the posterior maximum (MAP), and information about the fluctuations is derived by a quadratic expansion around this maximum.
The computation of these approximations, which become exact for regression with gaussian noise, requires the solution of a system of coupled nonlinear equations of size equal to the number of data points. The second obstacle that prevents GPs from being applied to large data sets is that the matrix that couples these equations is typically not sparse. Hence, the development of good sparse approximations is of major importance. Such approximations aim at performing the most time-consuming matrix operations (inversions or diagonalizations) only on a representative subset of the training data. In this way, the computational time is reduced from O(N³) to O(Np²), where N is the total size of the training data and p is the size of the representative set. The memory requirement is O(p²) as opposed to O(N²). So far, a variety of sparsity techniques (Smola & Schölkopf, 2000; Williams & Seeger, 2001) for batch training of GPs have been proposed. This article presents a new approach that combines the idea of a sparse representation with an on-line algorithm that allows for a speedup of the GP training by sweeping through a data set only once. A different sparse approximation, which also allows for an on-line processing, was recently introduced by Tresp (2000). It is based on combining predictions of models trained on smaller data subsets and needs an additional query set of inputs.
Central to our approach are exact expressions for the posterior means ⟨f_x⟩_t and the posterior covariance K_t(x, x') (subscripts denote the number of data points), which are derived in section 2. Although both quantities are continuous functions, they can be represented as finite linear (or, respectively, bilinear) combinations of kernels K_0(x, x_i) evaluated at the training inputs x_i (Csató, Fokoué, Opper, Schottky, & Winther, 2000). Using sequential projections of the posterior process onto the manifold of gaussian processes, we obtain approximate recursions for the effective parameters of these representations. Since the size of the representations grows with the number of training data, we use a second type of projection to extract a smaller subset of input data (reminiscent of the support vectors of Vapnik, 1995, or relevance vectors of Tipping, 2000). This subset builds up a sparse representation of the posterior process on which all predictions of the trained GP model rely. Our approach is related to that introduced in Wahba (1990). Although we use the same measure for projection, we do not fix the set of basis vectors from the beginning but decide on-line which inputs to keep.

2 On-Line Learning with Gaussian Processes
In Bayesian learning, all information about the parameters that we wish to infer is encoded in probability distributions (Bernardo & Smith, 1994). In the GP framework, the parameters are functions, and the GP priors specify a gaussian distribution over a function space. The posterior process is entirely specified by all its finite-dimensional marginals. Hence, let \mathbf{f} = \{f(x_1), \ldots, f(x_M)\} be a set of function values such that \mathbf{f}_D \subseteq \mathbf{f}, where \mathbf{f}_D is the set of f(x_i) = f_i with x_i in the observed set of inputs. We compute the posterior distribution using the data likelihood together with the prior p_0(\mathbf{f}) as

p_{post}(\mathbf{f}) = \frac{P(D \mid \mathbf{f})\,p_0(\mathbf{f})}{\langle P(D \mid \mathbf{f}_D)\rangle_0},
(2.1)
where ⟨P(D | \mathbf{f}_D)⟩_0 is the average of the likelihood with respect to the prior GP (the GP at time 0). This form of the posterior distribution can be used to express posterior expectations as typically high-dimensional integrals. For prediction, one is especially interested in expectations of functions of the process at inputs that are not contained in the training set. At first glance, one might assume that every prediction on a novel input would require the computation of a new integral. Even if we had good methods for approximate integration, this would make predictions at new inputs a rather tedious task. Luckily, the following lemma shows that simple but important predictive quantities, like the posterior mean and the posterior covariance of the process at arbitrary inputs, can be expressed as a combination of a finite set of parameters that depend on the training data only. For arbitrary likelihoods, we can show:
Lemma 1 (parameterization). The result of the Bayesian update, equation 2.1, using a GP prior with mean function \langle f_x\rangle_0 and kernel K_0(x, x') and data D = \{(x_n, y_n) \mid n = 1, \ldots, N\}, is a process with mean and kernel functions given by

\langle f_x\rangle_{post} = \langle f_x\rangle_0 + \sum_{i=1}^{N} K_0(x, x_i)\,q(i)

K_{post}(x, x') = K_0(x, x') + \sum_{i,j=1}^{N} K_0(x, x_i)\,R(ij)\,K_0(x_j, x').
(2.2)

The parameters q(i) and R(ij) are given by

q(i) = \frac{1}{Z}\int d\mathbf{f}\,p_0(\mathbf{f})\,\frac{\partial P(D \mid \mathbf{f})}{\partial f(x_i)} \qquad R(ij) = \frac{1}{Z}\int d\mathbf{f}\,p_0(\mathbf{f})\,\frac{\partial^2 P(D \mid \mathbf{f})}{\partial f(x_i)\,\partial f(x_j)} - q(i)\,q(j),
(2.3)

where \mathbf{f} = [f(x_1), \ldots, f(x_N)]^T and Z = \int d\mathbf{f}\,p_0(\mathbf{f})\,P(D \mid \mathbf{f}) is a normalizing constant.
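For the special case of regression with gaussian noise, the integrals in equation 2.3 are gaussian, and q and R are available in closed form: q = (K + σ²I)⁻¹y and R = −(K + σ²I)⁻¹. The following sketch (invented toy data, not the authors' code) checks the parameterization of the lemma against direct conditioning of the joint gaussian:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix K0(a_i, b_j)."""
    d = np.asarray(a, float)[:, None] - np.asarray(b, float)[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
X = np.array([-1.5, -0.4, 0.3, 1.1])          # invented training inputs
y = np.sin(X) + 0.1 * rng.normal(size=X.size)
noise = 0.1 ** 2

K = rbf(X, X)
A = np.linalg.inv(K + noise * np.eye(X.size))
q = A @ y            # q(i) of lemma 1, gaussian likelihood
R = -A               # R(ij) of lemma 1, gaussian likelihood

xs = np.array([0.0, 0.7])                     # test inputs
kx = rbf(X, xs)                               # K0(x_i, x*), shape (N, M)
mean_lemma = kx.T @ q                         # eq. 2.2, zero prior mean
cov_lemma = rbf(xs, xs) + kx.T @ R @ kx       # eq. 2.2, kernel part

# reference: condition the joint gaussian of [f(X), f(xs)] on y directly
B = K + noise * np.eye(X.size)
mean_ref = kx.T @ np.linalg.solve(B, y)
cov_ref = rbf(xs, xs) - kx.T @ np.linalg.solve(B, kx)
```

The two routes give identical predictive means and covariances, which is exactly what the lemma asserts for this tractable likelihood.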
The parameters q(i) and R(ij) have to be computed only once during the training of the model and are fixed when we make predictions. The parametric form of the posterior mean (assuming a zero mean for the prior) resembles the representations for the predictors in other kernel approaches (such as support vector machines) that are obtained by minimizing certain cost functions. While the latter representations are derived from the celebrated representer theorem of Kimeldorf and Wahba (1971), our result, equation 2.2, does, to our best knowledge, not follow from this but is derived from simple properties of gaussian distributions. To keep focused on the main flow, we defer the proof to appendix B. Making immediate use of this representation is usually not possible, because the posterior process is in general not gaussian and the integrals cannot be computed exactly. Hence, we need approximations in order to keep the inference tractable (Csató et al., 2000). One popular method is to approximate the posterior by a gaussian process (Williams & Barber, 1998; Seeger, 2000). This may be formulated within a variational approach, where a certain dissimilarity measure between the true and the approximate distribution is minimized. The most popular choice is the Kullback-Leibler divergence between distributions, defined as

KL(p \| q) = \int d\theta\,p(\theta)\,\ln\frac{p(\theta)}{q(\theta)},
(2.4)

where θ denotes the set of arguments of the densities. If p̂ denotes the approximating gaussian distribution, one usually tries to minimize KL(p̂ ‖ p_post)
Figure 1: Visualization of the on-line approximation of the intractable posterior process. The resulting approximate process from the previous iteration is used as the prior for the next one.
with respect to p̂, which, in contrast to KL(p_post ‖ p̂), requires only the computation of expectations over tractable distributions. In this article, we use a different approach. To speed up the learning process in order to allow for the learning of large data sets, we aim at learning the data by a single sequential sweep through the examples. Let p̂_t denote the gaussian approximation after processing t examples; we use Bayes' rule,

p_{post}(\mathbf{f}) = \frac{P(y_{t+1} \mid \mathbf{f})\,\hat{p}_t(\mathbf{f})}{\langle P(y_{t+1} \mid \mathbf{f}_D)\rangle_t},
(2.5)
to derive the updated posterior. Since p_post is no longer gaussian, we use a variational technique in order to project it to the closest gaussian process p̂_{t+1} (see Figure 1). Unlike the usual variational method, we now minimize the divergence KL(p_post ‖ p̂). This is possible because in our on-line method, the posterior, equation 2.5, contains only the likelihood for a single example, and the corresponding nongaussian integral is one-dimensional, which can be performed analytically for many relevant cases. It is a simple exercise to show (Opper, 1998) that the projection results in the matching of the first two moments (mean and covariance) of p_post and the approximated gaussian posterior p̂_{t+1}. We expect that the use of the divergence KL(p_post ‖ p̂) has several advantages over other variational methods (Gibbs & MacKay, 1999; Williams & Barber, 1998; Jaakkola & Haussler, 1999; Seeger, 2000). First, this choice avoids the numerical optimizations that are usually necessary for the divergence with inverted arguments. Second, this method is very robust, allowing for arbitrary choices of the single data likelihood. The likelihood can be noncontinuous and may even vanish over some range of values of the process. Finally, if one interprets the KL divergence as the expectation of the relative log loss of two distributions, our choice of divergence weights the losses with the correct distribution rather than the approximated one.
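The moment-matching property of this choice of divergence can be illustrated numerically: among gaussian candidates q, the one matching the mean and variance of p minimizes KL(p ‖ q). A small sketch (the mixture "posterior" below is invented for the illustration):

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

# a non-gaussian "posterior": mixture of two gaussians
p = 0.7 * gauss(x, -1.0, 0.5) + 0.3 * gauss(x, 2.0, 1.0)
p /= p.sum() * dx

def kl(p, q):
    """KL(p || q) by quadrature on the grid."""
    mask = p > 1e-300
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx)

# moment matching: mean and variance of p
m = float(np.sum(x * p) * dx)
v = float(np.sum((x - m) ** 2 * p) * dx)
kl_match = kl(p, gauss(x, m, v))

# any gaussian with perturbed moments does worse
worse = [kl(p, gauss(x, m + dm, v * dv))
         for dm in (-0.5, 0.0, 0.5) for dv in (0.5, 1.0, 2.0)
         if not (dm == 0.0 and dv == 1.0)]
```

The moment-matched gaussian attains the smallest divergence among the candidates, in line with the exercise attributed to Opper (1998) above.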
We expect that this may correspond to an improved quality of approximation. In order to compute the on-line approximations of the mean and covariance kernel K_t, we apply lemma 1 sequentially with only one likelihood term P(y_t | x_t) at an iteration step. Proceeding recursively, we arrive at

\langle f_x\rangle_{t+1} = \langle f_x\rangle_t + q^{(t+1)}\,K_t(x, x_{t+1})

K_{t+1}(x, x') = K_t(x, x') + r^{(t+1)}\,K_t(x, x_{t+1})\,K_t(x_{t+1}, x'),
(2.6)

where the scalars q^{(t+1)} and r^{(t+1)} follow from lemma 1 (see appendix B for details):

q^{(t+1)} = \frac{\partial}{\partial\langle f_{t+1}\rangle_t}\,\ln\langle P(y_{t+1} \mid f_{t+1})\rangle_t

r^{(t+1)} = \frac{\partial^2}{\partial\langle f_{t+1}\rangle_t^2}\,\ln\langle P(y_{t+1} \mid f_{t+1})\rangle_t.
(2.7)
The averages in equation 2.7 are with respect to the gaussian process at time t, and the derivatives are taken with respect to \langle f_{t+1}\rangle_t = \langle f(x_{t+1})\rangle_t. Note again that these averages require only a one-dimensional integration over the process at the input x_{t+1}. Unfolding the recursion steps in the update rules, equation 2.6, we arrive at the parameterization for the approximate posterior GP at time t as a function of the initial kernel and the likelihoods ("natural parameterization"):

\langle f_x\rangle_t = \sum_{i=1}^{t} K_0(x, x_i)\,\alpha_t(i) = \boldsymbol{\alpha}_t^T \mathbf{k}_x

K_t(x, x') = K_0(x, x') + \sum_{i,j=1}^{t} K_0(x, x_i)\,C_t(ij)\,K_0(x_j, x') = K_0(x, x') + \mathbf{k}_x^T \mathbf{C}_t \mathbf{k}_{x'},
(2.8)
with coefficients \alpha_t(i) and C_t(ij) not depending on x and x' (for details, see appendix C). For simplicity, the values \alpha_t(i) are grouped into the vector \boldsymbol{\alpha}_t = [\alpha_t(1), \ldots, \alpha_t(t)]^T, \mathbf{C}_t = \{C_t(ij)\}_{i,j=1,t}, and we also use vectorial (typeset in bold) notation \mathbf{k}_x = [K_0(x_1, x), \ldots, K_0(x_t, x)]^T. The recursion for the GP parameters in equation 2.8 is found from the recursion equation 2.6 and the parameterization lemma:

\boldsymbol{\alpha}_{t+1} = T_{t+1}(\boldsymbol{\alpha}_t) + q^{(t+1)}\,\mathbf{s}_{t+1}

\mathbf{C}_{t+1} = U_{t+1}(\mathbf{C}_t) + r^{(t+1)}\,\mathbf{s}_{t+1}\mathbf{s}_{t+1}^T

\mathbf{s}_{t+1} = T_{t+1}(\mathbf{C}_t\mathbf{k}_{t+1}) + \mathbf{e}_{t+1},
(2.9)
where \mathbf{k}_{t+1} = \mathbf{k}_{x_{t+1}}, \mathbf{e}_{t+1} is the t+1th unit vector, and \mathbf{s}_{t+1} is introduced for clarity. We also introduced the operators T_{t+1} and U_{t+1}. They extend a t-dimensional vector and matrix to a t+1-dimensional one by appending zeros at the end of the vector and to the last row and column of the matrix, respectively. Since \mathbf{e}_{t+1} is the t+1th unit vector, we see that the dimension of the vector \boldsymbol{\alpha} and the size of the matrix \mathbf{C} increase with each likelihood point added. Equations 2.6 and 2.7 show some resemblance to the well-known extended Kalman filter. This is to be expected, because the latter approach can also be understood as a sequential propagation of an approximate gaussian distribution. However, the main difference between the two methods is in the way the likelihood model is incorporated. While the extended Kalman filter (see Bottou, 1998, for a general framework) is based on a linearization of the likelihood, our approach uses a more robust smoothing of the likelihood instead. The drawback of using equation 2.9 in practice is the quadratic increase of the number of parameters with the number of training examples. This is a feature common to most other methods of inference with gaussian processes. A modification of the learning rule that controls the number of parameters is the main contribution of this article and is detailed in the following.

3 Sparseness in Gaussian Processes
Sparseness can be introduced within the GP framework by using suitable approximations of the representation equation 2.8. Our goal is to perform an update without increasing the number of parameters \boldsymbol{\alpha} and \mathbf{C} when, according to a certain criterion, the error due to the approximation is not too large. This could be achieved exactly if the new input x_{t+1} were such that the relation

K_0(x, x_{t+1}) = \sum_{i=1}^{t} \hat{e}_{t+1}(i)\,K_0(x, x_i)
(3.1)

holds for all x. In such a case, we would have a representation for the updated process in the form of equation 2.8 using only the first t inputs, but with "renormalized" parameters \hat{\boldsymbol{\alpha}} and \hat{\mathbf{C}}. A glance at equation 2.9 shows that the only change would be the replacement of the vector \mathbf{s}_{t+1} by

\hat{\mathbf{s}}_{t+1} = \mathbf{C}_t\mathbf{k}_{t+1} + \hat{\mathbf{e}}_{t+1}.
(3.2)
Note that \hat{\mathbf{e}}_{t+1} is a vector of dimensionality t. Unfortunately, for most kernels and inputs x_{t+1}, equation 3.1 cannot be fulfilled for all x. Nevertheless, as
an approximation, we could try an update of the form 3.2, where \hat{\mathbf{e}}_{t+1} is determined by minimizing the error measure

\left\| K_0(\cdot, x_{t+1}) - \sum_{i=1}^{t} \hat{e}_{t+1}(i)\,K_0(\cdot, x_i) \right\|^2,
(3.3)
where ‖·‖ is a suitably defined norm in a space of functions of the inputs x (related optimization criteria in a function space are presented by Vijayakumar & Ogawa, 1999). Equation 3.3 becomes especially simple when the norm is based on the inner product of the reproducing kernel Hilbert space (RKHS) generated by the kernel K_0. In this case, for any two functions g and h that are represented as g(x) = \sum_i c_i K_0(x, u_i) and h(x) = \sum_i d_i K_0(x, v_i), for some arbitrary sets of u_i's and v_i's, the RKHS inner product is defined as (Wahba, 1990)

\left(g(\cdot), h(\cdot)\right)_{RKHS} = \sum_{ij} c_i d_j\,K_0(u_i, v_j)
(3.4)
with norm

\|g\|_{RKHS}^2 = \left(g(\cdot), g(\cdot)\right)_{RKHS} = \sum_{ij} c_i c_j\,K_0(u_i, u_j).
(3.5)
Hence, in this case, equation 3.3 is K 0 ( xt C 1 , xt C 1 ) C ¡2
t X i D1
t X i, j D 1
eOt C 1 ( i) eOt C 1 ( j) K0 ( xi , xj )
eOt C 1 ( i) K 0 ( xt C 1 , xi ) ,
(3.6)
and simple minimization of equation 3.6 yields (Smola & Schölkopf, 2000)

\hat{\mathbf{e}}_{t+1} = \mathbf{K}_t^{-1}\mathbf{k}_{t+1},
(3.7)
where \mathbf{K}_t = \{K_0(x_i, x_j)\}_{i,j=1,t} is the Gram matrix. The expression

\hat{K}_0(x, x_{t+1}) = \sum_{i=1}^{t} \hat{e}_{t+1}(i)\,K_0(x, x_i)
(3.8)
is simply the orthogonal projection (in the sense of the inner product, equation 3.4) of the function K_0(x, x_{t+1}) onto the linear span of the functions K_0(x, x_i). The approximate update using equation 3.2 will be performed
Figure 2: Visualization of the projection step. The new feature vector φ_{t+1} is projected onto the subspace spanned by {φ_1, ..., φ_t}, resulting in the projection φ̂_{t+1} and the orthogonal residual (the "left-out" quantity) φ_res. It is important that φ_res has t+1 components; it needs the extended basis including φ_{t+1}.
only when a certain measure of the approximation error (to be discussed later) is not exceeded. The set of inputs for which the exact update is performed and the number of parameters is increased will be called the basis vector set (BV set); an element will be a BV. Proceeding sequentially, some of the inputs are left out, and others are included in the BV set. However, due to the projection 3.8, the inputs left out of the BV set will still contribute to the final GP configuration—the one used for prediction and to measure the posterior uncertainties. But the latter inputs will not be stored and do not lead to an increase of the size of the parameter set. This procedure leads to the new representation for the posterior GP only in terms of the BV set and the corresponding parameters \boldsymbol{\alpha} and \mathbf{C}:

\langle f_x\rangle = \sum_{i\in BV} K_0(x, x_i)\,\alpha(i)

K(x, x') = K_0(x, x') + \sum_{i,j\in BV} K_0(x, x_i)\,C(ij)\,K_0(x_j, x').
(3.9)
An alternative derivation of these results can be obtained from the representation of the Mercer kernel (Wahba, 1990; Vapnik, 1995),

K_0(x, x') = \varphi(x)^T\varphi(x'),
(3.10)
in terms of the possibly infinite-dimensional "feature vector" \varphi(x) (Wahba, 1990). Minimizing equation 3.6 and using equation 3.2 for an update is equivalent to replacing the feature vector \varphi_{t+1} corresponding to the new input by its orthogonal projection,

\hat{\varphi}_{t+1} = \sum_i \hat{e}_{t+1}(i)\,\varphi_i,
(3.11)
onto the space of the old feature vectors (as in Figure 2). Note, however, that this derivation may be somewhat misleading by suggesting that the mapping induced by the feature vectors plays a special role in our approximation. This would be confusing because the representation, equation 3.10, is not unique. Our first derivation, based on the RKHS norm, shows, however, that our approximation uses only geometrical properties that are induced by the kernel K_0.

3.1 Projection-Induced Error. We need a rule to decide if the current input will be included in the BV set. We base the decision on a measure of the change in the sample-averaged posterior mean of the GP due to the sparse approximation. Assuming a learning scenario where only the basis vectors are memorized, we measure the change of the posterior mean due to the approximation by
\Delta\langle f_x\rangle_{t+1} = \langle f_x\rangle_{t+1} - \langle f_x\rangle^{\flat}_{t+1},

where \langle f_x\rangle^{\flat}_{t+1} is the posterior mean with respect to the approximated process. Summing up the absolute values of the changes for the elements in the BV set and the new input leads to
\varepsilon_{t+1} = \sum_{i=1}^{t+1} \left|\Delta\langle f_i\rangle_{t+1}\right| = |q^{(t+1)}| \sum_{i=1}^{t+1} \left|K_0(x_i, x_{t+1}) - \hat{K}_0(x_i, x_{t+1})\right| = |q^{(t+1)}|\,\left\|K_0(\cdot, x_{t+1}) - \hat{K}_0(\cdot, x_{t+1})\right\|^2_{RKHS},
(3.12)
where the second line follows from the orthogonal projection together with the definition of the inner product in the RKHS (see equation 3.5). It is an important observation that, also due to the orthogonal projection, the error is concentrated only on the last data point, since \hat{K}_0(\cdot, x_{t+1}) = K_0(\cdot, x_{t+1}) at the old data points x_i, i = 1, \ldots, t. Rewriting equation 3.12 using the coefficients for \hat{K}_0(\cdot, x_{t+1}) from equation 3.7, the error is

\varepsilon_{t+1} = |q^{(t+1)}|\left(k^*_{t+1} - \mathbf{k}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{k}_{t+1}\right) = |q^{(t+1)}|\,\gamma_{t+1},
(3.13)

where k^*_{t+1} = K_0(x_{t+1}, x_{t+1}) and q^{(t+1)} is given by equation 2.7. The error measure \varepsilon_{t+1} is a product of two terms. If the new input were included in the BV set, the corresponding coefficient \alpha_{t+1} in the posterior mean would be equal to q^{(t+1)}, which is the likelihood-dependent part. The second term,

\gamma_{t+1} = k^*_{t+1} - \mathbf{k}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{k}_{t+1},
(3.14)
gives the geometrical part, which is the squared norm of the "residual vector" from the projection in the RKHS (shown in Figure 2) or, equivalently, the "novelty" of the current input. If we use RBF kernels, then the error, equation 3.13, is similar to the one used in deciding whether new centers have to be included in the resource-allocating network (McLachlan & Lowe, 1996; Platt, 1991). To compute the geometrical component of the error \varepsilon_{t+1}, a matrix inversion is needed at each step. The costly matrix inversion can be avoided by keeping track of the inverse Gram matrix \mathbf{Q}_t = \mathbf{K}_t^{-1}. The updates for this matrix can also be expressed with the variables \gamma_{t+1} and \hat{\mathbf{e}}_{t+1} (for details, see appendix D), and these updates will be important when deleting a BV:

\mathbf{Q}_{t+1} = U_{t+1}(\mathbf{Q}_t) + \gamma_{t+1}^{-1}\left(T_{t+1}(\hat{\mathbf{e}}_{t+1}) - \mathbf{e}_{t+1}\right)\left(T_{t+1}(\hat{\mathbf{e}}_{t+1}) - \mathbf{e}_{t+1}\right)^T,
(3.15)

where U_{t+1} and T_{t+1} are the extension operators for a matrix and a vector, respectively (introduced in equation 2.9).

3.2 Deleting a Basis Vector. Our algorithm may run into problems when there is no possibility of including new inputs in the BV set without deleting one of the old basis vectors, because we are at the limit of our resources. This gives the motivation to implement pruning: whenever a new example is found important, one should get rid of the BV with the smallest error and replace it by the new input vector. First, we discuss the elimination of a BV and then the criterion based on which we choose the BV to be removed. To remove a basis vector from the BV set, we first assume that the respective BV has just been added; thus, the previous update step was done with \mathbf{e}_{t+1}, the t+1th unit vector. With this assumption, we identify the elements q^{(t+1)}, r^{(t+1)}, and \mathbf{s}_{t+1} from equation 2.9; compute \hat{\mathbf{e}}_{t+1} (this computation is also replaced by an identification from equation 3.15); and use equation 3.11 for an update without including the new point in the BV set. If we assume t+1 basis vectors, \boldsymbol{\alpha}_{t+1} has t+1 elements, and the matrices \mathbf{C}_{t+1} and \mathbf{Q}_{t+1} are (t+1) \times (t+1).
Further assuming that we want to delete the last added element, the decomposition is as illustrated in Figure 3. Computing the “previous” model parameters and then using the nonincreasing update leads to the deletion equations (see appendix E for details):
$$\hat{\alpha} = \alpha^{(t)} - \alpha^* \frac{Q^*}{q^*}, \qquad \hat{C} = C^{(t)} + c^* \frac{Q^* Q^{*T}}{q^{*2}} - \frac{1}{q^*}\left[Q^* C^{*T} + C^* Q^{*T}\right], \qquad \hat{Q} = Q^{(t)} - \frac{Q^* Q^{*T}}{q^*}, \tag{3.16}$$
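The bookkeeping in equations 3.15 and 3.16 is mostly index manipulation, so it can be checked numerically. The following numpy sketch (an arbitrary RBF kernel and random inputs, chosen purely for illustration rather than taken from the paper's experiments) grows the inverse Gram matrix one basis vector at a time and then removes the last one with the $\hat{Q}$ part of equation 3.16:

```python
import numpy as np

def rbf(a, b, s2=1.0):
    # spherical RBF kernel in the spirit of equation 4.1 (width absorbed in s2)
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * s2))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))              # four inputs in R^2 (illustrative)

Q = np.zeros((0, 0))                     # inverse Gram matrix of the BV set
for t in range(4):
    k = np.array([rbf(X[i], X[t]) for i in range(t)])
    kstar = rbf(X[t], X[t])
    e_hat = Q @ k                        # \hat{e}_{t+1} = Q_t k_{t+1}
    gamma = kstar - k @ e_hat            # novelty \gamma_{t+1}
    v = np.append(e_hat, -1.0)           # T_{t+1}(\hat{e}_{t+1}) - e_{t+1}
    Q = np.pad(Q, (0, 1)) + np.outer(v, v) / gamma   # equation 3.15

# deletion of the last BV: \hat{Q} = Q^{(t)} - Q* Q*^T / q*  (equation 3.16)
Q_star, q_star = Q[:-1, -1], Q[-1, -1]
Q_hat = Q[:-1, :-1] - np.outer(Q_star, Q_star) / q_star
```

After the loop, `Q` equals the inverse of the full Gram matrix, and `Q_hat` equals the inverse of the Gram matrix with the last basis vector removed, with no explicit matrix inversion performed.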
Lehel Csató and Manfred Opper
Figure 3: Grouping of the GP parameters for the update equation, 3.16.
where $\hat{\alpha}$, $\hat{C}$, and $\hat{Q}$ are the parameters after the deletion of the last basis vector, and $C^{(t)}$, $Q^{(t)}$, $\alpha^{(t)}$, $Q^*$, $C^*$, $q^*$, and $c^*$ are taken from the GP parameters before deletion. A graphical illustration of each element is provided in Figure 3. Of particular interest is the identification of the parameters $q^{(t+1)}$ and $\gamma_{t+1}$, since their product gives the score of the basis vector that is being deleted. This leads to the score
$$\varepsilon_{t+1} = \frac{\alpha^*}{q^*} = \frac{\alpha_{t+1}(t+1)}{Q_{t+1}(t+1,\,t+1)}. \tag{3.17}$$
Thus, we have the score for the last input point. Neglecting dependence of the GP posterior on the ordering of the data, equation 3.17 gives us a score measure for each element i in the BV set,
$$\varepsilon_i = \frac{|\alpha_{t+1}(i)|}{Q_{t+1}(i,\,i)}, \tag{3.18}$$
by rearranging the order in BV with element i at the last position. To summarize, if a deletion is needed, then the basis vector with minimal score (from equation 3.18) will be deleted. The scores are computationally cheap (linear in the size of the BV set).

3.3 The Sparse GP Algorithm. The following numerical experiments are based on the version of the algorithm that assumes a given maximal size for the BV set. We start by initializing the BV set with the empty set, the maximal number of elements in the BV set with d, the prior kernel with $K_0$, and a tolerance parameter $\epsilon_{\mathrm{tol}}$. The latter is used to prevent the Gram matrix from becoming singular and appears in step 2. The GP parameters $\alpha$, $C$, and the inverse Gram matrix $Q$ are set to empty values. For each data element $(y_{t+1}, x_{t+1})$, we iterate the following steps:
1. Compute $q^{(t+1)}$, $r^{(t+1)}$, $k_{t+1}^*$, $k_{t+1}$, $\hat{e}_{t+1}$, and $\gamma_{t+1}$.
2. If $\gamma_{t+1} < \epsilon_{\mathrm{tol}}$, then perform a reduced update with $\hat{s}_{t+1}$ from equation 3.2 without extending the size of the parameters $\alpha$ and $C$. Advance to the next data element.
3. (else) Perform the update equation 2.9 using the unit vector $e_{t+1}$. Add the current input to the BV set, and compute the inverse of the extended Gram matrix using equation 3.15.

4. If the size of the BV set is larger than d, then compute the scores $\varepsilon_i$ for all BVs from equation 3.18, find the basis vector with the minimum score, and delete it using equations 3.16.

The computational time for a single step is quadratic in d, the maximal number of BVs allowed. Iterating over all data, the computational time is $O(Nd^2)$. This is a significant improvement over the $O(N^3)$ scaling of the nonsparse GP algorithm. Since we propose a "subspace" algorithm, we have the same computing time as in Wahba (1990) and the Nyström approximation for kernel matrices in Williams and Seeger (2001). An important feature of the sparse GP algorithm is that we provide an approximation to the posterior kernel of the process, providing a predictive variance for new inputs, as shown by the error bars in Figure 4. Another aspect of the algorithm is that the basis vectors are selected at runtime from the data. A final remark is that the iterative computation of the inverse Gram matrix Q prevents the numerical instabilities caused by inverting a singular Gram matrix: $\gamma_{t+1}$ is zero if the new element $x_{t+1}$ would make the Gram matrix singular. The comparison of $\gamma_{t+1}$ with the preset tolerance value $\epsilon_{\mathrm{tol}}$ prevents this.

4 Experimental Results
In all experiments, we used spherical RBF kernels,

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2 d \sigma_K^2}\right), \tag{4.1}$$

where $\sigma_K$ is the width of the kernel and d is the input dimension.

4.1 Regression. In the regression model, we assume a multidimensional input $x \in \mathbb{R}^m$ and an output y with the likelihood
$$P(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left(-\frac{\|y - f_x\|^2}{2\sigma_0^2}\right). \tag{4.2}$$
Since the likelihood is gaussian, the use of a gaussian posterior in the on-line algorithm is exact. Hence, only the sparsity will introduce an approximation into our procedure. For a given number of examples, the parameterization 2.8 of the posterior in terms of $\alpha$ and $C$ leads to a predictive distribution of
Figure 4: Results of the GP regression with 1000 noisy training data points with noise $\sigma_0^2 = 0.02$. The figures show the results for RBF kernels with two different widths. (Left) The good fit of the GP mean function (continuous lines) to the true function (dashed lines) is also consistent with the tight Bayesian error bars (dash-dotted lines) around the means. (Right) The error bars are broader, reflecting the larger uncertainty. The BV set is marked with rhombs; we kept 10 and 15 basis vectors. We used $\sigma_K^2 = 1$ for the left and $\sigma_K^2 = 0.1$ for the right subfigure, respectively.
y for an input x,

$$p(y|x, \alpha, C) = \left(\frac{1}{2\pi\sigma_x^2}\right)^{1/2} \exp\left(-\frac{\|y - \alpha^T k_x\|^2}{2\sigma_x^2}\right), \tag{4.3}$$
with $\sigma_x^2 = \sigma_0^2 + k_x^T C_t k_x + k_x^*$. The on-line update rules, equation 2.9, for $\alpha$ and $C$ in terms of the parameters $q^{(t+1)}$ and $r^{(t+1)}$ are

$$q^{(t+1)} = \frac{y - \alpha_t^T k_x}{\sigma_x^2}, \qquad r^{(t+1)} = -\frac{1}{\sigma_x^2}. \tag{4.4}$$
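Because the gaussian likelihood makes the gaussian on-line updates exact, the recursion 2.9 with the coefficients of equation 4.4 must reproduce the batch GP regression posterior. The following numpy sketch (a made-up one-dimensional data set, not the experiments reported below) implements the nonsparse version of the updates:

```python
import numpy as np

def rbf(a, b, s2=1.0):
    return np.exp(-(a - b) ** 2 / (2.0 * s2))

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=8)
y = np.sin(x) / x + rng.normal(scale=np.sqrt(0.02), size=8)
s0sq = 0.02                              # noise variance sigma_0^2

alpha, C = np.zeros(0), np.zeros((0, 0))
for t in range(len(x)):
    k = rbf(x[:t], x[t])                 # k_{t+1}
    kstar = rbf(x[t], x[t])              # k*_{t+1}
    sxsq = s0sq + kstar + k @ C @ k      # sigma_x^2
    q = (y[t] - alpha @ k) / sxsq        # q^{(t+1)}, equation 4.4
    r = -1.0 / sxsq                      # r^{(t+1)}, equation 4.4
    s = np.append(C @ k, 1.0)            # s_{t+1} = T_{t+1}(C_t k_{t+1}) + e_{t+1}
    alpha = np.append(alpha, 0.0) + q * s
    C = np.pad(C, (0, 1)) + r * np.outer(s, s)
```

Since no approximation is involved in this nonsparse run, the final `alpha` and `C` agree with the batch quantities $(K + \sigma_0^2 I)^{-1} y$ and $-(K + \sigma_0^2 I)^{-1}$.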
To illustrate the performance, we begin with the toy example $y = \sin(x)/x + \eta$, where $\eta$ is zero-mean gaussian noise with variance $\sigma_0^2 = 0.02$. The results for the posterior means and the Bayesian error bars together with the basis vectors are shown in Figure 4. The large error bars obtained for the "misspecified" kernel (with small width $\sigma_K^2 = 0.1$) demonstrate the advantage of propagating not only the means but also the uncertainties in the algorithm. The second data set is the Friedman data set 1 (Friedman, 1991), an artificial data set frequently used to assess the performance of regression algorithms. For this example, we demonstrate the effect of the approximation introduced by the sparseness. The upper solid line in Figure 5 shows the development of the test error with increasing numbers of examples without sparseness, that is, when all data are included sequentially. The dots are the
Figure 5: Results for the Friedman data using the full GP regression (continuous line) and the proposed sparse GP algorithm with a fixed BV size (dots with error bars). The dash-dotted line is obtained by sequentially reducing the size of the BV set. The lines show the average performance over 50 runs. The full GP solution uses only the specified number of data, whereas the other two curves are obtained by iterating over the full data set ($\sigma_K^2 = 1$ was used with 300 training and 500 test data).
test errors obtained by running the sparse algorithm using different sizes of the BV set. We see that almost two-thirds of the original training set can be excluded from the BV set without a significant loss of predictive performance. Finally, we have tested the effect of the greediness of the algorithm by adding or removing examples in different ways. The dependence of the sparse GP on the data is shown with the error bars around the dots, and the dependence of the result on the different orderings is well within these error bars. The dash-dotted line is obtained by first running the on-line algorithm without sparseness on the full data set and then building BV sets of decreasing sizes by removing the least significant examples one after the other. Remarkably, the performance is rather stable against these variations in the plateau region of (almost) constant test error.

4.2 Classification. For classification, we use the probit model (Neal, 1997), where a binary value $y \in \{-1, 1\}$ is assigned to an input $x \in \mathbb{R}^m$ with the data likelihood
$$P(y|f_x) = \mathrm{Erf}\left(\frac{y f_x}{\sigma_0}\right). \tag{4.5}$$
Figure 6: Results for the binary (left) and multiclass (right) classification. The multiclass case is a combination of the 10 individual classifiers. The example x is assigned to the class with highest $P(C_i|x)$. We compare different sizes of the BV set and the effect of reusing data a second time (signified by the circles).
Erf(x) is the cumulative gaussian distribution,¹ with $\sigma_0$ the noise variance. The predictive distribution for a new example x is

$$p(y|x, \alpha, C) = \langle P(y|f_x)\rangle_t = \mathrm{Erf}\left(\frac{y \langle f_x\rangle}{\sigma_x}\right), \tag{4.6}$$
where $\langle f_x\rangle$ is the mean of the GP at x given by equation 2.8 and $\sigma_x^2 = \sigma_0^2 + k_x^* + k_x^T C k_x$. Based on equation 2.9, for a given input-output pair (x, y), the update coefficients $q^{(t+1)}$ and $r^{(t+1)}$ are computed (for details, see Csató et al., 2000):
$$q^{(t+1)} = \frac{y}{\sigma_x}\,\frac{\mathrm{Erf}'}{\mathrm{Erf}}, \qquad r^{(t+1)} = \frac{1}{\sigma_x^2}\left[\frac{\mathrm{Erf}''}{\mathrm{Erf}} - \left(\frac{\mathrm{Erf}'}{\mathrm{Erf}}\right)^2\right], \tag{4.7}$$
with Erf(z) evaluated at $z = y\,\alpha_t^T k_x / \sigma_x$, and Erf′ and Erf″ the first and second derivatives at z.

We have tested the sparse GP algorithm on the U.S. Postal Service (USPS) data set² of gray-scale handwritten digit images (of size 16 × 16) with 7291 training patterns and 2007 test patterns. In the first experiment, we studied the problem of classifying the digit 4 against all other digits. Figure 6 (left) plots the test errors of the algorithm for different BV set sizes and a fixed value of the hyperparameter $\sigma_K^2 = 1$.
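Reading Erf as the standard normal cdf (footnote 1), the coefficients in equation 4.7 are just the first two derivatives of $\ln \mathrm{Erf}(y\langle f_x\rangle/\sigma_x)$ with respect to the mean $\langle f_x\rangle$, which gives a direct numerical check. A sketch with illustrative numbers:

```python
import math

def pdf(z):
    # standard normal density
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def cdf(z):
    # Erf in the paper's notation (footnote 1): the standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_updates(y, mean, sx):
    # equation 4.7 with z = y * mean / sx, for y in {-1, +1};
    # Erf'(z) = pdf(z) and Erf''(z) = -z * pdf(z)
    z = y * mean / sx
    q = (y / sx) * pdf(z) / cdf(z)
    r = (1.0 / sx ** 2) * (-z * pdf(z) / cdf(z) - (pdf(z) / cdf(z)) ** 2)
    return q, r
```

Since the averaged probit likelihood is log-concave in the mean, the resulting $r^{(t+1)}$ is always negative.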
¹ $\mathrm{Erf}(x) = \int_{-\infty}^{x} dt\, \exp(-t^2/2)/\sqrt{2\pi}$.
² Available from http://www.kernel-machines.org/data/.
The USPS data set has been used previously to test the performance of other kernel-based classification algorithms that are based on sparse representations. We mention the kernel PCA method of Schölkopf et al. (1999) and the Nyström method of Williams and Seeger (2001). They obtained slightly better results than our on-line algorithm. When the basis of the Nyström approach is reduced to 512, the mean error is ≈1.7% (Williams & Seeger, 2001), and the PCA reduced-set method of Schölkopf et al. (1999) leads to an error rate of ≈5%. This may be due to the fact that the sequential replacement of the posterior by a gaussian is an approximation for the classification problem. Hence, some of the information contained in an example is lost even when the BV set contains all data. As shown in Figure 6, we observe a slight improvement when the algorithm sweeps several times through the data. However, it should be noted that the use of the algorithm (in its present form) on data that it has already seen is a mere heuristic and can no longer be justified from a probabilistic point of view. A change of the update rule based on a recycling of examples will be investigated elsewhere.

We have also tested our method on the more realistic problem of classifying all 10 digits simultaneously. Our ability to compute Bayesian predictive probabilities is essential in this case. We have trained 10 classifiers on the 10 binary classification problems of separating a single digit from the rest. A new input was assigned to the class with the highest predictive probability given by equation 4.6. Figure 6 summarizes the results for the multiclass case for different BV set sizes and gaussian kernels (with the external noise variance $\sigma_0^2 = 0$). In this case, the recycling of examples was of less significance. The gap between our on-line result and the batch performance reported in Schölkopf et al.
(1999) is also smaller; this might be due to the Bayesian nature of the GPs, which avoids overfitting. To reduce the computational cost, we used the same BV set for all individual classifiers (only a single inverse of the Gram matrix was needed, and the storage cost is also smaller). This made the implementation of deleting a basis vector for the multiclass case less straightforward. For each input and each basis vector, there are 10 individual scores. We implemented a minimax deletion rule: whenever a deletion was needed, the basis vector having the smallest maximum value among the 10 classifier problems was deleted; that is, the index l of the deleted input was

$$l = \arg\min_{i \in BV}\ \max_{c \in \{0,\dots,9\}} \varepsilon_i^c. \tag{4.8}$$
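The minimax rule of equation 4.8 is a one-line reduction over the table of per-classifier scores; a toy numpy sketch with made-up scores:

```python
import numpy as np

rng = np.random.default_rng(2)
scores = rng.uniform(size=(10, 6))   # eps_i^c: rows c = 0..9, columns i in BV

# equation 4.8: delete the BV whose worst-case (largest) score is smallest
l = int(np.argmin(scores.max(axis=0)))
```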
Figure 7 shows the evolution of the test error when the sparse GP algorithm was initially trained with 1500 BVs and (without any retraining) the "least scoring" basis vectors are deleted. As in the regression case (see Figure 5), we observe a long plateau of almost constant test error when up to 70% of the BVs are removed.
Figure 7: Performance of the combined classifier trained with an initial BV size of 1500 and a sequential removal of basis vectors.
5 Conclusion and Further Investigations
We have presented a greedy algorithm that allows computing a sparse gaussian approximation to the posterior of GP models with general (factorizing) likelihoods, based on a single sweep through the data set. So far, we have applied the method to regression and classification tasks and obtained a performance close to batch methods. The strength of the method lies in the fact that arbitrary, even noncontinuous likelihoods, which may be zero in certain regions, can be treated by our method. Such likelihoods may cause problems for other approximations based on local linearizations (advanced Kalman filters) or on the averaging of the log-likelihood (variational gaussian approximation). Our method merely requires the explicit computation of a gaussian smoothed likelihood and is thus well suited for cases where (local) likelihood functions can be modeled empirically as mixtures of gaussians. If such expressions are available, the necessary one-dimensional integrals can be done analytically, and the on-line updates require just matrix multiplications and function evaluations. A model of this structure, for which we have already obtained promising preliminary results, is the one used to predict wind fields from ambiguous satellite measurements based on a GP prior for the wind fields and a likelihood model for the measurement process.

A further development of the method requires the solution of various theoretical problems on which we are now working. An important problem is to assess the quality of our approximations. There are two sources of errors: one coming from the gaussian on-line approximation and another
stemming from the additional sparsity. In both cases, it is easy to obtain explicit expressions for single-step errors, but it is not obvious how to combine these in order to estimate the cumulative deviation between the true posterior and our approximation. It may be interesting to concentrate on the regression problem first because in this case, there is no approximation involved in computing the posterior. A different question is the (frequentist) statistical quality of the algorithm. Our on-line gaussian approximation (without sparseness) was found to be asymptotically efficient (in the sense of Fisher) in the finite-dimensional (parametric) case (Opper, 1996, 1998). This result does not trivially extend to the present infinite-dimensional GP case, and further studies are necessary. These may be based on the idea of an effective, finite dimensionality for the set of well-estimated parameters (Trecate, Williams, & Opper, 1999). Such work should also give an estimate for the sufficient number of basis vectors and explain the existence of the long plateaus (see Figures 5 and 7) with practically constant test errors. Besides a deeper understanding of the algorithm, we also find it important to improve our method. Our sparse approximation was found to preserve the posterior means on previous data points when projecting on a representation that leaves out the current example. A further improvement might be achieved if information on the posterior variance were also used (e.g., by taking the Kullback-Leibler loss rather than the RKHS norm) in optimizing the projection. This may, however, result in more complex and time-consuming updates. Our experiments show that in some cases, the performance of the on-line algorithm is inferior to a batch method. We expect that our algorithm can be adapted to a recycling of data (e.g., along the lines of Minka, 2000) such that convergence to a sparse representation of the TAP mean-field method (Opper & Winther, 1999) is achieved.
A further drawback that will be addressed in future work is the lack of an (on-line) adaptation of the kernel hyperparameters. Rather than setting them by hand, an approximate propagation of posterior distributions for the hyperparameters would be desirable. Finally, there may be cases of probabilistic models where the restriction to unimodal posteriors, as given by the gaussian approximation, is too severe. Although we know that for gaussian regression and classification with logistic or probit functions the posterior is unimodal, for more complicated models, an on-line propagation of a mixture of GPs should be considered.
Acknowledgments
We thank Bernhard Schottky for initiating the gaussian process parameterization lemma. The work was supported by EPSRC grant no. GR/M81608.
Appendix A: Properties of Zero-Mean Gaussians
The following property of the gaussian probability density function (pdf) is often used in this article; here we state it in the form of a theorem.

Theorem 1. Let $x \in \mathbb{R}^m$ and $p(x)$ a zero-mean gaussian pdf with covariance $\Sigma = \{\Sigma_{ij}\}$ (i, j from 1 to m). If $g: \mathbb{R}^m \to \mathbb{R}$ is a differentiable function not growing faster than a polynomial, with partial derivatives
$$\partial_j g(x) = \frac{\partial}{\partial x_j}\, g(x),$$
then

$$\int_{\mathbb{R}^m} dx\, p(x)\, x_i\, g(x) = \sum_{j=1}^{m} \Sigma_{ij} \int_{\mathbb{R}^m} dx\, p(x)\, \partial_j g(x). \tag{A.1}$$
In the following, we assume definite integration over $\mathbb{R}^m$ whenever an integral appears. Alternatively, using vector notation, the above identity reads

$$\int dx\, p(x)\, x\, g(x) = \Sigma \int dx\, p(x)\, \nabla g(x). \tag{A.2}$$
For a general gaussian pdf with mean $\mu$, equation A.2 transforms to

$$\int dx\, p(x)\, x\, g(x) = \mu \int dx\, p(x)\, g(x) + \Sigma \int dx\, p(x)\, \nabla g(x). \tag{A.3}$$

Proof. The proof uses the partial integration rule,
$$\int dx\, p(x)\, \nabla g(x) = -\int dx\, g(x)\, \nabla p(x),$$
where we have used the fast decay of the gaussian function to dismiss the boundary term. Using the derivative of a gaussian pdf, $\nabla p(x) = -\Sigma^{-1} x\, p(x)$, we have

$$\int dx\, p(x)\, \nabla g(x) = \int dx\, g(x)\, \Sigma^{-1} x\, p(x).$$
Multiplying both sides by $\Sigma$ leads to equation A.2, completing the proof. For the nonzero mean, the deduction is analogous.
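Theorem 1 is straightforward to verify numerically. The following Monte Carlo sketch (covariance and test function chosen arbitrarily for illustration) compares the two sides of equation A.2:

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])                      # an arbitrary covariance
X = rng.multivariate_normal(np.zeros(2), Sigma, size=400_000)

def g(x):
    # a smooth test function of polynomial growth
    return np.sin(x[:, 0]) * x[:, 1]

def grad_g(x):
    # its gradient: (cos(x0) * x1, sin(x0))
    return np.stack([np.cos(x[:, 0]) * x[:, 1], np.sin(x[:, 0])], axis=1)

lhs = (X * g(X)[:, None]).mean(axis=0)              # E[x g(x)]
rhs = Sigma @ grad_g(X).mean(axis=0)                # Sigma E[grad g(x)]
```

With 400,000 samples, the two estimates agree to within Monte Carlo error.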
Appendix B: Proof of the Parameterization Lemma
Using Bayes' rule, the posterior process has the form

$$\hat{p}(f) = \frac{p_0(f)\, P(D|f)}{\int df\, p_0(f)\, P(D|f)},$$
where f is a set of realizations of the random process indexed by arbitrary points from $\mathbb{R}^m$, the inputs for the GPs. We first compute the mean function of the posterior process:

$$\langle f_x\rangle_{\mathrm{post}} = \int df\, \hat{p}(f)\, f_x = \frac{\int df\, p_0(f)\, f_x\, P(D|f)}{\int df\, p_0(f)\, P(D|f)} = \frac{1}{Z} \int df_x \prod_{i=1}^{N} df_i\; p_0(f_x, f_1, \dots, f_N)\, f_x\, P(D|f_1, \dots, f_N), \tag{B.1}$$
where the denominator is denoted by Z and we also use index notation for the realizations of the process (thus, $f(x) = f_x$ and $f(x_i) = f_i$). Observe that regardless of the number of random variables of the process considered, the dimension of the integral we need to consider is only N + 1; all other random variables integrate out (as in equation B.1). We thus have an (N+1)-dimensional integral in the numerator, and Z is an N-dimensional integral. If we group the variables related to the data as $f_D = [f_1, \dots, f_N]^T$ and apply theorem 1 (see equation A.1), replacing $x_i$ by $f_x$ and $g(x)$ by $P(D|f_D)$, we have
$$\langle f_x\rangle_{\mathrm{post}} = \frac{1}{Z}\left(\langle f_x\rangle_0 \int df_x\, df_D\; p_0(f_x, f_D)\, P(D|f_D) + \sum_{i=1}^{N} K_0(x, x_i) \int df_x\, df_D\; p_0(f_x, f_D)\, \partial_i P(D|f_D)\right), \tag{B.2}$$
where $K_0$ is the kernel function generating the covariance matrix ($\Sigma$ in theorem 1). The variable $f_x$ in the integrals disappears since it is contained only in $p_0$. Substituting back Z leads to

$$\langle f_x\rangle_{\mathrm{post}} = \langle f_x\rangle_0 + \sum_{i=1}^{N} K_0(x, x_i)\, q_i, \tag{B.3}$$
where $q_i$ is read off from equation B.2,

$$q_i = \frac{\int df_D\, p_0(f_D)\, \partial_i P(D|f_D)}{\int df_D\, p_0(f_D)\, P(D|f_D)}, \tag{B.4}$$
and the coefficients $q_i$ depend only on the data and are independent of x, the point at which the posterior mean is evaluated.

We can simplify the expression for $q_i$ by performing a change of variables in the numerator, $f_i' = f_i - \langle f_i\rangle_0$, where $\langle f_i\rangle_0$ is the prior mean at $x_i$, keeping all other variables unchanged, $f_j' = f_j$ for $j \neq i$, leading to the numerator

$$\int df_D'\; p_0(f_D')\, \partial_i P(D|f_1', \dots, f_i' + \langle f_i\rangle_0, \dots, f_N'),$$

where the differentiation is with respect to $f_i'$. We then replace the partial differentiation with respect to $f_i'$ by the partial differentiation with respect to $\langle f_i\rangle_0$ and exchange the differentiation and integral operators (they apply to distinct sets of variables), leading to

$$\frac{\partial}{\partial \langle f_i\rangle_0} \int df_D'\; p_0(f_D')\, P(D|f_1', \dots, f_i' + \langle f_i\rangle_0, \dots, f_N').$$

We then perform the inverse change of variables inside the integral and substitute back into the expression for $q_i$:

$$q_i = \frac{\frac{\partial}{\partial \langle f_i\rangle_0} \int df_D\, p_0(f_D)\, P(D|f_D)}{\int df_D\, p_0(f_D)\, P(D|f_D)} = \frac{\partial}{\partial \langle f_i\rangle_0} \ln \int df_D\, p_0(f_D)\, P(D|f_D). \tag{B.5}$$

Writing the expression for the posterior kernel,

$$K_{\mathrm{post}}(x, x') = \langle f_x f_{x'}\rangle_{\mathrm{post}} - \langle f_x\rangle_{\mathrm{post}}\langle f_{x'}\rangle_{\mathrm{post}}, \tag{B.6}$$
and applying theorem 1 twice leads to

$$K_{\mathrm{post}}(x, x') = K_0(x, x') + \sum_{i=1}^{N}\sum_{j=1}^{N} K_0(x, x_i)\left(D_{ij} - q_i q_j\right) K_0(x_j, x'), \tag{B.7}$$
where $D_{ij}$ is

$$D_{ij} = \frac{1}{Z} \int df_D\, p_0(f_D)\, \frac{\partial^2}{\partial f_i\, \partial f_j} P(D|f_D). \tag{B.8}$$
Identifying $R_{ij} = D_{ij} - q_i q_j$ leads to the required parameterization in equation 2.3 from lemma 1. The expression for $R_{ij} = D_{ij} - q_i q_j$ is simplified by changing the arguments of the partial derivatives and using the logarithm of the expectation (repeating steps B.4 and B.5 for $q_i$), leading to

$$R_{ij} = \frac{\partial^2}{\partial \langle f_i\rangle_0\, \partial \langle f_j\rangle_0} \ln \int df_D\, p_0(f_D)\, P(D|f_D), \tag{B.9}$$

and using a single datum in the likelihood leads to the scalar coefficients $q^{(t+1)}$ and $r^{(t+1)}$ from equation 2.7.
Appendix C: On-Line Learning in GP Framework
We prove equation 2.8 by induction. We will show that for every time step, we can express the mean and kernel functions with coefficients $\alpha$ and $C$ given by the recursion (also equation 2.9):
$$\alpha_{t+1} = T_{t+1}(\alpha_t) + q^{(t+1)} s_{t+1}, \tag{C.1}$$
$$C_{t+1} = U_{t+1}(C_t) + r^{(t+1)} s_{t+1} s_{t+1}^T, \qquad s_{t+1} = T_{t+1}(C_t k_{t+1}) + e_{t+1}, \tag{C.2}$$
where $\alpha$ and $C$ depend only on the data points $x_i$ and the kernel function $K_0$, but do not depend on the values x and x' (from equation 2.8) at which the mean and kernel functions are computed. Proceeding by induction and using the initialization $\alpha_0 = C_0 = 0$, for time t = 1 we have $\alpha_1(1) = q^{(1)}$ and $C_1(1,1) = r^{(1)}$. The mean function at time t = 1 is $\langle f_x\rangle = \alpha_1(1) K_0(x_1, x)$ (from lemma 1 for a single datum; see equation 2.6). Similarly, the modified kernel is $K_1(x, x') = K_0(x, x') + K_0(x, x_1)\, C_1(1,1)\, K_0(x_1, x')$, with $\alpha$ and $C$ independent of x and x', proving the base case. We assume that at time t, we have the parameters $\alpha_t$ and $C_t$ independent of the points x and x'. These parameters specify a prior GP to which we apply the on-line learning:
$$\langle f_x\rangle_{t+1} = \sum_{i=1}^{t} K_0(x_i, x)\,\alpha_t(i) + q^{(t+1)}\sum_{i,j=1}^{t} K_0(x, x_i)\, C_t(i,j)\, K_0(x_j, x_{t+1}) + q^{(t+1)} K_0(x, x_{t+1}) = \sum_{i=1}^{t+1} K_0(x, x_i)\,\alpha_{t+1}(i), \tag{C.3}$$
and by pairwise identification we obtain equation C.1 from the main body. The parameters $\alpha_{t+1}$ do not depend on the particular value of x. Writing down the update equation for the kernels,

$$K_{t+1}(x, x') = K_t(x, x') + r^{(t+1)} K_t(x, x_{t+1})\, K_t(x_{t+1}, x'),$$

leads to equation C.2 in a straightforward manner, with $C_{t+1}(i,j)$ independent of x and x', completing the induction.

Appendix D: Iterative Computation of the Inverse Gram Matrix
In the sparse approximation, equation 3.7, we need the inverse of the Gram matrix of the BV set, $K_{BV} = \{K_0(x_i, x_j)\}$. In the following, the elements
of the BV set are indexed from 1 to t. Using the matrix inversion formula,³ the addition of a new element can be carried out sequentially. This is a well-known fact, exploited also in the Kalman filter algorithm. We consider the new element at the end (last row and column) of matrix $K_{t+1}$. Matrix $K_{t+1}$ is decomposed as

$$K_{t+1} = \begin{bmatrix} K_t & k_{t+1} \\ k_{t+1}^T & k_{t+1}^* \end{bmatrix}. \tag{D.1}$$

Assuming $K_t^{-1}$ known and applying the matrix inversion lemma to $K_{t+1}$,

$$K_{t+1}^{-1} = \begin{bmatrix} K_t & k_{t+1} \\ k_{t+1}^T & k_{t+1}^* \end{bmatrix}^{-1} = \begin{bmatrix} K_t^{-1} + \gamma_{t+1}^{-1}\, K_t^{-1} k_{t+1} k_{t+1}^T K_t^{-1} & -\gamma_{t+1}^{-1}\, K_t^{-1} k_{t+1} \\ -\gamma_{t+1}^{-1}\, k_{t+1}^T K_t^{-1} & \gamma_{t+1}^{-1} \end{bmatrix}, \tag{D.2}$$
where $\gamma_{t+1} = k_{t+1}^* - k_{t+1}^T K_t^{-1} k_{t+1}$ is the geometric term from equation 3.14. Using the notations $K_t^{-1} k_{t+1} = \hat{e}_{t+1}$ from equation 3.7, $K_t^{-1} = Q_t$, and $K_{t+1}^{-1} = Q_{t+1}$, we have the recursion

$$Q_{t+1} = \begin{bmatrix} Q_t + \gamma_{t+1}^{-1}\, \hat{e}_{t+1} \hat{e}_{t+1}^T & -\gamma_{t+1}^{-1}\, \hat{e}_{t+1} \\ -\gamma_{t+1}^{-1}\, \hat{e}_{t+1}^T & \gamma_{t+1}^{-1} \end{bmatrix}, \tag{D.3}$$
and, in a more compact matrix notation (with $Q_t$ implicitly extended by a zero row and column),

$$Q_{t+1} = Q_t + \gamma_{t+1}^{-1}\left(\hat{e}_{t+1} - e_{t+1}\right)\left(\hat{e}_{t+1} - e_{t+1}\right)^T, \tag{D.4}$$
where $e_{t+1}$ is the $(t+1)$th unit vector. With this recursion equation, all matrix inversion is eliminated (this result is general for block matrices; such an implementation, together with an interpretation of the parameters, has also been given in Cauwenberghs & Poggio, 2001). Using the score (see equation 3.17) and including in the BV set only inputs with nonzero scores, the Gram matrix is guaranteed to be nonsingular; $\gamma_{t+1} > 0$ guarantees nonsingularity of the extended Gram matrix.

³ A useful guide to formulas for matrix inversions and block matrix manipulation can be found at Sam Roweis's home page: http://www.gatsby.ucl.ac.uk/~roweis/notes.html.

For numerical stability, we can use the Cholesky decomposition of the inverse Gram matrix Q. Using the lower triangular matrix R with the corresponding indices and the identity $Q = R^T R$, we have the update for the Cholesky decomposition,
$$R_{t+1} = \begin{pmatrix} R_t & 0 \\ -\gamma_{t+1}^{-1/2}\, \hat{e}_{t+1}^T & \gamma_{t+1}^{-1/2} \end{pmatrix}, \tag{D.5}$$
which is a computationally very inexpensive operation, provided that the quantities $\gamma_{t+1}$ and $\hat{e}_{t+1}$ have already been computed.

Appendix E: Deleting a BV
Adding a basis vector is done with the equations

$$s_{t+1} = T_{t+1}(C_t k_{t+1}) + e_{t+1}, \qquad \alpha_{t+1} = T_{t+1}(\alpha_t) + q^{(t+1)} s_{t+1}, \tag{E.1}$$
$$C_{t+1} = U_{t+1}(C_t) + r^{(t+1)} s_{t+1} s_{t+1}^T, \tag{E.2}$$
$$Q_{t+1} = U_{t+1}(Q_t) + \gamma_{t+1}^{-1}\left(T_{t+1}(\hat{e}_{t+1}) - e_{t+1}\right)\left(T_{t+1}(\hat{e}_{t+1}) - e_{t+1}\right)^T, \tag{E.3}$$
where $\alpha$ and $C$ are the GP parameters, Q is the inverse Gram matrix, $\gamma_{t+1}$ and $\hat{e}_{t+1}$ are the geometric terms of the new basis vector, $k_{t+1} = [K_0(x_1, x_{t+1}), \dots, K_0(x_t, x_{t+1})]^T$, and $e_{t+1}$ is the $(t+1)$th unit vector.

The optimal reduction of the BV set needs an answer to two questions. The first question is how to delete a basis vector from the set of basis vectors with minimal loss of information. Given such a method, we then have to find the BV to remove. The first problem is solved by inverting the learning equations, E.1 through E.3. Assuming $\alpha_{t+1}$, $C_{t+1}$, and $Q_{t+1}$ known and using pairwise correspondence for the $(t+1)$th element of $\alpha_{t+1}$, we
can identify $q^{(t+1)} = \alpha_{t+1}(t+1) \stackrel{\mathrm{def}}{=} \alpha_{t+1}^*$ (the notations are illustrated in Figure 3). Using similar correspondences for the matrix $C_{t+1}$, the following identifications can be made:
$$r^{(t+1)} = C_{t+1}(t+1,\,t+1) \stackrel{\mathrm{def}}{=} c_{t+1}^*, \qquad C_t k_{t+1} = \frac{C_{t+1}(1..t,\,t+1)}{c_{t+1}^*} \stackrel{\mathrm{def}}{=} \frac{C_{t+1}^*}{c_{t+1}^*}, \tag{E.4}$$
with $c_{t+1}^*$ and $C_{t+1}^*$ sketched in Figure 3. Substituting back into equations E.1 and E.2, the old values of the GP parameters are

$$\alpha_t = \alpha_{t+1}^{(t)} - \alpha_{t+1}^*\, \frac{C_{t+1}^*}{c_{t+1}^*}, \tag{E.5}$$

$$C_t = C_{t+1}^{(t)} - \frac{C_{t+1}^* C_{t+1}^{*T}}{c_{t+1}^*}, \tag{E.6}$$
where $\alpha_{t+1}^{(t)} = T_{t+1}^{-1}(\alpha_{t+1})$ and $T_{t+1}^{-1}$ is the inverse operator that takes the first t elements of a $(t+1)$-dimensional vector. We define $C_{t+1}^{(t)} = U_{t+1}^{-1}(C_{t+1})$ similarly. Proceeding analogously, using elements of the matrix $Q_{t+1}$, the correspondence with equation E.3 is as follows:
$$\gamma_{t+1} = \frac{1}{Q_{t+1}(t+1,\,t+1)} \stackrel{\mathrm{def}}{=} \frac{1}{q_{t+1}^*}, \qquad \hat{e}_{t+1} = -\frac{Q_{t+1}(1..t,\,t+1)}{q_{t+1}^*} \stackrel{\mathrm{def}}{=} -\frac{Q_{t+1}^*}{q_{t+1}^*}, \tag{E.7}$$
with the reduced-set matrix $\hat{Q}_{t+1}$:

$$\hat{Q}_{t+1} = Q_{t+1}^{(t)} - \frac{Q_{t+1}^* Q_{t+1}^{*T}}{q_{t+1}^*}. \tag{E.8}$$
The matrix Q does not need any further modification; however, for $\alpha$ and $C$, a sparse update is needed. Substituting $\gamma_{t+1}$ and $\hat{e}_{t+1}$ together with the "old" parameters $\alpha_t$ and $C_t$, we have the "optimally reduced" GP parameters:
$$\hat{\alpha}_t = \alpha_{t+1}^{(t)} - \alpha_{t+1}^*\, \frac{Q_{t+1}^*}{q_{t+1}^*}, \tag{E.9}$$

$$\hat{C}_t = C_{t+1}^{(t)} + c_{t+1}^*\, \frac{Q_{t+1}^* Q_{t+1}^{*T}}{q_{t+1}^{*2}} - \frac{1}{q_{t+1}^*}\left[Q_{t+1}^* C_{t+1}^{*T} + C_{t+1}^* Q_{t+1}^{*T}\right]. \tag{E.10}$$
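The identifications E.4 through E.10 can be checked mechanically: run one full update (E.1-E.3), undo it with E.9 and E.10, and compare with the reduced update $\alpha_t + q^{(t+1)}\hat{s}_{t+1}$, $C_t + r^{(t+1)}\hat{s}_{t+1}\hat{s}_{t+1}^T$ computed directly, where $\hat{s}_{t+1} = C_t k_{t+1} + \hat{e}_{t+1}$. A numpy sketch using the regression coefficients of equation 4.4 on made-up data (illustrative only):

```python
import numpy as np

def rbf(a, b):
    return np.exp(-(a - b) ** 2 / 2.0)

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=4)
y = rng.normal(size=4)
s0sq = 0.1

# exact on-line regression updates for the first three points
alpha, C, Q = np.zeros(0), np.zeros((0, 0)), np.zeros((0, 0))
for t in range(3):
    k, kstar = rbf(x[:t], x[t]), 1.0
    e_hat = Q @ k
    gamma = kstar - k @ e_hat
    sxsq = s0sq + kstar + k @ C @ k
    q, r = (y[t] - alpha @ k) / sxsq, -1.0 / sxsq
    s = np.append(C @ k, 1.0)
    alpha = np.append(alpha, 0.0) + q * s
    C = np.pad(C, (0, 1)) + r * np.outer(s, s)
    v = np.append(e_hat, -1.0)
    Q = np.pad(Q, (0, 1)) + np.outer(v, v) / gamma

# fourth point: full update (E.1-E.3) ...
k, kstar = rbf(x[:3], x[3]), 1.0
e_hat = Q @ k
gamma = kstar - k @ e_hat
sxsq = s0sq + kstar + k @ C @ k
q, r = (y[3] - alpha @ k) / sxsq, -1.0 / sxsq
s = np.append(C @ k, 1.0)
a_full = np.append(alpha, 0.0) + q * s
C_full = np.pad(C, (0, 1)) + r * np.outer(s, s)
v = np.append(e_hat, -1.0)
Q_full = np.pad(Q, (0, 1)) + np.outer(v, v) / gamma

# ... then deletion via the identifications E.7, E.9, and E.10
a_st = a_full[-1]                                  # alpha*
C_st, c_st = C_full[:-1, -1], C_full[-1, -1]       # C*, c*
Q_st, q_st = Q_full[:-1, -1], Q_full[-1, -1]       # Q*, q*
a_del = a_full[:-1] - a_st * Q_st / q_st                           # E.9
C_del = (C_full[:-1, :-1] + c_st * np.outer(Q_st, Q_st) / q_st**2
         - (np.outer(Q_st, C_st) + np.outer(C_st, Q_st)) / q_st)   # E.10

# reference: the reduced update that never adds the fourth basis vector
s_hat = C @ k + e_hat
a_ref = alpha + q * s_hat
C_ref = C + r * np.outer(s_hat, s_hat)
```

The agreement is exact here because deleting a just-added basis vector is, by construction, the same projection as never having added it.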
References

Berliner, L. M., Wikle, L., & Cressie, N. (2000). Long-lead prediction of Pacific SST via Bayesian dynamic modelling. Journal of Climate, 13, 3953-3968.

Bernardo, J. M., & Smith, A. F. (1994). Bayesian theory. New York: Wiley.

Bottou, L. (1998). Online learning and stochastic approximations. In D. Saad (Ed.), On-line learning in neural networks (pp. 9-42). Cambridge: Cambridge University Press.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Csató, L., Fokoué, E., Opper, M., Schottky, B., & Winther, O. (2000). Efficient approaches to gaussian process classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.

Evans, D. J., Cornford, D., & Nabney, I. T. (2000). Structured neural network modelling for multi-valued functions for wind vector retrieval from satellite scatterometer measurements. Neurocomputing, 30, 23-30.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1-141.

Gibbs, M., & MacKay, D. J. (1999). Efficient implementation of gaussian processes (Tech. Rep.). Cambridge: Department of Physics, Cavendish Laboratory, Cambridge University. Available at: http://wol.ra.phy.cam.ac.uk/mackay/abstracts/gpros.html.

Jaakkola, T., & Haussler, D. (1999). Probabilistic kernel regression. In Online Proceedings of 7th Int. Workshop on AI and Statistics. Available at: http://uncertainty99.microsoft.com/proceedings.htm.

Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33, 82-95.

Lemm, J. C., Uhlig, J., & Weiguny, A. (2000). A Bayesian approach to inverse quantum statistics. Phys. Rev. Lett., 84, 2068-2071.

McLachlan, A., & Lowe, D. (1996). Tracking of non-stationary time-series using resource-allocating RBF networks. In R. Trappl (Ed.), Cybernetics and Systems '96 (pp. 1066-1071). Proceedings of the 13th European Meeting on Cybernetics and Systems Research. Vienna: Austrian Society for Cybernetic Studies.

Minka, T. P. (2000). Expectation propagation for approximate Bayesian inference. Unpublished doctoral dissertation, MIT. Available at: vismod.www.media.mit.edu/~tpminka/.

Neal, R. M. (1997). Regression and classification using gaussian process priors (with discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 6). New York: Oxford University Press. Available at: ftp://ftp.cs.utoronto.ca/pub/radford/mc-gp.ps.Z.

Opper, M. (1996). Online versus offline learning from random examples: General results. Phys. Rev. Lett., 77(22), 4671-4674.

Opper, M. (1998). A Bayesian approach to online learning. In D. Saad (Ed.), On-line learning in neural networks (pp. 363-378). Cambridge: Cambridge University Press.

Opper, M., & Winther, O. (1999). Gaussian processes and SVM: Mean field results and leave-one-out estimator. In A. Smola, P. Bartlett, B.
Sch¨olkopf, & C. Schuurmans (Eds.), Advances in large margin classiers (pp. 43–65). Cambridge, MA: MIT Press. Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213–225. Sch¨olkopf, B., Mika, S., Burges, C. J., Knirsch, P., Muller, ¨ K.-R., R¨atsch, G., & Smola, A. J. (1999). Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5), 1000–1017. Seeger, M. (2000). Bayesian model selection for support vector machines, gaussian processes and other kernel classiers. In S. A. Solla, T. K. Leen, & K.-R. Muller ¨ (Eds.), Advances in neural informationprocessingsystems,12. Cambridge, MA: MIT Press. Smola, A. J., & Sch¨olkopf, B. (2000). Sparse greedy matrix approxmation for machine learning. In International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann. Available at: www.kernel-machines.org/ papers/upload 4467 kfa long.ps.gz.
668
Lehel CsatÂo and Manfred Opper
Tipping, M. (2000). The relevance vector machine. In S. A. Solla, T. K. Leen, & K.-R. Muller ¨ (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press. Trecate, G. F., Williams, C. K. I., & Opper, M. (1999). Finite-dimensional approximation of gaussian processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press. Tresp, V. (2000). A Bayesian committee machine. Neural Computation, 12(11), 2719–2741. Vapnik, V. N. (1995). The nature of statistical learning theory. New York: SpringerVerlag. Vijayakumar, S., & Ogawa, H. (1999). RKHS based functional analysis for exact incremental learning. Neurocomputing, 29(1–3), 85–113. Wahba, G. (1990). Splines models for observational data. Philadelphia: SIAM. Williams, C. K. I. (1999). Prediction with gaussian processes. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press. Williams, C. K. I., & Barber, D. (1998). Bayesian classication with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342–1351. Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press. Williams, C. K. I., & Seeger, M. (2001). Using the Nystr om ¨ method to speed up kernel machines. In T. K. Leen, T. G. Diettrich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press. Received February 23, 2001; accepted May 24, 2001.
Communicated by Erkki Oja
LETTER
Orthogonal Series Density Estimation and the Kernel Eigenvalue Problem Mark Girolami
[email protected] Laboratory of Computing and Information Science, Helsinki University of Technology FIN-02015 HUT, Finland
Kernel principal component analysis has been introduced as a method of extracting a set of orthonormal nonlinear features from multivariate data, and many impressive applications are being reported within the literature. This article presents the view that the eigenvalue decomposition of a kernel matrix can also provide the discrete expansion coefficients required for a nonparametric orthogonal series density estimator. In addition to providing novel insights into nonparametric density estimation, this article provides an intuitively appealing interpretation for the nonlinear features extracted from data using kernel principal component analysis.

1 Introduction

Kernel principal component analysis (KPCA) is an elegant method of extracting nonlinear features from data, the number of which may exceed the dimensionality of the data (Schölkopf, Smola, & Müller, 1996, 1998). There have been many notable applications of KPCA for the denoising of images and extracting features for subsequent use in linear support vector classifiers (Schölkopf et al., 1996; Schölkopf, Burges, & Smola, 1999). Computationally efficient methods have been proposed in Rosipal and Girolami (2001) for the extraction of nonlinear components from a Gram matrix, thus obviating the computationally burdensome requirement of diagonalizing a potentially high-dimensional Gram matrix.1 In KPCA, the implicit nonlinear mapping from input space to a possibly infinite-dimensional feature space often makes it difficult to interpret features extracted from the data. However, by considering the estimation of a probability density function from a finite data sample using an orthogonal
1 The term Gram matrix refers to the N × N kernel matrix. The terms kernel matrix and Gram matrix may be used interchangeably.
Neural Computation 14, 669–688 (2002)
© 2002 Massachusetts Institute of Technology
series, some insights into the nature of the features extracted by KPCA can be provided. Section 2 briefly reviews orthogonal series density estimation, and Section 3 introduces the notion of using KPCA to extract the orthonormal features required in constructing a finite series density estimator. Section 4 considers the important aspect of selecting the appropriate number of components that should appear in the series. Section 5 highlights the fact that the quadratic Renyi entropy of the data sample can be estimated using the associated Gram matrix. This strengthens the view that KPCA provides features that can be viewed as estimated components of the underlying data density. Section 6 provides some illustrative examples, and section 7 is devoted to conclusions and related discussion.

2 Finite Sequence Density Estimation

The estimation of a probability density by the construction of a finite series of orthogonal functions is briefly described in this section. (See Izenman, 1991, and the references in it for a complete exposition of this nonparametric method of density estimation.) We first consider density estimation using an infinite-length series expansion.

2.1 Infinite-Length Sequence Density Estimator. A probability density function that is square integrable can be represented by a convergent orthogonal series expansion (Izenman, 1991), such that

p(x) = \sum_{k=1}^{\infty} c_k \Psi_k(x),    (2.1)
where x \in \mathbb{R}^d. The Kronmal and Tarter (1968) criterion retains the kth term of the series whenever the estimated coefficient satisfies

\hat{c}_k^2 > \frac{2}{1+N} \left\{ \frac{1}{N} \sum_{n=1}^{N} \Psi_k^2(x_n) \right\}.

Substituting the eigenvector approximations for the eigenfunctions in the above equation gives the following cutoff threshold:

\{ 1^T u_k \}^2 > \frac{2N}{1+N} (u_k^T u_k) = \frac{2N}{1+N}.    (4.2)

If the sample size is large, then \frac{2N}{1+N} \to 2, and the stopping criterion is simply \{ 1^T u_k \}^2 > 2. Much has been written in the literature of nonparametric statistics regarding the stopping criteria for orthogonal series density estimators. (See Diggle & Hall, 1986, for an extensive overview of these.) This section has shown that the eigenvalue decomposition of the Gram matrix (KPCA) provides features that can be used in density function estimation based on a finite sample estimate of a truncated expansion of orthonormal basis functions. One criterion for the selection of the appropriate eigenvectors that will appear in the series has been considered. The following section presents the nonparametric estimation of Renyi entropy from a data sample. The importance of the constructed Gram matrix along with the associated eigenspectrum is considered.
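As a concrete sketch of the selection rule above, the code below eigendecomposes an RBF Gram matrix and keeps the eigenvectors whose scores \{1^T u_k\}^2 exceed the large-sample cutoff of 2. This is a minimal illustration, not the article's code; the kernel width and the toy data are arbitrary choices.

```python
import numpy as np

def kt_selected(X, width=0.5):
    """Eigendecompose an RBF Gram matrix and apply the large-N
    Kronmal-Tarter cutoff {1^T u_k}^2 > 2 from equation 4.2."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * width ** 2))       # N x N Gram matrix
    lam, U = np.linalg.eigh(K)               # eigh returns ascending order
    U = U[:, ::-1]                           # sort eigenvectors descending
    ones = np.ones(len(X))                   # unnormalized vector of ones
    scores = (ones @ U) ** 2                 # {1^T u_k}^2 for each eigenvector
    return np.where(scores > 2.0)[0]         # indices passing the cutoff

# Toy data: two tight clusters; only a few eigenvectors should pass.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))])
print(kt_selected(X))
```

For well-separated clusters, the retained indices are typically the few leading eigenvectors, consistent with the eigenspectra discussed in section 6.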
6 This does not take into account the error in approximating the eigenfunctions using the estimated eigenvectors.
5 Nonparametric Estimation of Quadratic Renyi Entropy

Thus far, the discussion regarding the form of the kernel appearing in equation 3.1 has been general, and no specific kernel has been assumed. Let us now consider specifically a gaussian RBF kernel. Note that the quadratic Renyi entropy, defined as

H_{R2}(X) = -\log \int p(x)^2 \, dx,    (5.1)
can easily be estimated using a nonparametric Parzen estimator based on an RBF kernel. The above integral formed a measure of distribution compactness in Friedman and Tukey (1974) and has recently been used for certain forms of information-theoretic learning (Principe, Fisher, & Xu, 2000). Denoting an isotropic gaussian computed at x centered at \mu with covariance \Lambda as \mathcal{N}_x(\mu, \Lambda), employing the standard result for a nonparametric density estimator using gaussian kernels and noting the convolution theorem for gaussians, the following holds:

\int p(x)^2 \, dx \approx \int \hat{p}(x)^2 \, dx = \frac{1}{N^2} \int \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \mathcal{N}_x(x_i, \Lambda) \, \mathcal{N}_x(x_j, \Lambda) \right\} dx = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \mathcal{N}_{x_i}(x_j, 2\Lambda).
For an RBF kernel with a common width of 2\Lambda, it is clear that the quadratic integral can be estimated from the sum of each element in the Gram matrix, in other words,

\int \hat{p}(x)^2 \, dx = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K(x_i, x_j) = 1_N^T K 1_N,    (5.2)
where each N × 1 vector 1_N has each element equal to 1/N. Now we can see that the contribution to the overall estimated data entropy of each orthonormal component vector can be viewed using an eigenvalue decomposition of the Gram matrix:

\int \hat{p}(x)^2 \, dx = \sum_{k=1}^{N} \tilde{\lambda}_k \{ 1_N^T u_k \}^2 = \sum_{k=1}^{N} \hat{E}_k.
It is clear that large contributions to the entropy will come from components that have small values of \tilde{\lambda}_k \{ 1_N^T u_k \}^2 and can be attributed to elements with little or no structure. This can be considered as the contribution caused by observation noise in some cases or diffuse regions in the data. Large values of \tilde{\lambda}_k \{ 1_N^T u_k \}^2 therefore indicate regions of high density or compactness and are also indicative of possible modes of the density or underlying class and cluster structure. Interestingly, the integral \int \hat{p}(x)^2 \, dx considered in the computation of the quadratic Renyi entropy also defines the squared norm of the functional form of \hat{p}(x), such that
\int \hat{p}(x)^2 \, dx = \| \hat{p} \|_H^2 = \langle \hat{\mu}_\Psi, \hat{\mu}_\Psi \rangle = \frac{1}{N^2} \sum_{n=1}^{N} \sum_{m=1}^{N} \sum_{i=1}^{\infty} \lambda_i \psi_i(x_n) \psi_i(x_m) = 1_N^T K 1_N = 1_N^T U S U^T 1_N = \sum_{k=1}^{N} \hat{E}_k.
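The identity just derived, the quadratic integral written both as 1_N^T K 1_N and as a sum of per-eigenvector terms \tilde{\lambda}_k \{1_N^T u_k\}^2, can be checked numerically. A small sketch under illustrative assumptions (gaussian toy data, an arbitrary RBF width; the gaussian normalizing constant of the Parzen estimate is ignored, as in equation 5.2):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (200, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 0.5 ** 2))            # RBF Gram matrix

ones_N = np.full(len(X), 1.0 / len(X))      # 1_N with entries 1/N
quad = ones_N @ K @ ones_N                  # estimate of the quadratic integral

lam, U = np.linalg.eigh(K)                  # Gram matrix eigendecomposition
per_comp = lam * (ones_N @ U) ** 2          # lambda_k {1_N^T u_k}^2 per component
assert np.isclose(quad, per_comp.sum())     # the two routes agree

H_R2 = -np.log(quad)                        # quadratic Renyi entropy estimate
print(H_R2)
```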
The above equation provides a more general result in that the kernel used in estimating the quadratic integral need not be restricted to an RBF form. The main point being made here is that the Gram matrix is fundamental to the estimation of the Renyi entropy, which is based on data density. When creating a Gram matrix for extraction of nonlinear features using, for example, KPCA, from the arguments presented in this and the previous section, the key is to choose a kernel that provides a reasonable estimate of the underlying data density. Because different types of kernel produce varying forms of associated eigenfunction, it is clear that the eigenfunctions should be appropriate for the density to be estimated. For example, an RBF kernel has eigenfunctions of the form of normalized Hermite polynomials (Zhu, Williams, Rohwer, & Morciniec, 1998; Williams & Seeger, 2001) and as such would be suitable for distributions with infinite support. The following section provides a number of illustrative examples.

6 Simulation

The first simulation provides a two-dimensional illustration of the extracted features from the Gram matrix and how these can be interpreted. Analytic solutions to the one-dimensional form of equation 3.1 have been provided in Zhu et al. (1998) and Williams and Seeger (2001). An RBF kernel is used, and the weighting function is taken as a gaussian, in which case the eigenfunctions take the form of normalized Hermite polynomials (Kreyszig, 1989). Normalized Hermite polynomials form an orthonormal sequence such that in the univariate case,

\Psi_i(x) = \frac{1}{(2^i \, i! \, \sqrt{\pi})^{1/2}} \exp(-x^2/2) \, H_i(x),    (6.1)
where the Hermite polynomials H_i(x) are defined by the recursion H_0(x) = 1 and

H_i(x) = (-1)^i \exp(x^2) \frac{d^i}{dx^i} \exp(-x^2).    (6.2)
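Equations 6.1 and 6.2 can be verified numerically. The sketch below builds the normalized Hermite functions using the equivalent three-term recursion H_0 = 1, H_1 = 2x, H_{i+1} = 2x H_i - 2i H_{i-1} (a standard restatement of equation 6.2, assumed here) and checks orthonormality by quadrature:

```python
import numpy as np
from math import factorial, pi, sqrt

def psi(i, x):
    """Normalized Hermite function of equation 6.1, via the standard
    recursion H_0 = 1, H_1 = 2x, H_{i+1} = 2x H_i - 2i H_{i-1}."""
    h_prev, h = np.ones_like(x), 2 * x
    if i == 0:
        h = h_prev
    else:
        for n in range(1, i):
            h_prev, h = h, 2 * x * h - 2 * n * h_prev
    norm = sqrt(2 ** i * factorial(i) * sqrt(pi))
    return np.exp(-x ** 2 / 2) * h / norm

x = np.linspace(-10, 10, 4001)
G = np.array([[np.trapz(psi(i, x) * psi(j, x), x) for j in range(4)]
              for i in range(4)])
print(np.round(G, 6))   # approximately the 4 x 4 identity matrix
```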
Figure 1: A sample of 300 two-dimensional points where 100 are drawn from each of three two-dimensional gaussian clusters. (Left) Scatterplot of the 300 points drawn from the gaussians with an isotropic variance of value 0.1. (Right) The same points with additive gaussian noise whose variance is 0.38.
The generalization of the univariate form to a multivariate representation is straightforward (Tou & Gonzalez, 1974). Consider a sample of two-dimensional data points distributed in a similar manner to those presented in Schölkopf et al. (1998) for illustrative purposes. Three clusters of identical variance with value \sigma = 0.1 and centers \mu = [0.0, 0.7; \ 0.7, -0.7; \ -0.7, -0.7] are generated. The left-hand plot of Figure 1 shows the scatter plot of the data. The density of the data corresponds to the general form of mixture such that p(x) = \sum_{k=1}^{K} c_k \mathcal{N}_x(\mu_k, \sigma I), where the usual constraints \sum_k c_k = 1 hold. Now note that an orthonormal set of basis functions with respect to the density of the data p(x) must satisfy

\sum_{k=1}^{K} c_k \int \mathcal{N}_x(\mu_k, \sigma I) \, Q_i(x - \mu_k) \, Q_j(x - \mu_k) \, dx = \delta_{ij}.    (6.3)
Because the weighting function is a gaussian, two-dimensional Hermite polynomials of the form H_i(x - \mu_k) will satisfy this requirement. This indicates that each of the components of the mixture (clusters in this instance) will have a set of orthonormal Hermite polynomial functions associated with them. An RBF kernel of equal width to the individual cluster components was used, and 300 points were drawn from the distribution (see Figure 1). An eigenvalue decomposition was performed on the associated Gram matrix, and the eigenvectors were used in forming the series density estimate.
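The setup just described can be sketched as follows. This is an illustrative reconstruction, not the article's code: it draws 300 points from the three-cluster mixture, eigendecomposes the RBF Gram matrix, selects eigenvectors with the large-sample Kronmal-Tarter cutoff, and evaluates the series density estimate \hat{p}_M(x') = 1_N^T U_M U_M^T k(x') discussed later in the article (up to the kernel's normalizing constant). The kernel width is set to the cluster width, as in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
centres = np.array([[0.0, 0.7], [0.7, -0.7], [-0.7, -0.7]])
X = np.vstack([c + np.sqrt(0.1) * rng.normal(size=(100, 2)) for c in centres])
N, width = len(X), np.sqrt(0.1)              # RBF width equal to cluster width

def rbf(A, B):
    """RBF kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * width ** 2))

K = rbf(X, X)
lam, U = np.linalg.eigh(K)
U = U[:, np.argsort(lam)[::-1]]              # eigenvectors, descending eigenvalue
keep = (np.ones(N) @ U) ** 2 > 2.0           # large-N Kronmal-Tarter cutoff
U_M = U[:, keep]

def p_hat(Xnew):
    """Series density estimate p_M(x') = 1_N^T U_M U_M^T k(x')."""
    one_N = np.full(N, 1.0 / N)
    return one_N @ U_M @ U_M.T @ rbf(X, Xnew)

grid = rng.uniform(-1.5, 1.5, (500, 2))
print(keep.sum(), p_hat(grid).mean())
```

Note that, like any truncated orthogonal series estimator, the resulting \hat{p}_M can dip slightly negative away from the data.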
Figure 2: The nonlinear features extracted from the data of Figure 1, shown starting from the top row. The characteristic structure of two-dimensional orthonormal Hermite polynomial functions is clear and most evident.
Figure 2 shows the nonlinear features associated with each of the first sixteen eigenvectors. These are clearly estimates of the orthonormal Hermite polynomial expansion coefficients associated with the density. The first four extracted features for one of the clusters are shown in Figure 3. It is clear that the extracted features are indeed, up to a rotation and scaling, estimates of the orthonormal Hermite polynomial functions (see Figure 3). In essence, what we see is that the nonlinear features extracted by kernel PCA using an RBF kernel are estimates of the orthonormal Hermite polynomial components associated with the underlying data distribution. The empirical Kullback-Leibler divergence (KLD) between the actual density and each estimate (a mixture density estimated using the expectation-maximization algorithm, and the KPCA-based estimate) was computed for a range of isotropic variance values ranging from 0.05 to 0.5, in 0.05 increments. The empirical KLD was computed
Figure 3: (Top) The features of the first four eigenvectors associated with one of the clusters in the data. (Bottom) The first four two-dimensional orthonormal Hermite polynomial functions.
using the standard form.
Three hundred sample points were used to create the Gram matrix and estimate the parameters of the gaussian mixture. Six hundred uniformly distributed points within the region defined in Figure 1 were used to compute the KLD for both methods. The mean KLD for the KPCA method was 0.036 compared to 0.043 for the mixture method over the range of values. The comparison shows similar performance for both methods on this particular two-dimensional data, as would be expected. The next simulation illustrates the smoothing effect of the KPCA density estimation method. The 300 points from the clustered data in the previous simulation had gaussian noise of variance 0.38 added; Figure 1 shows the original data and the noisy samples. From Figure 4, it is quite obvious that three components are all that is required to estimate the distribution corresponding to the noiseless data. However, in the case of the noisy data, a slowly decaying eigenspectrum can be seen (see Figure 5). However, when examining the contributions to the error of each eigenvector, it is still apparent that there are three significant generators of the data. The first three eigenvectors satisfy the Kronmal and Tarter criterion, after which only a small number of eigenvectors satisfy the criterion. The right-hand plot of Figure 6 shows the estimated density contour plots when the first three eigenvectors are retained in the series expansion. This should be contrasted to that which a Parzen window estimator yields (middle plot of Figure 6). The smoothing effect can be noted due to the removal of series elements that capture the diffuse areas within the data and the sharpening of the three modes in the sample.
Figure 4: (Top) The first 50 eigenvalues of the Gram matrix created from the noiseless and well-separated clusters of Figure 1. Due to the distinct structure in the data, there are only three dominant eigenvalues. (Bottom) The contribution to the overall integrated square error of each of the first 50 eigenvectors. Again it is clear that only three eigenvectors satisfy the Kronmal and Tarter (1968) criterion, and it is these that are required for the density estimate.
The final illustrative simulation uses data drawn from a uniform distribution with finite support. The left-hand plot of Figure 7 shows the data drawn from a uniform distribution within the annular region that satisfies 9 \le x^2 + y^2 \le 25. The Parzen window estimated density iso-contours are superimposed on these. The adjacent plot in Figure 7 gives the contour plot of the estimated density using the kernel PCA method. Figures 8 and 9 show the related nonlinear features and the relative importance of each.

7 Conclusion

Kernel PCA has proven to be an extremely useful method for extracting nonlinear features from a data set, and its utility has been demonstrated
Figure 5: (Top) The first 50 eigenvalues of the Gram matrix created from the noisy clustered data of Figure 1. It is apparent that the slow exponential decay of the eigenvalues is attributed to the high level of additive noise on the finite number of observations. The fast decay of the previous example is now difficult to discern from the eigenvalues alone. (Bottom) The contribution to the overall integrated square error of each of the first 50 eigenvectors. It is apparent that there are only three dominant eigenvectors required for the majority of the density estimate. It is also clear that only the first three eigenvectors satisfy the Kronmal and Tarter (1968) criterion before it is violated, and it is these that are retained for the density estimate.
on, among other applications, many complex and demanding classification problems. An intuitive insight into the nature of these particular features has been somewhat lacking to date. This article has presented an argument that KPCA (the eigendecomposition of a Gram matrix created using a specific kernel) provides features that can be considered as components of an orthogonal series density estimate. This follows somewhat from the observations made in Williams and Seeger (2001) regarding the effect of the data distribution on kernel-based classifiers.
Figure 6: (Left) The estimated probability density contour plot of the data consisting of the three noiseless clusters using the kernel PCA-based orthogonal series approach. A Parzen window estimator using a gaussian kernel with a width of 0.1 gives an identical result. (Middle) The contour plots of the density estimate, using a Parzen window estimator, for the noisy data in Figure 1. (Right) The density for the noisy data using an orthogonal series estimator that consists of the first three eigenvectors of the Gram matrix that satisfied the Kronmal and Tarter (1968) criterion. A smoothing of the effects of the noise on the density estimate is apparent in this example.
Figure 7: (Left) The scatterplot of 1000 points drawn from a uniform annular ring centered at the origin with uniform width. Superimposed on this are the iso-contours of the estimated probability density using a Parzen window estimator. (Right) The iso-contours of the estimated probability density using the kernel PCA method. An RBF kernel of unit width was used in this experiment. It is apparent that the uniform region of support has been extended in the density estimate due to the infinite support of the RBF kernel. Only eight eigenvectors satisfied the stopping criterion, and these were retained in the series expansion estimate, which amounts to a representation that uses 0.8% of the possible features.
Figure 8: (Top) The eigenvalue spectrum for the first 50 eigenvalues of the kernel. (Bottom) The values of \tilde{\lambda}_k \{ 1_N^T u_k \}^2 associated with each eigenvector. Eight of the eigenvectors satisfy the Kronmal and Tarter (1968) criterion.
The probability density function estimate provided by the relevant M eigenvectors, \hat{p}_M(x') = 1_N^T U_M U_M^T k(x'), can be seen to be a smoothed Parzen window estimate, where the matrix U_M U_M^T acts to smooth the estimate based on the data sample. In some sense, this can be seen as a reduced-set representation of the density function estimate based on the retained eigenvectors of the Gram matrix. The decomposition of the Gram matrix shows how each of the eigenvectors contributes to the overall data entropy (or norm of the functional form of the estimated density). Components that are related to the possible class structure (or modes) have a large sum-squared value, while those that are attributed to unstructured noise have low values. These particular values correspond to the induced error in the series density estimate when the related eigenvectors are discarded. One point of note is the accuracy of the eigenvectors as estimates of the corresponding eigenfunctions and their effect on the density estimate. It is
Figure 9: The nonlinear features associated with the eigenvectors of the Gram matrix. The characteristic Hermite polynomial structure of the features is most apparent.
noted that the series representation is very sparse, with only eight components required to form the series density estimator for the data uniformly distributed within the annular region (see Figure 7). This amounts to the removal of 99.2% of the possible components in the series, with the large majority corresponding to the smaller eigenvalues being discarded. Williams and Seeger (2001) show that for an RBF kernel, the accuracy of the estimates of the dominant eigenvalues is good, and this deteriorates for the estimation of the smaller values. The important components for the density function estimate are situated at the top end of the eigenspectrum, which do not suffer the effects of poor estimation. The view presented here thus links kernel PCA to nonparametric density estimation and provides an insight into the significance of the associated nonlinear features. This view may prove useful when considering kernel PCA as a means of nonlinear feature extraction for classifier design or data clustering and, of course, nonparametric density estimation.
Acknowledgments

This work was carried out under the support of the Finnish National Technology Agency TEKES. I am grateful to Harri Valpola and Erkki Oja for helpful discussions regarding this work. Many thanks to the anonymous reviewers for their helpful comments.

References

Delves, L. M., & Mohamed, J. L. (1985). Computational methods for integral equations. Cambridge: Cambridge University Press.
Diggle, P. J., & Hall, P. (1986). The selection of terms in an orthogonal series density estimator. Journal of the American Statistical Association, 81, 230–233.
Friedman, J. H., & Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computing, 23, 881–890.
Izenman, A. J. (1991). Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86, 205–224.
Kreyszig, E. (1989). Introductory functional analysis with applications. New York: Wiley.
Kronmal, R., & Tarter, M. (1968). The estimation of probability densities and cumulatives by Fourier series methods. Journal of the American Statistical Association, 63, 925–952.
Mukherjee, S., & Vapnik, V. (1999). Support vector method for multivariate density estimation (AI Memo 1653). Available at: ftp://publications.ai.mit.edu/ai-publications/1500-1999/AIM-1653.ps.
Ogawa, H., & Oja, E. (1986). Can we solve the continuous Karhunen-Loève eigenproblem from discrete data? Transactions of the IECE of Japan, 69(9), 1020–1029.
Principe, J., Fisher III, J., & Xu, D. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering. New York: Wiley.
Rosipal, R., & Girolami, M. (2001). An expectation maximisation approach to nonlinear component analysis. Neural Computation, 13(3), 500–505.
Schölkopf, B., Burges, C., & Smola, A. (Eds.). (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., & Müller, K.-R. (1996). Nonlinear component analysis as a kernel eigenvalue problem (Tech. Rep. MPI TR 44). Available at: http://www.kernel-machines.org.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.
Tou, J. T., & Gonzalez, R. C. (1974). Pattern recognition principles. Reading, MA: Addison-Wesley.
Williams, C. K. I., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In P. Langley (Ed.), Proceedings of the Seventeenth International Conference on Machine Learning, 2000 (pp. 1159–1166). San Mateo, CA: Morgan Kaufmann.
Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 682–688). Cambridge, MA: MIT Press.
Zhu, H., Williams, C. K. I., Rohwer, R. J., & Morciniec, M. (1998). Gaussian regression and optimal finite dimensional linear models. In C. M. Bishop (Ed.), Neural networks and machine learning. Berlin: Springer-Verlag.

Received September 26, 2000; accepted May 30, 2001.
LETTER
Communicated by Carsten Peterson
Natural Discriminant Analysis Using Interactive Potts Models Jiann-Ming Wu
[email protected] Department of Applied Mathematics, National Donghwa University, Shoufeng, Hualien 941,Taiwan, Republic of China Natural discriminant analysis based on interactive Potts models is developed in this work. A generative model composed of piece-wise multivariate gaussian distributions is used to characterize the input space, exploring the embedded clustering and mixing structures and developing proper internal representations of input parameters. The maximization of a log-likelihood function measuring the tness of all input parameters to the generative model, and the minimization of a design cost summing up square errors between posterior outputs and desired outputs constitutes a mathematical framework for discriminant analysis. We apply a hybrid of the mean-eld annealing and the gradient-descent methods to the optimization of this framework and obtain multiple sets of interactive dynamics, which realize coupled Potts models for discriminant analysis. The new learning process is a whole process of component analysis, clustering analysis, and labeling analysis. Its major improvement compared to the radial basis function and the support vector machine is described by using some articial examples and a real-world application to breast cancer diagnosis. 1 Introduction
The task of discriminant analysis (Hastie & Simard, 1998; Hastie & Tibshirani, 1996; Hastie, Tibshirani, & Buja, 1994; Hastie, Buja, & Tibshirani, 1995) aims to achieve a mapping function from a parameter space R^d to a set of discrete and nonordered output labels S subject to interpolating conditions proposed by training samples \{(x_i, q_i), \ 1 \le i \le N, \ x_i \in R^d, \ q_i \in S\}. S is represented by \{e_1^M, e_2^M, \ldots, e_M^M\} for the discrete and nonordered property of output labels, where M denotes the number of categories and e_m^M is a unitary vector of M elements with only the mth bit one. The category of an item is predicted by d measurements of features x \in R^d. The following design cost quantitatively measures the fitness of all training samples to a mapping function g: R^d \to S:

D = \sum_{1 \le i \le N} L(g(x_i), q_i),    (1.1)
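As a minimal reading of equation 1.1 (with the assumption, one of many admissible choices, that L is the squared Euclidean distance between unitary label vectors):

```python
import numpy as np

def design_cost(pred_labels, true_labels, M):
    """Design cost D = sum_i L(g(x_i), q_i) of equation 1.1, with L taken
    as the squared distance between unitary (one-hot) label vectors."""
    E = np.eye(M)
    diff = E[pred_labels] - E[true_labels]   # rows are e^M_m vectors
    return (diff ** 2).sum()

# Each disagreement between one-hot vectors contributes 2 to the cost.
print(design_cost(np.array([0, 1, 2, 1]), np.array([0, 1, 1, 1]), M=3))  # -> 2.0
```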
Neural Computation 14, 689–713 (2002)
© 2002 Massachusetts Institute of Technology
where L(\cdot) denotes an arbitrary distance between two unitary vectors. A mapping function absolutely minimizing the design cost automatically satisfies all interpolating conditions proposed by training samples; however, such a mapping function can be obtained by simply recording all samples in a look-up table. It is not the purpose of supervised learning, since without any other objective, a learning process subject only to the design cost is an ill-posed problem. Coming from a future validation by testing samples, the generalization cost proposes another essential criterion. Both the training set and the testing set are assumed to have the same underlying input-output relation. An effective supervised learning process is expected to minimize the design cost as well as the generalization cost. The minimization of this generalization cost has been considered as the reduction of model size in the field of statistics (Hastie & Simard, 1998; Hastie & Tibshirani, 1996; Hastie et al., 1994, 1995), which is used for maximal generalization in this work. The derivation of our new learning process starts with a generative model responsible for characterizing the parameter space on the basis of a nonoverlapping partition. Assume that there exist K internal regions V_k, 1 \le k \le K, in the partition; each region V_k is centered at y_k and is defined by V_k = \{x \mid \arg\min_j \|x - y_j\|_{A_j} = k, \ x \in R^d\}, where \|x\|_A denotes the Mahalanobis distance \sqrt{x' A x}. The local distribution of the input parameter in each region V_k is modeled by a multivariate gaussian distribution with a mean vector at the center y_k and a covariance matrix A_k. As a special case, all local generative models are assumed to have the same covariance matrix A in this work to facilitate our presentations. The kernels \{y_k\} partition the space R^d into K nonoverlapping subspaces with the property \bigcup_k V_k = R^d and V_{k_1} \cap V_{k_2} = \emptyset for all k_1 \ne k_2. The fitness of all parameters in each internal region V_k to the corresponding local generative model is quantitatively measured by a log-likelihood function. In this work, the supervised learning process for discriminant analysis is a process of maximizing the sum of all log-likelihood functions and minimizing the design cost. This work insists on solving the task of discriminant analysis by collective decisions performed by the architecture of neural networks. The formulation of the above two objectives involves two kinds of variables: discrete combinatorial variables and continuous geometrical variables. The resulting optimization framework is a mixed integer and linear programming, of which the optimization is difficult for the gradient-descent method due to numerous shallow local minima within the corresponding energy function. The Potts encoding, which possesses flexibility in internal representations and reliability in collective decisions, is employed to deal with this computational difficulty. The Potts encoding is suitable for the design of neural networks and has been applied to fundamental complex tasks, including combinatorial optimizations (Peterson & Söderberg, 1989), self-organization (Liou & Wu, 1996; Rose, Gurewitz, & Fox, 1990, 1993), classification, and regression (Rao, Miller, Rose, & Gersho, 1999). The multistate
Potts neuron, generalized from the two-state spin neuron, is used to reduce the search space of feasible configurations and realize the problem modeling for effective internal representations. In this work, the combinatorial internal representations include the assignment of each input parameter x_i to one and only one internal region, denoted by the membership vector \delta_i \in \{e_1^K, \ldots, e_K^K\}, and the dynamical assignment of each region V_k to one output label, denoted by the category response \xi_k \in \{e_1^M, \ldots, e_M^M\}. Each \delta_i or \xi_k is considered as a discrete Potts neural variable. The continuous geometrical variables include the kernels \{y_k\} and the common covariance matrix A. By these representations, the maximization of the sum of all log-likelihood functions, each measuring the fitness of the corresponding local generative model, and the minimization of the design cost together form a mixed integer and linear programming and lead to a novel energy function for discriminant analysis. All these variables, \{\delta_i\}, \{\xi_k\}, \{y_k\}, and A, are collectively optimized by a hybrid of the mean-field annealing and gradient-descent methods toward the minimization of the energy function. The resulting learning process consists of four sets of interactive dynamics characterizing the coupled Potts models of discriminant analysis. The evolution of the four sets of interactive dynamics is well controlled by an annealing process for the minimization of the energy function. The annealing process is analogous to physical annealing, which is a process of gradually and carefully scaling the temperature from a sufficiently large scale to a small one. At each temperature, the mean configuration of the whole system is a balancing result of trading off minimizing the mean energy against maximizing the entropy. When this process is used, mean activations of a Potts neuron, indicating probabilities of active states, are increasingly influenced by injected mean fields.
At the beginning, mean activations are independent of injected mean fields; the system is ruled by the principle of maximal entropy, and a Potts neuron has almost the same probability of activating each of its states. As the process progresses, the symmetry is broken; each Potts neuron has a decreasing degree of freedom, and the mean configuration of the system is increasingly dominated by the tendency toward the minimal mean energy and decreasingly by the criterion of maximal entropy. Toward the end of the process, the mean configuration is totally controlled by the force of minimal mean energy, and the mean activations of a Potts neuron behave winner-take-all. The Potts encoding has been applied to model collective decisions (Liou & Wu, 1996; Peterson & Söderberg, 1989). Its applicability to the task of discriminant analysis is explored in this work.
1.1 The Learning Network. The learning process is composed of four sets of interactive dynamics, which constitute the coupled Potts model in architecture. The coupled Potts model is a modular recurrent neural network
Figure 1: The learning network is composed of four interactive dynamics and the interconnection network.
for supervised learning. As shown in Figure 1, the coupled Potts model consists of four interactive modules, of which two are Potts neural networks calculating the means ⟨δ⟩ and ⟨ξ⟩ of the combinatorial variables {δ_i} and {ξ_k}, respectively, and the others are linear networks for updating the kernels {y_k} and the covariance matrix A. The four interactive modules communicate with each other through interconnection networks. The same learning network has appeared in implementing recurrent backpropagation (Pineda, 1987, 1989) with two modules.

1.2 The Discriminant Network. The discriminant network derived by the new learning process is closely related to networks of multilayer perceptrons (MLP) (Rumelhart & McClelland, 1986; Pineda, 1987, 1989) and radial basis functions (RBF) (Benaim, 1994; Freeman & Saad, 1995; Girosi, Jones, & Poggio, 1995; Girosi, 1998; Moody & Darken, 1989). It is a network of normalized radial basis functions with generalized hidden units. The hidden units of a normalized RBF network (Moody & Darken, 1989) use a normalized gaussian activation function,
G_k^I(x) = exp(−‖x − y_k‖² / 2σ²) / Σ_j exp(−‖x − y_j‖² / 2σ²),   (1.2)
with a scalar variance σ². The hidden units of the current recognizing network have a normalized multivariate gaussian activation function,

G_k^A(x) = exp(−β (x − y_k)' A (x − y_k)) / Σ_j exp(−β (x − y_j)' A (x − y_j)),   (1.3)
where A is a covariance matrix and β is the inverse of an artificial temperature controlled by the annealing process. By normalization, we mean that Σ_k G_k^A(x) = 1 for any x. G_k^I(x) is a special case of G_k^A(x), whereas the function G_k^A(x) can be translated to the form G_k^I(z) if one rewrites the term (x − y_k)' A (x − y_k) as (z − z_k)'(z − z_k), with z_k = By_k and z = Bx, where B'B = A. By this translation, the function G_k^A(x) is decomposed as the composition of z = Bx and G_k^I(z). The current recognizing network is exactly the composition of a linear transformation and a normalized RBF network. The learning process derived in this work can be directly applied to a normalized RBF network by fixing the covariance matrix as I, and it is also applicable to an MLP network based on the connection (Girosi et al., 1995; Girosi, 1998) between a normalized RBF network and an MLP network. Practical experiments (Miller & Uyar, 1998) have shown that gradient-descent-based learning algorithms, including the backpropagation algorithm for an MLP network and the learning algorithm (Moody & Darken, 1988, 1989) for an RBF network, suffer from numerous local minima in optimizing their internal representations. Based on a hybrid of the mean-field annealing and gradient-descent methods, the new learning process proposed in this work is essential for developing effective nonlinear boundaries, well optimizing the internal representation for the parameter space of a real application. The normalized multivariate gaussian activation function in equation 1.3 defines an overlapping partition of the parameter space, where the degree of overlap is modulated by the β parameter, indirectly through the annealing process. We consider the function G_k^A(x) as the projection probability of assigning an input parameter x to an internal region V_k.
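The decomposition just described can be checked numerically. The following is a minimal numpy sketch (not the paper's Matlab implementation; all names are illustrative) that computes the activations of equation 1.3 and verifies that G_k^A(x) equals the isotropic activation of equation 1.2 evaluated at z = Bx, z_k = By_k, with B'B = A and β = 1/(2σ²):

```python
import numpy as np

def g_act(x, Y, A, beta):
    """Normalized multivariate gaussian activations G^A_k(x) of eq. 1.3."""
    d = x - Y                                 # (K, dim) differences x - y_k
    q = np.einsum('ki,ij,kj->k', d, A, d)     # quadratic forms (x-y_k)' A (x-y_k)
    e = np.exp(-beta * (q - q.min()))         # shift for numerical stability
    return e / e.sum()                        # normalization: sums to one

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 2))                   # K = 4 kernels in R^2
B = rng.normal(size=(2, 2))
A = B.T @ B                                   # positive-semidefinite matrix B'B = A
x = rng.normal(size=2)
beta = 1.0

# Decomposition: G^A_k(x) equals the isotropic activation on z = Bx,
# z_k = B y_k, provided 1/(2*sigma^2) = beta.
p1 = g_act(x, Y, A, beta)
p2 = g_act(x @ B.T, Y @ B.T, np.eye(2), beta)
assert np.allclose(p1, p2)
assert np.isclose(p1.sum(), 1.0)
```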
For an input parameter x, all its projection probabilities {G_k^A(x), 1 ≤ k ≤ K} characterize different partition phases. At a sufficiently low β, they are nearly identical to 1/K, denoting a complete overlapping partition. As the β value increases, they become asymmetric, yielding some degree of overlapping partition. At a sufficiently large β, they behave winner-take-all, such that the only winner G_{k*}^A(x) is one and the others are zero, where k* = arg min_k ‖x − y_k‖_A. If each region V_k is equipped with an optimal category response ξ_k, then, based on the above projection mechanism, the mapping function of the current recognizing network is

g(x) = Σ_k G_k^A(x) ξ_k.   (1.4)
This discriminant function at a sufficiently large β is similar to the nearest prototype classifier, but with a generalized distance measure ‖·‖_A,

g(x) = ξ_{k*},  k* = arg min_k ‖x − y_k‖_A,   (1.5)

and can be further translated to a composition of

g(z) = ξ_{k*},   (1.6)
k* = arg min_k ‖z − z_k‖ and z = Bx,   (1.7)
where z_k = By_k and A = B'B. If A = I, the latter form exactly defines the nearest prototype classifier (Rao et al., 1999), whose measure is the Euclidean distance. The partition formed by {z_k} in the parameter space is nonoverlapping, and each internal region is attached with its own category response. The nearest prototype classifier is known to be suitable for the case of statistically independent components, but for most real applications this assumption is not valid, so a preprocessor for feature extraction, like the above linear transformation z = Bx, is usually additionally employed. But the development of a preprocessor, such as one using independent component analysis (Lin, Grier, & Cowan, 1997; Makeig, Jung, & Bell, 1997) or principal component analysis, is independent of the formation of a classifier; the combined discriminant function may suffer from inconsistency between the extracted features and the classifier. Alternatively, based on the Mahalanobis distance, the new learning process in this work focuses on the mapping function in equation 1.5 and directly explores the whole discriminant process of component analysis, clustering analysis, and labeling analysis. In the next section, we introduce the generative model for characterizing the parameter space and derive a mathematical framework for discriminant analysis. Four sets of interactive dynamics and the coupled Potts model are developed in section 3. Another issue in this work is incremental learning, a procedure for determining the optimal number of internal regions, or the minimal model size for maximal generalization. The incremental learning scheme is introduced in section 4.
In the final section, we test the new method in comparison with RBF (Müller et al., 1999; Rätsch, Onoda, & Müller, 2001) and support vector machine (Vapnik, 1995; Platt, 1999; Cawley, 2000) methods, using artificial examples and a real-world application to breast cancer diagnosis (Wolberg & Mangasarian, 1990; Malini Lamego, 2001), and we discuss the simulation results.
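The nearest-prototype reading of equations 1.5 to 1.7 can be illustrated with a short sketch. The following numpy fragment (illustrative names; a sketch, not the paper's implementation) classifies by arg min_k ‖x − y_k‖_A and confirms the equivalent composition of the linear transform z = Bx with a Euclidean nearest-prototype search:

```python
import numpy as np

def classify(x, Y, A, labels):
    """Nearest prototype under the generalized distance ||x - y_k||_A (eq. 1.5)."""
    d = x - Y
    q = np.einsum('ki,ij,kj->k', d, A, d)   # squared A-distances to each kernel
    return labels[np.argmin(q)]

rng = np.random.default_rng(1)
Y = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])   # prototypes y_k
labels = np.array([0, 1, 1])                          # category responses
B = np.array([[2.0, 0.5], [0.0, 1.0]])
A = B.T @ B                                           # A = B'B

x = rng.normal(size=2)
# Equivalent composition (eqs. 1.6-1.7): Euclidean nearest prototype in z = Bx.
z, Zk = B @ x, Y @ B.T
k_euclid = np.argmin(((z - Zk) ** 2).sum(axis=1))
assert classify(x, Y, A, labels) == labels[k_euclid]
```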
2 Supervised Learning for Discriminant Analysis

2.1 The Generative Model. The generative model for a parameter space R^d is composed of K piecewise multivariate normal distributions. Each is of the form

p_k(x) = exp(−(1/2) (x − y_k)' A (x − y_k)) / ((2π)^(d/2) √det(A⁻¹)),   (2.1)

centered at the vector y_k with a nonsingular covariance matrix A. The K distributions have been assumed to have the same covariance matrix. These kernels {y_k} form K internal regions {V_k} in the parameter space, which are nonoverlapping, such that ∪_k V_k = R^d and V_k1 ∩ V_k2 = ∅ for all k1 ≠ k2. For each internal region V_k, the fitness of p_k(x) to all input parameters x_i ∈ V_k is measured proportional to the following log-likelihood function:

l_k = log Π_{x_i ∈ V_k} p_k(x_i).   (2.2)
A summation of all l_k leads to the following function:

l = Σ_k l_k
  = Σ_k log Π_{x_i ∈ V_k} p_k(x_i)
  = Σ_k Σ_{x_i ∈ V_k} log p_k(x_i).   (2.3)
Recall that the assignment of each input x_i to one of the K internal regions has been represented by a membership vector δ_i, of which each element δ_ik is either one or zero and Σ_k δ_ik = 1. The function l can be rewritten as

l = Σ_i Σ_k δ_ik log p_k(x_i)
  = −(1/2) Σ_i Σ_k δ_ik (x_i − y_k)' A (x_i − y_k) − (N/2) log det(A⁻¹) − (Nd/2) log(2π),   (2.4)
where det(·) denotes the determinant of a matrix. Using the fact that det(A⁻¹) = 1/det(A) and neglecting the last constant term, we obtain the following objective:

E1 = (1/2) Σ_i Σ_k δ_ik (x_i − y_k)' A (x_i − y_k) − (N/2) log det(A).   (2.5)

Maximizing the function l is equivalent to minimizing the function E1.
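The equivalence can be confirmed numerically: with hard memberships, l and −E1 differ only by the constant −(Nd/2) log(2π). A hedged numpy sketch (illustrative names, random data):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, K = 30, 2, 3
X = rng.normal(size=(N, d))                      # input parameters x_i
Y = rng.normal(size=(K, d))                      # kernels y_k
B = rng.normal(size=(d, d))
A = B.T @ B + np.eye(d)                          # nonsingular, positive definite
delta = np.eye(K)[rng.integers(0, K, size=N)]    # hard memberships delta_ik

diff = X[:, None, :] - Y[None, :, :]             # (N, K, d) differences x_i - y_k
quad = np.einsum('nki,ij,nkj->nk', diff, A, diff)

# E1 of eq. 2.5 and the log-likelihood l of eqs. 2.3-2.4
E1 = 0.5 * (delta * quad).sum() - 0.5 * N * np.log(np.linalg.det(A))
l = (delta * (0.5 * np.log(np.linalg.det(A))
              - 0.5 * d * np.log(2 * np.pi)
              - 0.5 * quad)).sum()

# Maximizing l is minimizing E1: they differ only by a constant.
assert np.isclose(l + E1, -0.5 * N * d * np.log(2 * np.pi))
```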
2.2 A Mathematical Framework. Recall that each region V_k has been attached with a category response ξ_k ∈ {e_1^M, ..., e_M^M} for classification. The design cost in equation 1.1 can be expressed as

E2 = (1/2) Σ_i ‖q_i − Σ_k δ_ik ξ_k‖²
   = (1/2) Σ_i ‖q_i − Λδ_i‖²,   (2.6)

where the matrix Λ = [ξ_1, ..., ξ_k, ..., ξ_K]. By combining the two objectives of equations 2.5 and 2.6 and injecting all constraints, we have the following mathematical framework for discriminant analysis. Minimize

E(δ, ξ, y, A) = E1 + cE2   (2.7)
= (1/2) Σ_i Σ_k δ_ik (x_i − y_k)' A (x_i − y_k) − (N/2) log det(A) + (c/2) Σ_i ‖q_i − Λδ_i‖²,   (2.8)
subject to

δ_ik ∈ {0, 1}, for all i, k
Σ_k δ_ik = 1, for all i
ξ_km ∈ {0, 1}, for all k, m
Σ_m ξ_km = 1, for all k,   (2.9)
where δ, ξ, and y denote the collections {δ_i}, {ξ_k}, and {y_k}, respectively, and c is a weighting constant. The learning process for discriminant analysis thus searches for a set of δ, ξ, y, and A that minimizes the weighted sum of the negative log-likelihood function and the design cost, subject to the set of constraints in equation 2.9. We consider the mathematical framework in equations 2.8 and 2.9 as a mixed integer and linear programming problem, of which {y_k} and A are continuous geometrical variables and {δ_i} and {ξ_k} are discrete combinatorial variables. In the following section, we employ a hybrid of the mean-field annealing and gradient-descent methods to optimize all these variables simultaneously.

3 Interactive Dynamics and Coupled Potts Models
A hybrid of the mean-field annealing and gradient-descent methods is applied to the above mixed integer and linear programming problem. As a result,
four sets of interactive dynamics are developed for the variables A, {y_k}, {δ_i}, and {ξ_k}, respectively. These dynamics interact following a process analogous to physical annealing and perform a parallel and distributed learning process for discriminant analysis. When we relate each vector δ_i or ξ_k to a Potts neuron, the two unitary constraints in equation 2.9 are subsequently taken over by the normalization of the Potts activation function. The above mathematical framework is reduced to the minimization of the energy function E. By fixing the matrix A and {y_k}, the mean-field annealing traces the mean configurations ⟨δ⟩ and ⟨ξ⟩, emulating thermal equilibrium at each temperature. It follows that the probability of a system configuration is proportional to the Boltzmann distribution:

Pr(δ, ξ) ∝ exp(−βE(δ, ξ)).   (3.1)
Following the annealing process to a sufficiently large β value, the Boltzmann distribution is ultimately dominated by the optimal configuration,

lim_{β→∞} Pr(δ*, ξ*) = 1,

where E(δ*, ξ*) = min_{δ,ξ} E(δ, ξ).
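The role of β can be illustrated on a toy configuration set. The sketch below (illustrative, not from the paper) evaluates the Boltzmann distribution of equation 3.1 at a small and a large β, showing the transition from a near-uniform distribution to one dominated by the minimum-energy configuration:

```python
import numpy as np

def boltzmann(E, beta):
    """Pr(config) proportional to exp(-beta * E(config)) over a finite set."""
    p = np.exp(-beta * (E - E.min()))   # shift by the minimum for stability
    return p / p.sum()

E = np.array([1.0, 0.2, 0.7, 0.2001])   # energies of candidate configurations
low, high = boltzmann(E, 0.01), boltzmann(E, 1e5)

assert np.allclose(low, 0.25, atol=0.01)   # high temperature: near-uniform
assert np.argmax(high) == 1                # low temperature: optimum dominates
assert high[1] > 0.99
```

Note how the near-degenerate configuration (energy 0.2001) still loses all of its probability mass at large β, which is exactly the winner-take-all limit described above.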
The annealing process gradually increases the parameter β from a sufficiently low value to a large one. At each β value, the process iteratively executes the mean-field equations to a stationary point, which represents the mean configuration for thermal equilibrium. The obtained mean configuration at each β value is used as the initial configuration for the process at its subsequent β value. The mean-field equations can be derived from the following free energy function, which is similar to that proposed by Peterson and Söderberg (1989):

Ψ(y, A, ⟨δ⟩, ⟨ξ⟩, v, u) = E(y, A, ⟨δ⟩, ⟨ξ⟩) + Σ_i Σ_k ⟨δ_ik⟩ v_ik + Σ_k Σ_m ⟨ξ_km⟩ u_km
    − (1/β) Σ_i ln(Σ_k exp(β v_ik))   (3.2)
    − (1/β) Σ_k ln(Σ_m exp(β u_km)),   (3.3)
where ⟨δ⟩, ⟨ξ⟩, u, and v denote {⟨δ_i⟩}, {⟨ξ_k⟩}, {u_km}, and {v_ik}, respectively, and u_k and v_i are auxiliary vectors. When fixing y, A, and β, a saddle point of the
free energy satisfies the following conditions:

∂Ψ/∂⟨δ_i⟩ = 0 and ∂Ψ/∂v_i = 0, for all i,
∂Ψ/∂⟨ξ_k⟩ = 0 and ∂Ψ/∂u_k = 0, for all k.
These lead to two sets of mean-field equations in the following vector form:

v_i = −∂E(y, A, ⟨δ⟩, ⟨ξ⟩)/∂⟨δ_i⟩,   (3.4)
whose kth component is v_ik = −(1/2)(x_i − y_k)' A (x_i − y_k) + c ξ_k'(q_i − Λ⟨δ_i⟩),

⟨δ_i⟩ = [exp(βv_i1)/Σ_h exp(βv_ih), ..., exp(βv_iK)/Σ_h exp(βv_ih)]',   (3.5)

u_k = −∂E(y, A, ⟨δ⟩, ⟨ξ⟩)/∂⟨ξ_k⟩ = c Σ_i ⟨δ_ik⟩ (q_i − Λ⟨δ_i⟩),   (3.6)

⟨ξ_k⟩ = [exp(βu_k1)/Σ_m exp(βu_km), ..., exp(βu_kM)/Σ_m exp(βu_km)]'.   (3.7)
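One sweep of equations 3.4 and 3.5 amounts to computing the mean field v_i and passing it through a softmax at inverse temperature β. A minimal numpy sketch, with all names illustrative (the positive sign on the design-cost term follows from v_i = −∂E/∂⟨δ_i⟩):

```python
import numpy as np

def softmax(v, beta):
    e = np.exp(beta * (v - v.max()))          # shifted for numerical stability
    return e / e.sum()

def delta_update(x_i, q_i, Y, A, Lam, mean_delta_i, beta, c):
    """One pass of eqs. 3.4-3.5: mean field v_i, then Potts activation <delta_i>."""
    d = x_i - Y
    quad = np.einsum('ki,ij,kj->k', d, A, d)
    resid = q_i - Lam @ mean_delta_i          # q_i - Lambda <delta_i>
    v_i = -0.5 * quad + c * (Lam.T @ resid)   # v_ik of eq. 3.4
    return softmax(v_i, beta)                 # <delta_i> of eq. 3.5

rng = np.random.default_rng(3)
K, M, dim = 3, 2, 2
Y = rng.normal(size=(K, dim))
A = np.eye(dim)
Lam = np.eye(M)[rng.integers(0, M, size=K)].T   # Lambda = [xi_1 ... xi_K], (M, K)
x_i, q_i = rng.normal(size=dim), np.array([1.0, 0.0])
m = np.full(K, 1.0 / K)                         # start near 1/K, as in step 1

# Iterate toward a stationary point at fixed beta.
for _ in range(200):
    m = delta_update(x_i, q_i, Y, A, Lam, m, beta=2.0, c=0.5)

assert np.isclose(m.sum(), 1.0)   # Potts normalization replaces the constraints
assert (m >= 0).all()
```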
During the stage of evaluating the mean configuration at each β value, the matrix A and the kernels {y_k} are considered constants. Once determined, the mean configuration feeds back to the adaptation of the covariance matrix and the kernels. By applying the gradient-descent method to the free energy, we have the following updating rule for each element A_mn of the matrix A:

ΔA_mn ∝ −∂Ψ/∂A_mn = −∂E/∂A_mn
      = −(1/2) Σ_i Σ_k ⟨δ_ik⟩ (x_im − y_km)(x_in − y_kn) + (N/2) [(A')⁻¹]_mn.   (3.8)

When all ΔA_mn = 0, we have

A = (W⁻¹)',   (3.9)
where

W_mn = (1/N) Σ_i Σ_k ⟨δ_ik⟩ (x_im − y_km)(x_in − y_kn).   (3.10)
The adaptation of the kernels {y_k} is also derived by the gradient-descent method:

Δy_k ∝ −∂Ψ/∂y_k = (1/2) Σ_i ⟨δ_ik⟩ (A + A') (x_i − y_k).   (3.11)

Again, when Δy_k = 0, we have

y_k = Σ_i ⟨δ_ik⟩ x_i / Σ_i ⟨δ_ik⟩.   (3.12)
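At a stationary point, equations 3.9, 3.10, and 3.12 are ordinary weighted-average and inverse-scatter computations. A hedged numpy sketch (illustrative names; with a single kernel, y_1 reduces to the sample mean and W to the biased sample covariance):

```python
import numpy as np

def update_kernels_and_A(X, mean_delta):
    """Closed-form stationary updates: y_k by eq. 3.12, A by eqs. 3.9-3.10."""
    N = X.shape[0]
    w = mean_delta.sum(axis=0)                     # sum_i <delta_ik>
    Y = (mean_delta.T @ X) / w[:, None]            # weighted means, eq. 3.12
    diff = X[:, None, :] - Y[None, :, :]           # x_i - y_k, shape (N, K, d)
    W = np.einsum('ik,ikm,ikn->mn', mean_delta, diff, diff) / N   # eq. 3.10
    A = np.linalg.inv(W).T                         # eq. 3.9: A = (W^{-1})'
    return Y, A

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
mean_delta = rng.dirichlet(np.ones(3), size=50)    # soft memberships, rows sum to 1
Y, A = update_kernels_and_A(X, mean_delta)

# Sanity check with a single kernel: y_1 is the sample mean and W the
# (biased) sample covariance of the data.
Y1, A1 = update_kernels_and_A(X, np.ones((50, 1)))
assert np.allclose(Y1[0], X.mean(axis=0))
assert np.allclose(np.linalg.inv(A1).T, np.cov(X.T, bias=True))
```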
Now we summarize the new learning process for discriminant analysis as follows:

1. Set a sufficiently low β value, each kernel y_k near the mean of all predictors, each ⟨δ_ik⟩ near 1/K, and each ⟨ξ_km⟩ near 1/M.
2. Iteratively update all ⟨δ_ik⟩ and v_ik by equations 3.4 and 3.5, respectively, to a stationary point.
3. Iteratively update each ⟨ξ_km⟩ and u_km by equations 3.6 and 3.7, respectively, to a stationary point.
4. Update each y_k by equation 3.12.
5. Update A by equations 3.9 and 3.10.
6. If Σ_ik ⟨δ_ik⟩² and Σ_km ⟨ξ_km⟩² are larger than a prior threshold, then halt; otherwise increase β by an annealing schedule and go to step 2.

The convergence of the algorithm is well guaranteed. For steps 2 and 3, the two sets of mean-field equations define a stationary point of the free energy, equation 3.2. For steps 4 and 5, since all {⟨δ_ik⟩} and {⟨ξ_km⟩} are fixed, the change ΔΨ of the free energy, equation 3.2, due to the changes Δy_k and ΔA_mn has the nonincreasing property supported by the gradient-descent method. A mathematical treatment of the convergence property is given in the appendix. The two sets of mean-field equations, 3.4-3.5 and 3.6-3.7, constitute two interactive Potts models. The learning network for the whole learning process is shown in Figure 1.
4 Incremental Learning for Optimal Model Size
An incremental learning procedure is developed for the interactive Potts models to optimize the model size subject to the criterion of minimal design cost. The model size indicates the number of kernels. According to the learning process in the last section, when the annealing process halts with a sufficiently large β, each individual mean ⟨δ_ik⟩ or ⟨ξ_km⟩ is close to either one or zero, and the two square sums of mean activations are larger than the predetermined threshold. The variables {⟨δ_ik⟩} and {⟨ξ_km⟩} are first recorded by the temporal variables {δ_ik*} and {ξ_km*}, respectively, to establish an intermediate recognizing network with kernels {y_k} and the matrix A. Whether the model size of this intermediate recognizing network, which is the size of {y_k}, is sufficient depends on the quantity of the design cost Σ_i ‖q_i − Σ_k ξ_k* δ_ik*‖², which denotes the number of errors in predicting all training samples. Define the hit ratio r as 1 − (1/N) Σ_i ‖q_i − Σ_k ξ_k* δ_ik*‖². If the hit ratio is not acceptable, such as r < h with h = 0.98, it is conjectured that the underlying boundary structure for well discriminating the training samples overloads the partition formed by the kernels {y_k} and the covariance matrix A. The incremental learning procedure aims to improve this hit ratio r by properly increasing the model size and adapting the boundary structure. Define the local hit ratio r_k as 1 − (1/N_k) Σ_i δ_ik* ‖q_i − ξ_k*‖ for each internal region V_k = {x | k = arg min_j ‖x − y_j‖_A}, where N_k denotes the size of the set {x_i ∈ V_k}. A set of underestimated internal regions, each having an unacceptable local hit ratio, such as r_k < h, is then picked out, and their kernels are duplicated. The duplication involves a variation of the original coupled Potts model in organization. The idea of divide-and-conquer tends to invoke a subtask for each underestimated region and then deal with each subtask independently.
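The bookkeeping just described (local hit ratios and the duplication of underestimated kernels with a small perturbation) can be sketched in numpy; the hit-ratio normalization used here is illustrative rather than the paper's exact definition:

```python
import numpy as np

rng = np.random.default_rng(5)
K, M, dim, h = 3, 2, 2, 0.98

Y = rng.normal(size=(K, dim))                  # current kernels y_k
xi = np.eye(M)[rng.integers(0, M, size=K)]     # category responses xi_k, one-hot
region = rng.integers(0, K, size=40)           # winner k* for each sample
q = np.eye(M)[rng.integers(0, M, size=40)]     # desired one-hot outputs

# Local hit ratio r_k: fraction of samples in region V_k whose desired
# output matches the region's category response.
hits = (q == xi[region]).all(axis=1)
r_k = np.array([hits[region == k].mean() if (region == k).any() else 1.0
                for k in range(K)])

# Duplicate each underestimated kernel (r_k < h) with a small perturbation.
bad = np.where(r_k < h)[0]
twins = Y[bad] + 0.01 * rng.normal(size=(len(bad), dim))
Y_new = np.vstack([Y, twins])                  # model size grows to K + K*

assert Y_new.shape[0] == K + len(bad)
```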
This is not the best choice, since it loses the point of global optimization of the boundary structure for discriminant analysis. Alternatively, the incremental learning procedure makes use of the collective decisions of the interactive Potts models. The kernel y_k of an underestimated internal region, such as r_k < h, is duplicated with a small perturbation to produce its twin kernel, denoted by y_k*. A set of new kernels {y_k^new} is created as the union of {y_k} and {y_k* | r_k < h}, having the model size K + K*, where K and K*, respectively, denote the number of original kernels and that of underestimated internal regions. In the new set, let the index of y_k still be k and that of y_k* be denoted by k'. For each input parameter x_i, assuming x_i ∈ V(y_k), a new membership vector δ_i^new, now with K + K* elements, is created as follows:
δ_i^new = e_k^{K+K*}                          if r_k ≥ h
        = (1/2)(e_k^{K+K*} + e_{k'}^{K+K*})   otherwise.

In the second line of the above equation, the new vector δ_i^new means that x_i has the same probability 1/2 of being mapped to each of the twins, y_k^new and
y_{k'}^new. Create new category responses {ξ_k^new} for the new kernels {y_k^new}, and set each element of ξ_k^new near 1/M. After duplication and replacing all means with the new vectors {δ_i^new} and {ξ_k^new, 1 ≤ k ≤ K + K*} and the kernels with {y_k^new}, the system variables include {y_k, 1 ≤ k ≤ K + K*}, {⟨δ_i⟩}, and {⟨ξ_k⟩, 1 ≤ k ≤ K + K*}. Note that each ⟨δ_i⟩ now contains K + K* elements. Return to the annealing process. The β value has been increased to a sufficiently large one, where the two sums Σ(δ_ik*)² and Σ(ξ_km*)² are larger than the predetermined halting threshold, but the system variables after duplication no longer satisfy the halting condition. The β value can be further increased to continue the annealing process. We summarize the incremental learning process for the interactive Potts models as follows:

1. Set a sufficiently low β value, a threshold h, and an initial model size K. Set each kernel y_k near the mean of all predictors, each ⟨δ_ik⟩ near 1/K, and each ⟨ξ_km⟩ near 1/M.
2. Iteratively update all ⟨δ_ik⟩ and v_ik by equations 3.4 and 3.5, respectively, to a stationary point.
3. Iteratively update each ⟨ξ_km⟩ and u_km by equations 3.6 and 3.7, respectively, to a stationary point.
4. Update each y_k by equation 3.12.
5. Update A by equations 3.9 and 3.10.
6. If Σ_ik ⟨δ_ik⟩² and Σ_km ⟨ξ_km⟩² are larger than a prior threshold, such as 0.98, then go to step 7. Otherwise, increase β by an annealing schedule, and then go to step 2.
7. Record {⟨δ_i⟩}, {⟨ξ_k⟩} by the temporal variables {δ_i*}, {ξ_k*}, respectively.
8. Determine r and all r_k using A, {y_k}, {δ_i*}, {ξ_k*}.
9. If r > h, halt. Otherwise, duplicate the kernels of the K* underestimated regions with small perturbations and create the new variables {y_k^new}, {δ_i^new}, {ξ_k^new} using {y_k}, {y_k* | r_k < h}, {δ_i*}, {ξ_k*}, as described in the text.
10. K ← K + K*. Decrease β by a small constant, replace {y_k}, {⟨δ_i⟩}, and {⟨ξ_k⟩} with {y_k^new}, {δ_i^new}, and {ξ_k^new}, respectively, and go to step 2.

5 Numerical Simulations and Discussion
The incremental learning process in section 4 has been implemented in Matlab code and is referred to as PottsDA in the following.

5.1 Examples. We first test the new method (PottsDA) in comparison with the RBF and support vector machine (SVM) methods (Vapnik, 1995) using some artificial examples. In our simulations, the β parameter of the PottsDA is initialized as 1/3.8, and each annealing process increases it to a
β value of 0.98; the weighting constant is c = 4.5. For theoretical derivations of the weighting constant c and the initial β parameter, refer to Aiyer, Niranjan, and Fallside (1990) and Peterson and Söderberg (1989), respectively. The Matlab package used for the RBF (Müller et al., 1999; Rätsch et al., 2001) is provided in Rätsch et al. (2001), where the centers are initialized with k-means clustering. For the SVM (Vapnik, 1995), the Matlab package is provided in Cawley (2000), and the corresponding learning method is sequential minimal optimization (Platt, 1999). In this section, the three methods are executed five times for each example. Their average error rates for training and testing are reported.

The input parameters in the first example are generated by a linear mixture, x(t) = Hs(t), where s(t) = [s1(t) s2(t)]' denotes time-varying samples from two independent uniform distributions within [−0.5, 0.5], and

H = |  0.4384  −0.8988 |
    | −0.8493   0.5279 |

is a randomly generated mixing matrix. The desired output of each input parameter is determined by the rule q(t) = sign(s1(t)) * sign(s2(t)). We use the same process to generate 1600 samples and split them into two equal sets, one for training and the other for testing. In Figure 2, the position of each input parameter in the training set is marked with a black or gray dot, which respectively denote two distinct output labels. Since the input parameter in this example is a linear mixture of independent sources, a discriminant rule depending on only the input parameter, such as sign(x1(t)) * sign(x2(t)), does not describe an optimal prediction for the correct output labels. The primary challenge to the learning process is the recovery of the original independent sources, such that an effective discriminant rule can be encoded by a minimal set of kernels to achieve maximal generalization.

Our simulations show that the PottsDA method outperforms the RBF and the SVM methods for this example. The PottsDA has an initial model of two kernels and halts with an optimal hit ratio of r = 100%. As shown in Table 1, the error rates of the PottsDA for both training and testing are zero. This result is carried out by a discriminant network composed of a covariance matrix, four kernels, and their category responses. In Figure 2, the position of each of the four kernels is marked with a circle or cross symbol representing the category denoted by black or gray dots, respectively. By the relation B'B = A, one can obtain a demixing matrix

B = | 12.0370   5.8743 |
    |  5.8743  10.2847 |

from the covariance matrix A. The two columns of the inverse B⁻¹ are shown by the two lines in Figure 2, which exactly coincide in direction with the mixing structure for this example. The obtained covariance matrix provides
Figure 2: The training patterns of the first example and the result of the learning process, including the two columns of the inverse of the demixing matrix, the four kernels, and their category responses.

Table 1: Performance of the Three Methods, First Example.

          RBF(4)   RBF(8)   RBF(12)   RBF(24)   SVM     PottsDA(4)
Training  14.1%    12.0%    8.6%      3.9%      13.2%   0%
Testing   13.0%    12.1%    8.3%      4.5%      14.3%   0%

Note: Numbers in parentheses refer to number of kernels.
a suitable distance measure between input parameters and kernels, such that the four kernels faithfully partition the parameter space into four internal regions, and the resulting discriminant network successfully classifies samples in both the training set and the testing set. In contrast, since the RBF network is based on the Euclidean distance, its kernels result in nonfaithful representations of the input parameters. To illustrate this point, the RBF method was tested separately with 4, 8, 12, and 24 kernels. Our simulations show that the average error rate of the RBF method with 24 kernels is 3.9% for training and 4.5% for testing, and that of the SVM method is 13.2% for training and 14.3% for testing. The average execution time of the PottsDA for this example is 13.4 seconds.

In the second example, the input parameter contains three elements. Each of the input parameters, x(t) = [x1(t) x2(t) x3(t)]', is a result of the linear
mixture, x(t) = Hs(t), of three independent sources, s(t) = [s1(t) s2(t) s3(t)]', where H is a randomly generated mixing matrix with entries as follows:

H = | 0.9288  0.2803  0.3770 |
    | 0.3122  0.9366  0.2572 |
    | 0.1994  0.2098  0.8897 |

The first two sources, s1(t) and s2(t), are of uniform distributions within [−0.5, 0.5], and s3(t) is gaussian noise of N(0, √2). The discriminant rule is the same as in the first example, sign(s1(t)) * sign(s2(t)), treating the third source as noise for prediction. To retrieve this discriminant rule from the mixture with the minimal model size, the learning process has to deal with interference caused by the mixing structure and the noise source. As in the first example, both the training set and the testing set contain 800 samples, each generated by the same linear mixture. For the testing set, all samples of the three independent sources are shown in the first three rows of Figure 3, and the three mixed signals, x1(t), x2(t), and x3(t), are shown in the next three rows. The seventh row in Figure 3 shows the desired output of each sample.

Figure 3: The time sequence of the three independent sources, the three input parameters, the desired output, and the predicted output of the second example.

Table 2: Performance of the Three Methods, Second Example.

          RBF(4)   RBF(8)   RBF(12)   RBF(24)   SVM    PottsDA(4)
Training  45.3%    31.2%    22.6%     10.9%     3.2%   0.2%
Test      44%      31.1%    24.6%     13.9%     5.9%   0%

Note: Numbers in parentheses refer to number of kernels.

In all of five executions, the PottsDA derives a discriminant network composed of a distance measure, four kernels, and their category responses, by which the resulting output for the 800 testing samples, as shown in the eighth row, exactly coincides with the desired output. To facilitate presentation, according to the combination of the signs of s1(t) and s2(t), we have sorted the 800 testing samples in Figure 3 into four segments, such that each segment contains the same category response. As shown in Table 2, the PottsDA method is better than the RBF and the SVM methods in handling input parameters of linear mixtures with noise. The average execution time of the PottsDA for this example is 30.07 seconds.

The third example tests the three methods on the spiral data shown in Figure 4, where the two distinct categories are denoted by stars and dots, respectively. In this example, both the training set and the testing set contain 40 spiral-distributed interleaved clusters, and each cluster contains 20 input parameters. The primary challenge to a learning method is to find the centers of the 40 clusters. For this example, the PottsDA method has an initial model of 10 kernels. The quantity Σ⟨δ_ik⟩² measured at step 7 in the PottsDA learning process along the updating iterations is shown in Figure 5, which also displays the change of the model size. The final model size has been further reduced to 40 by considering possible combinations of any two neighboring kernels. The obtained 40 kernels in one of five executions, with their category responses denoted by circle or cross symbols, are plotted in Figure 4. The two lines in Figure 4 denote the two columns of the inverse of the obtained demixing matrix in direction. The average training and testing error rates of the three methods are shown in Table 3. Obviously, the PottsDA method also outperforms the RBF and the SVM methods for this example.

5.2 Breast Cancer Data.
Figure 4: The training patterns of the third example and the result of the learning process, including the two columns of the inverse of the demixing matrix, the 40 kernels, and their category responses.

We use the Wisconsin Breast Cancer Database (as of July 1992) to test the PottsDA method on actual applications. This database contains 699 instances, each containing 9 features for predicting one of the benign and malignant categories. There are 458 instances in the benign category and 241 instances in the malignant category. The input parameters are clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses, each represented by integers ranging from 1 to 10. The original work (Wolberg & Mangasarian, 1990) applied the multisurface method to a 369-case subset of the database, resulting in testing error rates of more than 6% (Malini Lamego, 2001). Recently, Malini Lamego (2001) used the neural network with algebraic loops to deal with the first 683 instances of the database. In his experiment, the last 200 of the 683 instances form the testing set, and the others form the learning set; the resulting error rates are 2.3% for learning and 4.5% for testing. It has been claimed (Malini Lamego, 2001) that this classification task is more difficult than the previous one (Wolberg & Mangasarian, 1990) and that the result is better than those of all previous works for this database. For comparison, we use the same training set and testing set as in Malini Lamego (2001) to evaluate the performance of the PottsDA method. The PottsDA method obtains a discriminant network with 42 kernels and has error rates of 1.4% for training and 1% for testing, as shown in Table 4. Only two instances in the testing set are incorrectly classified by the discriminant network. If the testing set also includes the last 19 instances in the database, one additional instance is missed by the same discriminant network, and the testing error rate is 1.39%. For the 219-case test set, the RBF method with 80 kernels and the SVM method result in testing error rates of 4.17% and 4.63%, respectively. The PottsDA method is significantly better than the other two methods for the Wisconsin Breast Cancer Database.
Figure 5: The convergence of the learning network for the third example and the change of the model size (annotated in the plot at K = 10, 20, 35, and 44). The vertical coordinate denotes the ratio of the square sum of the mean activations of membership vectors to the number of training patterns. The horizontal coordinate is the time index, each denoting a change of the beta value.
Table 3: Performance of the Three Methods, Third Example.

          RBF(40)  RBF(50)  RBF(60)  RBF(80)  SVM     PottsDA(40)
Training  14.6%    10.4%    7.8%     3.3%     45.5%   0.8%
Test      15.7%    12.3%    9.5%     4.1%     45.6%   0.4%

Note: Numbers in parentheses refer to number of kernels.
Table 4: Performance of the PottsDA Method and the Neural Network with Algebraic Loops for the 683-Case Subset of the Wisconsin Breast Cancer Database.

                 PottsDA(42)   Neural Net with Algebraic Loops
Training (483)   1.4%          2.3%
Testing (200)    1%            4.5%
Jiann-Ming Wu
5.3 Discussion. The major improvements of the PottsDA method over the other methods are illustrated from four perspectives: the flexibility of the discriminant network, the effective collective decisions of the annealed recurrent learning network, the advantages of the generative model, and the capability of the incremental learning process.
5.3.1 Discriminant Network. The discriminant network of the PottsDA method, composed of the normalized multivariate gaussian activation functions $\{G_k^A\}$ of equation 1.3, is a general version of a normalized RBF network. For an extremely large $\beta$, the discriminant function, equation 1.5, becomes a piecewise function composed of a set of local functions, each defined within an internal region of a faithful nonoverlapping partition of the parameter space based on the Mahalanobis distance $\|\cdot\|_A$. This discriminant function is indeed a composition of a linear transformation and the nearest-prototype classifier, as in equation 1.6, and it possesses more flexibility for a desired mapping function. Consider the first two artificial examples, where the training parameters are the results of linear mixtures of independent sources, and their targets are encoded with source instances instead of mixtures. With an adaptive distance measure $A$, the PottsDA method succeeds in locating four kernels $\{\mathbf{y}_k\}$ for the optimal discriminant function. In contrast, because it uses the Euclidean distance and lacks any mechanism to recover independent instances from mixtures, the RBF method produces nonfaithful internal representations: a relatively large number of local functions based on a Voronoi partition. As shown in Tables 2 and 3, the testing error rate of the RBF method with 24 kernels is higher than that of the PottsDA method with four kernels. This reflects the weakness of the discriminant function of a normalized RBF network, which is a special case of the PottsDA method.

5.3.2 Annealed Recurrent Learning Network. The annealed recurrent learning network of the PottsDA method, containing four sets of interactive dynamics, is effective for optimizing the highly coupled parameters of the discriminant network under an annealing process.
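The discriminant function of section 5.3.1 can be sketched as follows: normalized gaussian activations under a shared Mahalanobis metric $A$, which approach a nearest-prototype classifier for large $\beta$. This is a minimal sketch; the kernel positions, labels, and metric below are illustrative assumptions, not the paper's trained values.

```python
import numpy as np

def mahalanobis_sq(x, kernels, A):
    """Squared Mahalanobis distance ||x - y_k||_A^2 to every kernel y_k."""
    d = x - kernels                          # (K, dim) differences
    return np.einsum('ki,ij,kj->k', d, A, d)

def normalized_gaussian_activations(x, kernels, A, beta):
    """Softmax over negative beta-scaled distances; large beta -> winner-take-all."""
    dist = mahalanobis_sq(x, kernels, A)
    e = np.exp(-beta * (dist - dist.min()))  # subtract min for numerical stability
    return e / e.sum()

def discriminant(x, kernels, labels, A, beta):
    """Activation-weighted sum of kernel labels; at large beta this reduces to
    the nearest-prototype classifier under the metric A."""
    return normalized_gaussian_activations(x, kernels, A, beta) @ labels

# Toy usage: two kernels per class in 2D, identity metric (assumed values).
kernels = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
labels = np.array([0., 0., 1., 1.])
A = np.eye(2)
print(discriminant(np.array([0.1, 0.1]), kernels, labels, A, beta=50.0))  # ~0 (class 0)
```

With an adaptive (non-identity) $A$, the same code realizes the metric learning described above: distances are measured in the recovered source coordinates rather than the raw mixture coordinates.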
The supervised learning process is formulated in the mathematical framework of equation 2.7, a mixed integer and linear programming problem, and a hybrid of mean-field annealing and gradient descent is employed to derive the linear and nonlinear interactive dynamics. The primary advantage of the annealed recurrent learning network over a cascaded learning process, or a purely gradient-descent-based learning process, is the effect of collective decisions over four sets of continuous and discrete variables and the capability of escaping the numerous local minima of the energy function to approach a global minimum. Consider the second artificial example, where the input parameters result from linear mixtures and one independent source acts as noise with respect to the discriminant rule. Collective decisions realized by the annealed recurrent learning network succeed in treating component analysis, clustering
analysis, and labeling determination as a whole process and can thus achieve an optimal discriminant function for this problem. Numerical simulations show that the power of the discriminant network is strongly supported by the annealed recurrent learning network. Although the RBF method employs the k-means algorithm to set up its initial kernels, it cannot handle well the interference caused by linear mixtures and noise, owing to the limitations of its discriminant function and learning process. Further evaluations show that the testing error rate of the RBF method with 100 kernels is still above 10% for this example.

5.3.3 Generative Model. Both the discriminant network and the annealed recurrent learning network are rooted in the generative model of multiple disjoint multivariate gaussian distributions. The overall distribution corresponding to the generative model is general enough to characterize an arbitrary parameter space, and the associated parameter estimation gains potential advantages from the maximum likelihood principle, which provides a solid theoretical foundation for the mathematical framework. To realize natural discriminant analysis, the PottsDA method fits a generative model to all predictors, using the annealed recurrent learning network to adapt the kernels and the covariance matrix subject to the interpolating conditions posed by paired training samples, and using the discriminant network to classify instances. A simplified version of the generative model is simulated with a unified covariance matrix. For the Wisconsin Breast Cancer Database, its performance is encouraging. The obtained parameters, including the 42 kernels and the covariance matrix, could provide feedback for understanding relations among the components of the predictors.

5.3.4 Incremental Learning Process. The incremental learning process is capable of determining the minimal model size subject to the interpolating conditions.
Consider the third artificial example, composed of 40 interleaving clusters in two different classes. Without interference caused by linear mixtures and noise, this problem tests the capability for clustering analysis. For this example, the RBF method behaves better than the SVM method, but it still takes the RBF method 80 kernels to produce a testing error rate of 4.1%. The incremental learning process of the PottsDA method is more effective: it obtains a discriminant network of 40 kernels with testing error rates near zero.

6 Conclusions
We have proposed a new learning process for discriminant analysis based on four sets of interactive dynamics, and its encouraging performance has been shown by numerical simulations on artificial and real examples. To develop the interactive dynamics, we have proposed multiple disjoint multivariate gaussian distributions to serve as a generative model for
the parameter space. By combining the maximization of the log-likelihood functions, the fitness of the generative model to all input parameters, and the minimization of the design cost, we obtain a mathematical framework for discriminant analysis. By relating the discrete variables to Potts neural variables, we can further apply a hybrid of mean-field annealing and gradient-descent methods to the optimization of the mixed integer and linear programming problem and obtain the four sets of interactive dynamics that constitute the annealed recurrent learning network for discriminant analysis. The new learning process is a parallel and distributed process, and its evolution is well controlled by an annealing process in analogy with physical annealing. An effective incremental learning procedure is also developed for optimizing the model size. The adaptive covariance matrix of the discriminant network plays a central role in retrieving the unknown mixing structure within the input parameters and extracting output-dependent features for discriminant analysis. The new learning process is effective for developing faithful internal representations of the input parameters and constructing the essential boundary structures for classification.

Appendix
That steps 2–5 in the learning process converge can be proved. Rewrite the mean-field equations in the context as the following continuous form:

$$\frac{du_{ik}}{dt} = -\frac{\partial \psi}{\partial \langle \delta_{ik} \rangle} = -\frac{\partial E(\mathbf{y}, A, \langle\delta\rangle, \langle\xi\rangle)}{\partial \langle \delta_{ik} \rangle},$$

$$\langle \delta_i \rangle = \left( \frac{\exp(\beta u_{i1})}{\sum_l \exp(\beta u_{il})} \; \cdots \; \frac{\exp(\beta u_{iK})}{\sum_l \exp(\beta u_{il})} \right)' = \sum_k \mathbf{e}_k \frac{\exp(\beta u_{ik})}{\sum_l \exp(\beta u_{il})}$$

and

$$\frac{dv_{km}}{dt} = -\frac{\partial \psi}{\partial \langle \xi_{km} \rangle} = -\frac{\partial E(\mathbf{y}, A, \langle\delta\rangle, \langle\xi\rangle)}{\partial \langle \xi_{km} \rangle},$$

$$\langle \xi_k \rangle = \left( \frac{\exp(\beta v_{k1})}{\sum_l \exp(\beta v_{kl})} \; \cdots \; \frac{\exp(\beta v_{kM})}{\sum_l \exp(\beta v_{kl})} \right)' = \sum_h \mathbf{e}_h \frac{\exp(\beta v_{kh})}{\sum_l \exp(\beta v_{kl})},$$

where the vector $\mathbf{e}_k$ is a standard unit vector whose kth element is one. Then rewrite the updating rules as the following dynamics:

$$\frac{dA_{mn}}{dt} = -\eta_1 \frac{\partial \psi}{\partial A_{mn}} \equiv -\eta_1 \frac{\partial E(\mathbf{y}, A, \langle\delta\rangle, \langle\xi\rangle)}{\partial A_{mn}}$$

and

$$\frac{d\mathbf{y}_k}{dt} = -\eta_2 \frac{\partial \psi}{\partial \mathbf{y}_k} \equiv -\eta_2 \frac{\partial E(\mathbf{y}, A, \langle\delta\rangle, \langle\xi\rangle)}{\partial \mathbf{y}_k}.$$

Then the convergence of the free energy $\psi$ along the trace of these four sets of dynamics can be shown:

$$\frac{d\psi}{dt} = \sum_i \left(\frac{\partial \psi}{\partial \langle \delta_i \rangle}\right)' \frac{d\langle \delta_i \rangle}{dt} + \sum_k \left(\frac{\partial \psi}{\partial \langle \xi_k \rangle}\right)' \frac{d\langle \xi_k \rangle}{dt} + \sum_{mn} \frac{\partial \psi}{\partial A_{mn}} \frac{dA_{mn}}{dt} + \sum_k \left(\frac{\partial \psi}{\partial \mathbf{y}_k}\right)' \frac{d\mathbf{y}_k}{dt}$$

$$= -\sum_i \left(\frac{d\mathbf{u}_i}{dt}\right)' C_1 \left(\frac{d\mathbf{u}_i}{dt}\right) - \sum_k \left(\frac{d\mathbf{v}_k}{dt}\right)' C_2 \left(\frac{d\mathbf{v}_k}{dt}\right) - \eta_1 \sum_{mn} \left(\frac{dA_{mn}}{dt}\right)^2 - \eta_2 \sum_k \left(\frac{d\mathbf{y}_k}{dt}\right)' \left(\frac{d\mathbf{y}_k}{dt}\right) \le 0,$$

where $C_1$ is the Hessian of $\ln z(\mathbf{u}_k, \beta)$,

$$C_1 = \frac{\sum_{[\mathbf{s}_l]} \exp(\beta \langle \delta_k \rangle' \mathbf{s}_l)\,[\mathbf{s}_l - \langle \delta_k \rangle][\mathbf{s}_l - \langle \delta_k \rangle]'}{\sum_{[\mathbf{s}_l]} \exp(\beta \langle \delta_k \rangle' \mathbf{s}_l)},$$

and $[\mathbf{s}_l]$ runs over $\{\mathbf{e}_1, \ldots, \mathbf{e}_K\}$. Since $C_1$ is positive definite,

$$\left(\frac{d\mathbf{u}_k}{dt}\right)' C_1 \left(\frac{d\mathbf{u}_k}{dt}\right) > 0.$$

For the same reason,

$$\left(\frac{d\mathbf{v}_m}{dt}\right)' C_2 \left(\frac{d\mathbf{v}_m}{dt}\right) > 0.$$

Hence $\frac{d\psi}{dt} \le 0$ is shown.
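The descent argument can be illustrated numerically with a toy energy. The linear cost $E(\langle\delta\rangle) = \mathbf{c}'\langle\delta\rangle$ below is a hypothetical stand-in for the paper's full cost $E(\mathbf{y}, A, \langle\delta\rangle, \langle\xi\rangle)$: the mean-field variable $\langle\delta\rangle$ is the softmax of $\mathbf{u}$, $\mathbf{u}$ follows $d\mathbf{u}/dt = -\partial E/\partial\langle\delta\rangle$, and the energy is non-increasing along the trajectory.

```python
import numpy as np

# Toy numerical illustration of the appendix's descent argument (assumed
# linear energy E(<d>) = c'<d>, not the paper's full cost function).

def softmax(u, beta):
    e = np.exp(beta * (u - u.max()))   # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
c = rng.normal(size=5)                 # toy costs (illustrative values)
u = np.zeros(5)
beta, dt = 2.0, 0.1

energies = []
for _ in range(200):
    energies.append(softmax(u, beta) @ c)  # E at the current mean-field state
    u = u - dt * c                         # Euler step of du/dt = -dE/d<d> = -c

# The energy never increases and approaches min(c) as <d> concentrates on
# the lowest-cost unit vector, mirroring the d(psi)/dt <= 0 result above.
print(energies[0], energies[-1])
```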
References

Aiyer, S. V. B., Niranjan, M., & Fallside, F. (1990). A theoretical investigation into the performance of the Hopfield model. IEEE Trans. Neural Networks, 1, 204–215.
Benaim, M. (1994). On functional approximation with normalized gaussian units. Neural Computation, 6, 319–333.
Cawley, G. C. (2000). MATLAB Support Vector Machine Toolbox. Available online at: http://theoval.sys.uea.ac.uk/svm/toolbox.
Freeman, J. A. S., & Saad, D. (1995). Learning and generalization in radial basis function networks. Neural Computation, 7, 1000–1020.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10, 1455–1480.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7, 219–269.
Hastie, T., Buja, A., & Tibshirani, R. (1995). Penalized discriminant analysis. Ann. Stat., 23, 73–102.
Hastie, T., & Simard, P. Y. (1998). Metrics and models for handwritten character recognition. Stat. Sci., 13, 54–65.
Hastie, T., & Tibshirani, R. (1996). Discriminant analysis by gaussian mixtures. J. Roy. Stat. Soc. B Met., 58, 155–176.
Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible discriminant analysis by optimal scoring. J. Am. Stat. Assoc., 89, 1255–1270.
Lin, J. K., Grier, D. G., & Cowan, J. D. (1997). Faithful representation of separable distributions. Neural Computation, 9, 1305–1320.
Liou, C. Y., & Wu, J.-M. (1996). Self-organization using Potts models. Neural Networks, 9, 671–684.
Makeig, S., Jung, T. P., & Bell, A. J. (1997). Blind separation of auditory event-related brain responses into independent components. P. Natl. Acad. Sci. USA, 94, 10979–10984.
Malini Lamego, M. (2001). Adaptive structures with algebraic loops. IEEE Trans. Neural Networks, 12, 33–42.
Miller, D. J., & Uyar, H. S. (1998). Combined learning and use for a mixture model equivalent to the RBF classifier. Neural Computation, 10, 281–293.
Moody, J., & Darken, C. (1988). Learning with localized receptive fields. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 133–143). San Mateo, CA: Morgan Kaufmann.
Moody, J., & Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Müller, K.-R., Smola, A. J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1999). Using support vector machines for time series prediction. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 243–254). Cambridge, MA: MIT Press.
Peterson, C., & Söderberg, B. (1989). A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst., 1, 3–22.
Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59, 2229–2232.
Pineda, F. J. (1989). Recurrent back-propagation and the dynamical approach to adaptive neural computation. Neural Computation, 1, 161–172.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Rao, A. V., Miller, D. J., Rose, K., & Gersho, A. (1999). A deterministic annealing approach for parsimonious design of piecewise regression models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21, 159–173.
Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287–320.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett., 65, 945–948.
Rose, K., Gurewitz, E., & Fox, G. C. (1993). Constrained clustering as an optimization method. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15, 785–794.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel and distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87, 9193–9196.

Received September 8, 2000; accepted June 18, 2001.
Communicated by Peter Földiák
ARTICLE
Slow Feature Analysis: Unsupervised Learning of Invariances

Laurenz Wiskott
[email protected] Computational Neurobiology Laboratory, Salk Institute for Biological Studies, San Diego, CA 92168, U.S.A.; Institute for Advanced Studies, D-14193, Berlin, Germany; and Innovationskolleg Theoretische Biologie, Institute for Biology, Humboldt-University Berlin, D-10115 Berlin, Germany
Terrence J. Sejnowski
[email protected] Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California at San Diego, La Jolla, CA 92037, U.S.A.
Invariant features of temporally varying signals are useful for analysis and classification. Slow feature analysis (SFA) is a new method for learning invariant or slowly varying features from a vectorial input signal. It is based on a nonlinear expansion of the input signal and application of principal component analysis to this expanded signal and its time derivative. It is guaranteed to find the optimal solution within a family of functions directly and can learn to extract a large number of decorrelated features, which are ordered by their degree of invariance. SFA can be applied hierarchically to process high-dimensional input signals and extract complex features. SFA is applied first to complex cell tuning properties based on simple cell output, including disparity and motion. Then more complicated input-output functions are learned by repeated application of SFA. Finally, a hierarchical network of SFA modules is presented as a simple model of the visual system. The same unstructured network can learn translation, size, rotation, contrast, or, to a lesser degree, illumination invariance for one-dimensional objects, depending only on the training stimulus. Surprisingly, only a few training objects suffice to achieve good generalization to new objects. The generated representation is suitable for object recognition. Performance degrades if the network is trained to learn multiple invariances simultaneously.

Neural Computation 14, 715–770 (2002)

1 Introduction

Generating invariant representations is one of the major problems in object recognition. Some neural network systems have invariance properties
© 2002 Massachusetts Institute of Technology
built into the architecture. In both the neocognitron (Fukushima, Miyake, & Ito, 1983) and the weight-sharing backpropagation network (LeCun et al., 1989), for instance, translation invariance is implemented by replicating a common but plastic synaptic weight pattern at all shifted locations and then spatially subsampling the output signal, possibly blurred over a certain region. Other systems employ a matching dynamics to map representations onto each other in an invariant way (Olshausen, Anderson, & Van Essen, 1993; Konen, Maurer, & von der Malsburg, 1994). These two types of systems are trained on or applied to static images, and the invariances, such as translation or size invariance, need to be known in advance by the designer of the system (dynamic link matching is less strict in specifying the invariance in advance; Konen et al., 1994). The approach described in this article belongs to a third class of systems, based on learning invariances from temporal input sequences (Földiák, 1991; Mitchison, 1991; Becker, 1993; Stone, 1996). The assumption is that primary sensory signals, which in general code for local properties, vary quickly while the perceived environment changes slowly. If one succeeds in extracting slow features from the quickly varying sensory signal, one is likely to obtain an invariant representation of the environment. The general idea is illustrated in Figure 1. Assume three different objects in the shape of striped letters move straight through the visual field with different directions and speeds, for example, first an S, then an F, and then an A. On a high level, this stimulus can be represented by three variables changing over time. The first one indicates object identity, that is, which of the letters is currently visible, assuming that only one object is visible at a time. This is the what-information. The second and third variables indicate the vertical and horizontal location of the object, respectively. This is the where-information.
This representation would be particularly convenient because it is compact and the important aspects of the stimulus are directly accessible. The primary sensory input—the activities of the photoreceptors in this example—is distributed over many sensors, which respond only to simple localized features of the objects, such as local gray values, dots, or edges. Since the sensors respond to localized features, their activities will change quickly, even if the stimulus moves slowly. Consider, for example, the response of the first photoreceptor to the F. Because of the stripes, as the F moves across the receptor by just two stripe widths, the activity of the receptor rapidly changes from low to high and back again. This primary sensory signal is a low-level representation and contains the relevant information, such as object identity and location, only implicitly. However, if receptors cover the whole visual field, the visual stimulus is mirrored by the primary sensory signal, and presumably there exists an input-output function that can extract the relevant information and compute a high-level representation like the one described above from this low-level representation.
[Figure 1 diagram: a stimulus of letters moving across the visual field; photoreceptor activities x1(t), x2(t), x3(t) over time; and high-level traces of object identity (S, F, A) and 2D object location (top/bottom, left/right) over time.]
Figure 1: Relation between slowly varying stimulus and quickly varying sensor activities. (Top) Three different objects, the letters S, F, and A, move straight through the visual field one by one. The responses of three photoreceptors to this stimulus at different locations are recorded and correspond to the gray-value profiles of the letters along the dotted lines. (Bottom left) Activities of the three (out of many) photoreceptors over time. The receptors respond vigorously when an object moves through their localized receptive field and are quiet otherwise. High values indicate white; low values indicate black. (Bottom right) These three graphs show a high-level representation of the stimulus in terms of object identity and object location over time. See the text for more explanation.
What can serve as a general objective to guide unsupervised learning to find such an input-output function? One main difference between the high-level representation and the primary sensory signal is the timescale on which they vary. Thus, a slowly varying representation can be considered to be of a higher abstraction level than a quickly varying one. It is important to note here that the input-output function computes the output signal instantaneously, only on the basis of the current input. Slow variation of the output signal can therefore not be achieved by temporal low-pass filtering but must be based on extracting aspects of the input signal that are inherently slow and useful for a high-level representation. The vast majority of possible input-output functions would generate quickly varying output signals, and only a very small fraction will generate slowly varying output signals. The task is to find some of these rare functions. The primary sensory signal in Figure 1 is quickly varying compared to the high-level representation, even though the components of the primary sensory signal have extensive quiet periods due to the blank background of the stimulus. The sensor of x1, for instance, responds only to the F, because its receptive field is at the bottom right of the visual field. However, this illustrative example assumes an artificial stimulus, and under more natural conditions, the difference in temporal variation between the primary sensory signals and the high-level representation should be even more pronounced. The graphs for object identity and object location in Figure 1 have gaps between the linear sections representing the objects. These gaps need to be filled in somehow. If that is done, for example, with some constant value, and the graphs are considered as a whole, then both object identity and location vary on similar timescales, at least in comparison to the much faster varying primary sensory signals.
This means that object location and object identity can be learned based on the objective of slow variation, which sheds new light on the problem of learning invariances. The common notion is that object location changes quickly while object identity changes slowly, or rarely, and that the recognition system has to learn to represent only object identity and ignore object location. However, another interpretation of this situation is that object translation induces quick changes in the primary sensory signal, in comparison to which object identity and object location vary slowly. Both of these aspects can be extracted as slow features from the primary sensory signal. While it is conceptually convenient to call object identity the what-information of the stimulus and object location the where-information, they are of a similar nature in the sense that they may vary on similar timescales compared to the primary sensory signal. In the next two sections, a formal definition of the learning problem is given, and a new algorithm to solve it is presented. The subsequent sections describe several sample applications. The first shows how complex cell behavior found in visual cortex can be inferred from simple cell outputs. This is extended to include disparity and direction of motion in the second example. The third example illustrates that more complicated input-output
functions can be approximated by applying the learning algorithm repeatedly. The fourth and fifth examples show how a hierarchical network can learn translation and other invariances. In the final discussion, the algorithm is compared with previous approaches to learning invariances.

2 The Learning Problem

The first step is to give a mathematical definition of the learning of invariances. Given a vectorial input signal $\mathbf{x}(t)$, the objective is to find an input-output function $\mathbf{g}(\mathbf{x})$ such that the output signal $\mathbf{y}(t) := \mathbf{g}(\mathbf{x}(t))$ varies as slowly as possible while still conveying some information about the input, to avoid the trivial constant response. Strict invariances are not the goal, but rather approximate ones that change slowly. This can be formalized as follows:

Learning problem. Given an I-dimensional input signal $\mathbf{x}(t) = [x_1(t), \ldots, x_I(t)]^T$ with $t \in [t_0, t_1]$ indicating time and $[\ldots]^T$ indicating the transpose of $[\ldots]$, find an input-output function $\mathbf{g}(\mathbf{x}) = [g_1(\mathbf{x}), \ldots, g_J(\mathbf{x})]^T$ generating the J-dimensional output signal $\mathbf{y}(t) = [y_1(t), \ldots, y_J(t)]^T$ with $y_j(t) := g_j(\mathbf{x}(t))$ such that for each $j \in \{1, \ldots, J\}$

$$\Delta_j := \Delta(y_j) := \langle \dot{y}_j^2 \rangle \quad \text{is minimal} \tag{2.1}$$

under the constraints

$$\langle y_j \rangle = 0 \quad \text{(zero mean)}, \tag{2.2}$$
$$\langle y_j^2 \rangle = 1 \quad \text{(unit variance)}, \tag{2.3}$$
$$\forall j' < j: \ \langle y_{j'} y_j \rangle = 0 \quad \text{(decorrelation)}, \tag{2.4}$$
where the angle brackets indicate temporal averaging, that is,

$$\langle f \rangle := \frac{1}{t_1 - t_0} \int_{t_0}^{t_1} f(t)\, dt.$$

Equation 2.1 expresses the primary objective of minimizing the temporal variation of the output signal. Constraints 2.2 and 2.3 help avoid the trivial solution $y_j(t) = \text{const}$. Constraint 2.4 guarantees that different output signal components carry different information and do not simply reproduce each other. It also induces an order, so that $y_1(t)$ is the optimal output signal component, while $y_2(t)$ is a less optimal one, since it obeys the additional constraint $\langle y_1 y_2 \rangle = 0$. Thus, $\Delta(y_{j'}) \le \Delta(y_j)$ if $j' < j$. The zero-mean constraint, 2.2, was added for convenience only. It could be dropped, in which case constraint 2.3 should be replaced by $\langle (y_j - \langle y_j \rangle)^2 \rangle = 1$ to avoid the trivial solutions $y_1(t) = \pm 1$. One can also drop the unit-variance constraint, 2.3, and integrate it into the objective, which would
then be to minimize $\langle \dot{y}_j^2 \rangle / \langle (y_j - \langle y_j \rangle)^2 \rangle$. This is the formulation used in Becker and Hinton (1992). However, integrating the two constraints (equations 2.2 and 2.3) into the objective function leaves the solution undetermined by an arbitrary offset and scaling factor for $y_j$. The explicit constraints make the solution less arbitrary. This learning problem is an optimization problem of variational calculus and in general is difficult to solve. However, for the case that the input-output function components $g_j$ are constrained to be a linear combination of a finite set of nonlinear functions, the problem simplifies significantly. An algorithm for solving the optimization problem under this constraint is given in the following section.

3 Slow Feature Analysis

Given an I-dimensional input signal $\mathbf{x}(t) = [x_1(t), \ldots, x_I(t)]^T$, consider an input-output function $\mathbf{g}(\mathbf{x}) = [g_1(\mathbf{x}), \ldots, g_J(\mathbf{x})]^T$, each component of which is a weighted sum over a set of K nonlinear functions $h_k(\mathbf{x})$: $g_j(\mathbf{x}) := \sum_{k=1}^{K} w_{jk} h_k(\mathbf{x})$. Usually $K > \max(I, J)$. Applying $\mathbf{h} = [h_1, \ldots, h_K]^T$ to the input signal yields the nonlinearly expanded signal $\mathbf{z}(t) := \mathbf{h}(\mathbf{x}(t))$. After this nonlinear expansion, the problem can be treated as linear in the expanded signal components $z_k(t)$. This is a common technique for turning a nonlinear problem into a linear one. A well-known example is the support vector machine (Vapnik, 1995). The weight vectors $\mathbf{w}_j = [w_{j1}, \ldots, w_{jK}]^T$ are subject to learning, and the jth output signal component is given by $y_j(t) = g_j(\mathbf{x}(t)) = \mathbf{w}_j^T \mathbf{h}(\mathbf{x}(t)) = \mathbf{w}_j^T \mathbf{z}(t)$. The objective (see equation 2.1) is to optimize the input-output function, and thus the weights, such that

$$\Delta(y_j) = \langle \dot{y}_j^2 \rangle = \mathbf{w}_j^T \langle \dot{\mathbf{z}} \dot{\mathbf{z}}^T \rangle \mathbf{w}_j \tag{3.1}$$

is minimal. Assume the nonlinear functions $h_k$ are chosen such that the expanded signal $\mathbf{z}(t)$ has zero mean and a unit covariance matrix. Such a set $h_k$ of nonlinear functions can easily be derived from an arbitrary set $h'_k$ by a sphering stage, as will be explained below. Then we find that the constraints (see equations 2.2–2.4)

$$\langle y_j \rangle = \mathbf{w}_j^T \underbrace{\langle \mathbf{z} \rangle}_{= \mathbf{0}} = 0, \tag{3.2}$$
$$\langle y_j^2 \rangle = \mathbf{w}_j^T \underbrace{\langle \mathbf{z}\mathbf{z}^T \rangle}_{= \mathbf{I}} \mathbf{w}_j = \mathbf{w}_j^T \mathbf{w}_j = 1, \tag{3.3}$$
$$\forall j' < j: \quad \langle y_{j'} y_j \rangle = \mathbf{w}_{j'}^T \underbrace{\langle \mathbf{z}\mathbf{z}^T \rangle}_{= \mathbf{I}} \mathbf{w}_j = \mathbf{w}_{j'}^T \mathbf{w}_j = 0, \tag{3.4}$$
are automatically fulfilled if and only if we constrain the weight vectors to be an orthonormal set of vectors. Thus, for the first component of the input-output function, the optimization problem reduces to finding the normed weight vector that minimizes $\Delta(y_1)$ of equation 3.1. The solution is the normed eigenvector of the matrix $\langle \dot{\mathbf{z}} \dot{\mathbf{z}}^T \rangle$ that corresponds to the smallest eigenvalue (cf. Mitchison, 1991). The eigenvectors of the next higher eigenvalues produce the next components of the input-output function with the next higher $\Delta$ values. This leads to an algorithm for solving the optimization problem stated above.

It is useful to make a clear distinction among raw signals, exactly normalized signals derived from training data, and approximately normalized signals derived from test data. Let $\tilde{\mathbf{x}}(t)$ be a raw input signal that can have any mean and variance. For computational convenience and display purposes, the signals are normalized to zero mean and unit variance. This normalization is exact for the training data $\mathbf{x}(t)$. Correcting test data by the same offset and factor will in general yield an input signal $\mathbf{x}'(t)$ that is only approximately normalized, since each data sample has a slightly different mean and variance, while the normalization is always done with the offset and factor determined from the training data. In the following, raw signals have a tilde, and test data have a dash; symbols with neither a tilde nor a dash usually (but not always) refer to normalized training data. The algorithm now has the following form (see also Figure 3):

1. Input signal. For training, an I-dimensional input signal is given by $\tilde{\mathbf{x}}(t)$.

2. Input signal normalization. Normalize the input signal to obtain

$$\mathbf{x}(t) := [x_1(t), \ldots, x_I(t)]^T \tag{3.5}$$
$$\text{with} \quad x_i(t) := \frac{\tilde{x}_i(t) - \langle \tilde{x}_i \rangle}{\sqrt{\langle (\tilde{x}_i - \langle \tilde{x}_i \rangle)^2 \rangle}}, \tag{3.6}$$
$$\text{so that} \quad \langle x_i \rangle = 0 \tag{3.7}$$
$$\text{and} \quad \langle x_i^2 \rangle = 1. \tag{3.8}$$
3. Nonlinear expansion. Apply a set of nonlinear functions $\tilde{\mathbf{h}}(\mathbf{x})$ to generate an expanded signal $\tilde{\mathbf{z}}(t)$. Here all monomials of degree one (resulting in linear SFA, sometimes denoted SFA1) or of degree one and two, including mixed terms such as $x_1 x_2$ (resulting in quadratic SFA, sometimes denoted SFA2), are used, but any other set of functions could be used as well. Thus, for quadratic SFA,

$$\tilde{\mathbf{h}}(\mathbf{x}) := [x_1, \ldots, x_I, x_1 x_1, x_1 x_2, \ldots, x_I x_I]^T, \tag{3.9}$$
$$\tilde{\mathbf{z}}(t) := \tilde{\mathbf{h}}(\mathbf{x}(t)) = [x_1(t), \ldots, x_I(t), x_1(t)x_1(t), x_1(t)x_2(t), \ldots, x_I(t)x_I(t)]^T. \tag{3.10}$$

Using first- and second-degree monomials, $\tilde{\mathbf{h}}(\mathbf{x})$ and $\tilde{\mathbf{z}}(t)$ are of dimensionality $K = I + I(I+1)/2$.
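The quadratic expansion of step 3 can be sketched in a few lines. This is a minimal sketch assuming the signal is stored as a NumPy array of shape (T, I), with one time sample per row; it is not the authors' implementation.

```python
import numpy as np

def quadratic_expand(x):
    """All monomials of degree one and two (mixed terms counted once):
    x of shape (T, I) -> expanded signal of shape (T, I + I*(I+1)/2)."""
    T, I = x.shape
    quad = [x[:, i] * x[:, j] for i in range(I) for j in range(i, I)]
    return np.column_stack([x] + quad)

x = np.random.default_rng(1).normal(size=(100, 3))
z = quadratic_expand(x)
print(z.shape)  # (100, 9), since K = 3 + 3*4/2 = 9
```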
[Figure 2 panels: (a) input signal x(t); (b) expanded signal z̃(t); (c) sphered expanded signal z(t); (d) time derivative signal ż(t) with the w1 axis marked; (e) output signal y(t); (f) input-output function g(x1, x2).]
4. Sphering. Normalize the expanded signal $\tilde{\mathbf{z}}(t)$ by an affine transformation to generate $\mathbf{z}(t)$ with zero mean and identity covariance matrix $\mathbf{I}$,

$$\mathbf{z}(t) := \mathbf{S}\,(\tilde{\mathbf{z}}(t) - \langle \tilde{\mathbf{z}} \rangle), \tag{3.11}$$
$$\text{with} \quad \langle \mathbf{z} \rangle = \mathbf{0} \tag{3.12}$$
$$\text{and} \quad \langle \mathbf{z}\mathbf{z}^T \rangle = \mathbf{I}. \tag{3.13}$$

This normalization is called sphering (or whitening). The matrix $\mathbf{S}$ is the sphering matrix and can be determined with the help of principal component analysis (PCA) on the matrix $(\tilde{\mathbf{z}}(t) - \langle \tilde{\mathbf{z}} \rangle)$. It therefore depends on the specific training data set. This also defines

$$\mathbf{h}(\mathbf{x}) := \mathbf{S}\,(\tilde{\mathbf{h}}(\mathbf{x}) - \langle \tilde{\mathbf{z}} \rangle), \tag{3.14}$$

which is a normalized function, while $\mathbf{z}(t)$ is the sphered data.

5. Principal component analysis. Apply PCA to the matrix $\langle \dot{\mathbf{z}} \dot{\mathbf{z}}^T \rangle$. The J eigenvectors with the lowest eigenvalues $\lambda_j$ yield the normalized weight vectors $\mathbf{w}_j$:

$$\langle \dot{\mathbf{z}} \dot{\mathbf{z}}^T \rangle \mathbf{w}_j = \lambda_j \mathbf{w}_j \tag{3.15}$$
$$\text{with} \quad \lambda_1 \le \lambda_2 \le \cdots \le \lambda_J, \tag{3.16}$$

which provide the input-output function

$$\mathbf{g}(\mathbf{x}) := [g_1(\mathbf{x}), \ldots, g_J(\mathbf{x})]^T \tag{3.17}$$
$$\text{with} \quad g_j(\mathbf{x}) := \mathbf{w}_j^T \mathbf{h}(\mathbf{x}) \tag{3.18}$$
Figure 2: Facing page. Illustration of the learning algorithm by means of a simplified example. (a) The input signal $\tilde{\mathbf{x}}(t)$ is given by $\tilde{x}_1(t) := \sin(t) + \cos^2(11t)$, $\tilde{x}_2(t) := \cos(11t)$, $t \in [0, 2\pi]$, where $\sin(t)$ constitutes the slow feature signal. Shown is the normalized input signal $\mathbf{x}(t)$. (b) The expanded signal $\tilde{\mathbf{z}}(t)$ is defined as $\tilde{z}_1(t) := x_1(t)$, $\tilde{z}_2(t) := x_2(t)$, and $\tilde{z}_3(t) := x_2^2(t)$. $x_1^2(t)$ and $x_1(t)x_2(t)$ are left out for easier display. (c) The sphered signal $\mathbf{z}(t)$ has zero mean and unit covariance matrix. Its orientation in space is algorithmically determined by the principal axes of $\tilde{\mathbf{z}}(t)$ but otherwise arbitrary. (d) The time derivative signal $\dot{\mathbf{z}}(t)$. The direction of minimal variance determines the weight vector $\mathbf{w}_1$. This is the direction in which the sphered signal $\mathbf{z}(t)$ varies most slowly. The axes of next higher variance determine the weight vectors $\mathbf{w}_2$ and $\mathbf{w}_3$, shown as dashed lines. (e) Projecting the sphered signal $\mathbf{z}(t)$ onto the $\mathbf{w}_1$-axis yields the first output signal component $y_1(t)$, which is the slow feature signal $\sin(t)$. (f) The first component $g_1(x_1, x_2)$ of the input-output function derived by steps a to e is shown as a contour plot.
L. Wiskott and T. Sejnowski
The output signal is then

y(t) := g(x(t))  (3.19)

with

⟨y⟩ = 0,  (3.20)

⟨y yᵀ⟩ = I,  (3.21)

and

Δ(y_j) = ⟨ẏ_j²⟩ = λ_j.  (3.22)

The components of the output signal have exactly zero mean and unit variance and are uncorrelated.

6. Repetition. If required, use the output signal y(t) (or the first few components of it, or a combination of different output signals) as an input signal x(t) for the next application of the learning algorithm. Continue with step 3.

7. Test. In order to test the system on a test signal, apply the normalization and input-output function derived in steps 2 to 6 to a new input signal x̃'(t). Notice that this test signal needs to be normalized with the same offsets and factors as the training signal to reproduce the learned input-output relation accurately. Thus, the test signal is normalized only approximately to yield

x'(t) := [x'1(t), ..., x'_I(t)]ᵀ  (3.23)

with

x'_i(t) := (x̃'_i(t) − ⟨x̃_i⟩) / √⟨(x̃_i − ⟨x̃_i⟩)²⟩,  (3.24)

so that

⟨x'_i⟩ ≈ 0  (3.25)

and

⟨x'_i²⟩ ≈ 1.  (3.26)

The normalization is accurate only to the extent that the test signal is representative of the training signal. The same is true for the output signal

y'(t) := g(x'(t))  (3.27)

with

⟨y'⟩ ≈ 0  (3.28)

and

⟨y' y'ᵀ⟩ ≈ I.  (3.29)

For practical reasons, singular value decomposition is used in steps 4 and 5 instead of PCA. Singular value decomposition is preferable for analyzing degenerate data in which some eigenvalues are very close to zero; these are then discarded in step 4. The nonlinear expansion sometimes leads to degenerate data, since it produces a highly redundant representation in which some components may have a linear relationship. In general, signal components with eigenvalues close to zero typically contain noise, such as rounding errors, which after normalization is very quickly fluctuating and
Slow Feature Analysis
Figure 3: Two possible network structures for performing SFA. (Top) Interpretation as a group of units with complex computation on the dendritic trees (thick lines), such as sigma-pi units. (Bottom) Interpretation as a layered network of simple units with fixed nonlinear units in the hidden layer, such as radial basis function networks with nonadaptable hidden units. In both cases, the input-output function components are given by g_j(x) = w_jᵀ h(x) = w̃_{j0} + w̃_jᵀ h̃(x) with appropriate raw weight vectors w̃_j. The input signal components are assumed to be normalized here.
would not be selected by SFA in step 5 in any case. Thus, the decision as to which small components should be discarded is not critical. It is useful to measure the invariance of signals not by the value of Δ directly but by a measure that has a more intuitive interpretation. A good
measure may be an index η defined by

η(y) := (T / 2π) √Δ(y)  (3.30)

for t ∈ [t0, t0 + T]. For a pure sine wave y(t) := √2 sin(n 2π t / T) with an integer number of oscillations n, the index η(y) is just the number of oscillations, that is, η(y) = n.¹ Thus, the index η of an arbitrary signal indicates what the number of oscillations would be for a pure sine wave of the same Δ value, at least for integer values of η. Low η values indicate slow signals. Since output signals derived from test data are only approximately normalized, η(y') is meant to include an exact normalization of y' to zero mean and unit variance, to make the η index independent of an accidental scaling factor.

3.1 Neural Implementation. The SFA algorithm is formulated for a training input signal x(t) of definite length. During learning, it processes this signal as a single batch in one shot and does not work incrementally, as online learning rules do. However, SFA is a computational model for neural processing in biological systems and can be related to two standard network architectures (see Figure 3). The nonlinear basis functions h̃_k(x) can, for instance, be considered as synaptic clusters on the dendritic tree, which locally perform a fixed nonlinear transformation on the input data and can be weighted independently of other synaptic clusters performing other, but also fixed, nonlinear transformations on the input signal (see Figure 3, top). Sigma-pi units (Rumelhart, Hinton, & McClelland, 1986), for instance, are of this type. In another interpretation, the nonlinear basis functions could be realized by fixed nonlinear units in a hidden layer. These then provide weighted inputs to linear output units, which can be trained (see Figure 3, bottom). The radial basis function network with nonadaptable basis functions is an example of this interpretation (cf. Becker & Hinton, 1995; see also Bishop, 1995, for an introduction to radial basis function networks).
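Equation 3.30 can be checked numerically; the following is an illustrative sketch (not from the paper) verifying that the η index of a pure sine wave with n oscillations comes out as n.

```python
import numpy as np

# Check that eta(y) = (T / 2*pi) * sqrt(Delta(y)) counts oscillations:
# for y(t) = sqrt(2) * sin(n * 2*pi * t / T), eta(y) should equal n.
T, n, N = 1.0, 5, 100_000
t = np.linspace(0.0, T, N, endpoint=False)
y = np.sqrt(2.0) * np.sin(n * 2 * np.pi * t / T)
dy = np.gradient(y, t)                      # approximate time derivative
delta = np.mean(dy ** 2)                    # Delta(y) = <ydot^2>
eta = (T / (2 * np.pi)) * np.sqrt(delta)    # close to n = 5
```

The signal also has zero mean and unit variance over the integer number of oscillations, as stated in footnote 1.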
¹ For symmetry reasons and since (√2 sin(x))² + (√2 cos(x))² = 2, it is evident that ⟨y⟩ = 0 and ⟨y²⟩ = 1 if y(t) := √2 sin(n 2π t / T) is averaged over t ∈ [t0, t0 + T], that is, over an integer number of oscillations n. Setting t0 = 0 without loss of generality, we find

Δ(y) = ⟨ẏ²⟩ = (1/T) ∫₀ᵀ (n² 4π² / T²) 2 cos²(n 2π t / T) dt = (n² 4π² / T²) (1 / (n 2π)) ∫₀^(n 2π) 2 cos²(t') dt' = n² 4π² / T²

and

η(y) = (T / 2π) √Δ(y) = (T / 2π) √(n² 4π² / T²) = n.

Depending on the type of nonlinear functions used for h̃_k(x), the first or the second interpretation is more appropriate. Lateral connections between the output
units g_j, either only from lower to higher units, as shown in the figure, or between all units, are needed to decorrelate the output signal components by some variant of anti-Hebbian learning (see Becker & Plumbley, 1996, for an overview). Each of these two networks forms a functional unit performing SFA. In the following, we will refer to such a unit as an SFA module, modeled by the algorithm described above.

4 Examples

The properties of the learning algorithm are now illustrated by several examples. The first example is about learning the response behavior of complex cells based on simple cell responses (simple and complex cells are two types of cells in primary visual cortex). The second one is similar but also includes estimation of disparity and motion. One application of slow feature analysis is sufficient for these two examples. The third example is more abstract and requires a more complicated input-output function, which can be approximated by three SFAs in succession. This leads to the fourth example, which shows a hierarchical network of SFA modules learning translation invariance. This is generalized to other invariances in example 5. Each example illustrates a different aspect of SFA; all but the third example also refer to specific learning problems in the visual system and present possible solutions on a computational level based on SFA, although these examples do not claim to be biologically plausible in any detail. All simulations were done with Mathematica.

4.1 Examples 1 and 2: Complex Cells, Disparity, and Motion Estimation. The first two examples are closely related and use subsets of the same data. Consider five monocular simple cells for the left and right eyes with receptive fields as indicated in Figure 4.
The simple cells are modeled by spatial Gabor wavelets (Jones & Palmer, 1987), whose responses x̃(t) to a visual stimulus smoothly moving across the receptive field are given by a combination of nonnegative amplitude ã(t) and phase φ̃(t), both varying in time: x̃(t) := ã(t) sin(φ̃(t)). The output signals of the five simple cells shown in Figure 4 are modeled by

x̃1(t) := (4 + a0(t)) sin(t + 4φ0(t)),  (4.1)
x̃2(t) := (4 + a1(t)) sin(t + 4φ1(t)),  (4.2)
x̃3(t) := (4 + a1(t)) sin(t + 4φ1(t) + π/4),  (4.3)
x̃4(t) := (4 + a1(t)) sin(t + 4φ1(t) + π/2 + 0.5 φ_D(t)),  (4.4)
x̃5(t) := (4 + a1(t)) sin(t + 4φ1(t) + 3π/4 + 0.5 φ_D(t)),  (4.5)
Figure 4 (Examples 1 and 2): (Top) Receptive fields of five simple cells for the left and right eye, which provide the input signal x̃(t). (Bottom) Selected normalized input signal components x_i(t) plotted versus time and versus each other. All graphs range from −4 to +4; time axes range from 0 to 4π.
with t ∈ [0, 4π]. All signals have a length of 512 data points, resulting in a step size of Δt = 4π/512 between successive data points. The amplitude and phase modulation signals are low-pass filtered gaussian white noise normalized to zero mean and unit variance. The width σ of the gaussian low-pass filters is 30 data points for φ_D(t) and 10 for the other four signals: a0(t), a1(t), φ0(t), and φ1(t). Since the amplitude signals a0(t) and a1(t) have positive and negative values, an offset of 4 was added to shift these signals to a positive range, as required for amplitudes. The linear ramp t within the sine ensures that all phases occur equally often; otherwise, a certain phase would be overrepresented, because the phase signals φ0(t) and φ1(t) are concentrated around zero. The factor 4 in front of the phase signals ensures that phase changes more quickly than amplitude. The raw signals x̃1(t), ..., x̃5(t) are
normalized to zero mean and unit variance, yielding x1(t), ..., x5(t). Five additional signal components are obtained by a time delay of one data point: x6(t) := x1(t − Δt), ..., x10(t) := x5(t − Δt). Some of these 10 signals are shown in Figure 4 (bottom), together with some trajectory plots of one component versus another. The η value of each signal, as given by equation 3.30, is shown in the upper right corner of the signal plots. Since the first simple cell has a different orientation and location from the others, it also has a different amplitude and phase modulation. This makes it independent, as can be seen from the trajectory plot x2(t) versus x1(t), which does not show any particular structure. The first simple cell serves only as a distractor, which needs to be ignored by SFA. The second and third simple cells have the same location, orientation, and frequency. They therefore have the same amplitude and phase modulation. However, the positive and negative subregions of the receptive fields are slightly shifted relative to each other. This results in a constant phase difference of 45° (π/4) between these two simple cells, which is reflected in the trajectory plot x3(t) versus x2(t) by the elliptic shape (see Figure 4). Notice that a phase difference of 90° (π/2) would be computationally more convenient but is not necessary here. If the two simple cells had a phase difference of 90°, like sine- and cosine-Gabor wavelets, the trajectory would describe circles, and the square of the desired complex cell response a1(t) would be just the square sum x2²(t) + x3²(t). Since the phase difference is 45°, the transformation needs to be slightly more complex to represent the elliptic shape of the x3(t)-x2(t) trajectory. The fourth and fifth simple cells have the same relationship as the second and third ones do, but for the right eye instead of the left eye.
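The construction of the modulation signals and simple cell responses just described can be sketched as follows. This is an illustrative reimplementation; the helper name `modulation` and the convolution-based gaussian filter are my own choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 512
t = np.arange(N) * (4 * np.pi / N)       # t in [0, 4*pi), step 4*pi/512

def modulation(sigma):
    """Low-pass filtered gaussian white noise, zero mean and unit variance."""
    k = np.arange(-4 * sigma, 4 * sigma + 1)
    kernel = np.exp(-0.5 * (k / sigma) ** 2)
    s = np.convolve(rng.standard_normal(N + len(k) - 1), kernel, mode='valid')
    return (s - s.mean()) / s.std()

a1, phi1 = modulation(10), modulation(10)      # amplitude and phase, sigma = 10
phiD = modulation(30)                          # disparity phase, sigma = 30
x2_raw = (4 + a1) * np.sin(t + 4 * phi1)                           # eq. 4.2
x3_raw = (4 + a1) * np.sin(t + 4 * phi1 + np.pi / 4)               # eq. 4.3
x4_raw = (4 + a1) * np.sin(t + 4 * phi1 + np.pi / 2 + 0.5 * phiD)  # eq. 4.4
x2 = (x2_raw - x2_raw.mean()) / x2_raw.std()   # normalized input component
```

The remaining components (eqs. 4.1 and 4.5) and the time-delayed copies follow the same pattern.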
If the two eyes received identical input, the third and fifth simple cells would also have this relationship: the same amplitude and phase modulation with a constant phase difference. However, disparity induces a shift of the right image versus the left image, which results in an additional, slowly varying phase difference φ_D(t). This leads to the slowly varying shape of the ellipses in the trajectory plot x5(t) versus x3(t), varying back and forth between slim left-oblique ellipses over circles to slim right-oblique ellipses. This phase difference between simple cell responses of different eyes can be used to estimate disparity (Theimer & Mallot, 1994). Since x10(t) is a time-delayed version of x5(t), the vertical distance from the diagonal in the trajectory plot x10(t) versus x5(t) in Figure 4 is related to the time derivative of x5(t). Just as the phase difference between corresponding simple cells of different eyes can be used to estimate disparity, phase changes over time can be used to estimate the direction and speed of motion of the visual stimulus (Fleet & Jepson, 1990). In the first example, x1(t), x2(t), and x3(t) serve as an input signal. Only the common amplitude modulation a1(t) of x2(t) and x3(t) represents a slow feature and can be extracted from this signal. Example 2 uses all five normalized simple cell responses, x1(t), ..., x5(t), plus the time-delayed versions
of them, x6(t) := x1(t − Δt), ..., x10(t) := x5(t − Δt). Several different slow features, such as motion and disparity, can be extracted from this richer signal.

Example 1. Consider the normalized simple cell responses x1(t), x2(t), and x3(t) as an input signal for SFA. Figure 5 (top) shows the input signal trajectories already seen in Figure 4, cross sections through the first three components of the learned input-output function g(x) (arguments not varied are not listed and set to zero; e.g., g1(x2, x3) means g1(0, x2, x3)), and the first three components of the extracted output signal y(t). The first component of the input-output function represents the elliptic shape of the x3(t)-x2(t) trajectory correctly and ignores the first input signal component x1(t). It therefore extracts the amplitude modulation a1(t) (actually the square of it) as desired and is insensitive to the phase of the stimulus (cf. Figure 5, bottom right). The correlation² r between a1(t) and −y1(t) is 0.99. The other input-output function and output signal components are not related to slow features and can be ignored. This becomes clear from the η values of the output signal components (see Figure 5, bottom left). Only η(y1) is significantly lower than those of the input signal. Phase cannot be extracted, because it is a cyclic variable. A reasonable representation of a cyclic variable would be its sine and cosine values, but this would be almost the original signal and is therefore not a slow feature. When trained and tested on signals of length 16π (2048 data points), the correlation is 0.981 ± 0.004 between a1(t) and ±y1(t) (training data) and 0.93 ± 0.04 between a'1(t) and ±y'1(t) (test data) (means over 10 runs ± standard deviation).
The input signal is now given by all normalized simple cell responses described above and the time-delayed versions of them: x1 ( t) , . . . , x10 (t ). Since the input is 10-dimensional, the output signal of a second-degree polynomial SFA can be potentially 65-dimensional. However, singular value decomposition detects two dimensions with zero variance—the nonlinearly expanded signal has two redundant dimensions, so that only 63 dimensions remain. The g values of the 63 output signal components are shown in Figure 6 (top left). Interestingly, they grow almost linearly with index value j. Between 10 and 16 components vary signicantly more slowly than the input signal. This means that in contrast to the previous example, there are now several slow features being extracted from the input signal. These include elementary features, such as amplitude (or complex cell response) a1 (t) and disparity w D ( t), but also more complex 2
A correlation coefcient r between two signals a (t) and b(t) is dened as r(a, b) :D p
h(a ¡ hai)(b ¡ hbi)i/ h(a ¡ hai)2 ih (b ¡ hbi)2 i. Since the signs of output signal components are arbitrary, they will be corrected if required to obtain positive correlations, which is particularly important for averaging over several runs of an experiment.
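The correlation coefficient of footnote 2, together with the sign correction used before averaging over runs, can be written out directly (an illustrative sketch; the names are mine):

```python
import numpy as np

def corr(a, b):
    """Correlation coefficient r(a, b) as defined in footnote 2."""
    a, b = a - a.mean(), b - b.mean()
    return (a * b).mean() / np.sqrt((a ** 2).mean() * (b ** 2).mean())

t = np.linspace(0, 2 * np.pi, 500, endpoint=False)
a = np.sin(t)
b = -np.sin(t) + 0.1 * np.cos(5 * t)   # sign-flipped, slightly perturbed copy
r = corr(a, b)                         # strongly negative
r_corrected = abs(r)                   # sign correction before averaging runs
```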
Figure 5 (Example 1): Learning to extract the complex cell response a1(t) from the normalized simple cell responses x1(t), x2(t), and x3(t). (Top) First three components of the learned input-output function g(x), which transforms the input signal x(t) into the output signal y(t). All signal components have unit variance, and all graphs range from −4 to +4, including the gray-value scale of the contour plots. The thick contour lines indicate the value 0, and the thin ones indicate ±2 and ±4; white is positive. Time axes range from 0 to 4π. (Bottom left) η values of the input and the output signal components. Only y1 is a relevant invariant. η(a1) is shown at i, j = 0 for comparison. (Bottom right) −y1(t) correlates well with the amplitude a1(t).
ones, such as the phase change φ̇1(t) (which is closely related to the velocity of stimulus movement), the square of the disparity φ_D²(t), or even the product φ_D(t) ȧ1(t) and other nonlinear combinations of the phase and amplitude signals that were used to generate the input signal. If so many slow features are extracted, the decorrelation constraint is not sufficient to isolate the slow features into separate components. For instance, although the motion signal φ̇1(t) is mainly represented by −y8(t) with a correlation of 0.66, part of it is also represented by −y5(t) and −y6(t), with correlations of 0.34 and 0.33, respectively. This is inconvenient but in many cases not critical. For example, for a repeated application of SFA, such as in the following examples, the distribution of slow features over output signal components is irrelevant as long as they are concentrated in the first components. We can therefore ask how well the slow features are encoded by, for instance, the first 14 output components. Since the output signals obtained from training data form an orthonormal system, the optimal linear combination to represent a normalized slow signal s(t) is given by

Ỹ14[s](t) := Σ_{j=1}^{14} ⟨s y_j⟩ y_j(t),  (4.6)

where Ỹ_l indicates a projection operator onto the first l output signal components. Normalization to zero mean and unit variance yields the normalized projected signal Y14[s](t). The correlation of φ̇1(t) with Y14[φ̇1](t) is 0.94. Figure 6 and Table 1 give an overview of some slow feature signals and their respective correlations r with their projected signals. Since the test signals have to be processed without explicit knowledge of the slow feature signal s' to be extracted, the linear combinations for test signals are computed with the coefficients determined on the training signal, and we write Ỹ'14[s](t) := Σ_{j=1}^{14} ⟨s y_j⟩ y'_j(t). Normalization yields Y'14[s](t).

4.2 Example 3: Repeated SFA. The first two examples were particularly easy because a second-degree polynomial was sufficient to recover the slow features well. The third example is more complex. First, generate two random time series, one slowly varying, x_s(t), and one fast varying, x_f(t) (gaussian white noise, low-pass filtered with σ = 10 data points for x_s and σ = 3 data points for x_f). Both signals have a length of 512 data points, zero mean, and unit variance. They are then mixed to provide the raw input signal x̃(t) := [x_f, sin(2 x_f) + 0.5 x_s]ᵀ, which is normalized to provide x(t). The task is to extract the slowly varying signal x_s(t). The input-output function required to extract the slow feature x_s cannot be well approximated by a polynomial of degree two. One might therefore use polynomials of third or higher degree. However, one can also repeat the learning algorithm, applying it with second-degree polynomials, leading
Figure 6 (Example 2): Learning to extract several slow features from the normalized simple cell responses x1, x2, ..., x10. (Top left) η values of the input and the output signal components. The first 14 output components were assumed to carry relevant information about slow features. (Top right, bottom left, and bottom right) The first 14 output components contain several slow features: elementary ones, such as the disparity φ_D and the phase variation φ̇1 (which is closely related to motion velocity), and combinations of these, such as the product of disparity and amplitude change φ_D ȧ1. Shown are the slow feature signals, the optimal linear combinations of output signal components, and the correlation trajectories of the two.
Table 1 (Example 2): η Values and Correlations r for Various Slow Feature Signals s Under Different Experimental Conditions.

s       | η(s')     | r(s', Y'10[s]) | j* ≤ 14        | r(s', ±y'_j*) | r(s', Y'14[s]) | r(s, Y14[s])
        | testing   | SFA1, testing  | SFA2, training | SFA2, testing | SFA2, testing  | SFA2, training
φ_D     | 22 ± 2.0  | −.03 ± .07     | 1.0 ± 0.0      | .87 ± .02     | .88 ± .01      | .90 ± .01
φ_D²    | 32 ± 3.0  | −.03 ± .07     | 2.1 ± 0.3      | .80 ± .07     | .89 ± .03      | .91 ± .02
φ̇_D     | 37 ± 1.0  | −.02 ± .04     | 2.8 ± 0.4      | .78 ± .11     | .89 ± .01      | .90 ± .01
a1      | 66 ± 4.0  | −.00 ± .08     | 4.0 ± 0.0      | .96 ± .01     | .98 ± .00      | .99 ± .00
φ1      | 66 ± 2.0  | .01 ± .05      | 6.0 ± 4.2      | .04 ± .02     | .02 ± .04      | .22 ± .06
φ̇1      | 113 ± 2.0 | −.01 ± .04     | 7.9 ± 1.2      | .53 ± .12     | .85 ± .02      | .87 ± .02
φ_D ȧ1  | 114 ± 5.0 | −.00 ± .06     | 8.7 ± 2.1      | .58 ± .12     | .89 ± .03      | .91 ± .03

Note: SFA1 and SFA2 indicate linear and quadratic SFA, respectively. j* is the index of the output signal component y_j* among the first 14 that is best correlated with s on the training data. Y_l[s] is the optimal linear combination of the first l output signal components to represent s (see equation 4.6). All figures are means over 10 simulation runs, with the standard deviation given after the ± sign. Signal length was always 4096. We were particularly interested in the signals φ_D (disparity), a1 (complex cell response), and φ̇1 (indicating stimulus velocity), but several other signals get extracted as well, some of which are shown in the table. Notice that linear SFA with 10 input signal components can generate only 10 output signal components and is obviously not sufficient to extract any of the considered feature signals. For some feature signals, such as φ_D and a1, it is sufficient to use only the single best correlated output signal component. Others, such as φ̇1, are more distributed over several output signal components. Phase φ1 is given as an example of a feature signal that cannot be extracted, since it is a cyclic variable.
to input-output functions of degree two, four, eight, sixteen, and so on. To avoid an explosion of signal dimensionality, only the first few components of the output signal of one SFA are used as an input for the next SFA. In this example, only three components were propagated from one SFA to the next. This cuts down the computational cost of this iterative scheme significantly compared to the direct scheme of using polynomials of higher degree, at least for high-dimensional input signals, for which polynomials of higher degree become so numerous that they would be computationally prohibitive. Figure 7 shows the input signal, input-output functions, and output signals for three SFAs in succession (only the first two components are shown). The plotted input-output functions always include the transformations computed by previous SFAs. The approximation of g(x) to the sinusoidal shape of the trajectories becomes better with each additional SFA; compare, for instance, g2,2 with g3,1. The η values shown at the bottom left indicate that only the third SFA extracts a slow feature; the first and second SFA did not yield an η value lower than that of the input signal. For the first, second, and third SFA, x_s(t) was best correlated with the third, second (with inverted sign), and first output signal component, with correlation coefficients of 0.59, 0.61, and 0.97, respectively. The respective trajectory plots are shown at the bottom right of Figure 7. Notice that each SFA was trained in an unsupervised manner and without backpropagating error signals. Results for longer signals and multiple simulation runs are given in Table 2. The algorithm can extract not only slowly varying features but also rarely varying ones. To demonstrate this, we tested the system with a binary signal that occasionally switches between two possible values.
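The slowly varying case of this example can be sketched end to end. This is an illustrative reimplementation with my own helper names (the paper's simulations used Mathematica), and the threshold for discarding degenerate dimensions is a guess:

```python
import numpy as np

rng = np.random.default_rng(1)

def lowpass_noise(n, sigma):
    """Gaussian white noise, low-pass filtered, zero mean, unit variance."""
    k = np.arange(-4 * sigma, 4 * sigma + 1)
    g = np.exp(-0.5 * (k / sigma) ** 2)
    s = np.convolve(rng.standard_normal(n + len(k) - 1), g, mode='valid')
    return (s - s.mean()) / s.std()

def quadratic_sfa(x, n_out):
    """One pass of second-degree polynomial SFA (steps 2-5)."""
    x = (x - x.mean(0)) / x.std(0)
    T, I = x.shape
    z = np.concatenate([x] + [x[:, i:i + 1] * x[:, i:] for i in range(I)], axis=1)
    z = z - z.mean(0)
    U, s, Vt = np.linalg.svd(z, full_matrices=False)   # sphering via SVD
    z = np.sqrt(T) * U[:, s > 1e-7 * s[0]]
    dz = np.diff(z, axis=0)
    lam, W = np.linalg.eigh(dz.T @ dz / len(dz))       # ascending eigenvalues
    return z @ W[:, :n_out]

xs, xf = lowpass_noise(512, 10), lowpass_noise(512, 3)
x = np.column_stack([xf, np.sin(2 * xf) + 0.5 * xs])   # raw input signal
y = x
for _ in range(3):            # three SFAs in succession: degree 2, 4, 8
    y = quadratic_sfa(y, 3)   # propagate only three components
# best correlation of xs with the three final output components
r = max(abs(np.corrcoef(xs, y[:, j])[0, 1]) for j in range(3))
```

By construction, the three output components of each pass are decorrelated with unit variance, and the correlation with x_s typically improves from pass to pass, as reported in Table 2.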
To generate a rarely changing feature signal x_r(t), we applied the sign function to low-pass filtered gaussian white noise and normalized the signal to zero mean and unit variance. The width σ of the gaussian low-pass filter was chosen to be 150 to make the η values of x_r(t) similar to those of x_s(t) in Table 2. Figure 8 shows the rarely changing signal x_r(t) and the single best correlated output signal components of the three SFAs in succession. For the first, second, and third SFA, x_r(t) was best correlated with the third, third, and first output signal component, with corresponding correlation coefficients of 0.43, 0.69, and 0.97. This is similar to the results obtained with the slowly varying feature signal x_s(t). Table 2 shows results on training and test data for slowly and rarely varying feature signals of length 4096. They confirm that there is no significant difference between slowly and rarely varying signals, given a comparable η value. In both cases, even two quadratic SFAs in succession do not perform much better than one linear one. It is only the third quadratic SFA that extracts the feature signal with a high correlation coefficient. As in the previous example, results improve if a linear combination of several (three) output signal components is used instead of just the single best one (this is
Table 2 (Example 3): Correlations r for Slowly (cf. Figure 7) and Rarely (cf. Figure 8) Varying Feature Signals x_s(t) and x_r(t) with Output Signal Components of Several SFAs in Succession.

Slowly varying signal x_s(t) with η̄ = 66 ± 1:

Data     | Quantity          | 1×SFA1    | 1×SFA2    | 2×SFA2    | 3×SFA2    | 4×SFA2
Training | j* ≤ 3            | 2.0 ± 0.0 | 3.0 ± 0.0 | 2.7 ± 0.5 | 1.6 ± 0.5 | 1.4 ± 0.5
Testing  | r(x'_s, ±y'_j*)   | .61 ± .02 | .59 ± .02 | .65 ± .08 | .81 ± .10 | .85 ± .11
Testing  | r(x'_s, Y'3[x_s]) | .60 ± .02 | .60 ± .02 | .69 ± .07 | .87 ± .06 | .86 ± .10
Training | r(x_s, Y3[x_s])   | .62 ± .01 | .62 ± .01 | .74 ± .05 | .89 ± .05 | .91 ± .05

Rarely changing signal x_r(t) with η̄ = 66 ± 8:

Data     | Quantity          | 1×SFA1    | 1×SFA2    | 2×SFA2    | 3×SFA2    | 4×SFA2
Training | j* ≤ 3            | 2.0 ± 0.0 | 3.0 ± 0.0 | 2.7 ± 0.5 | 1.6 ± 0.5 | 1.3 ± 0.5
Testing  | r(x'_r, y'_j*)    | .60 ± .01 | .57 ± .06 | .59 ± .10 | .84 ± .08 | .81 ± .17
Testing  | r(x'_r, Y'3[x_r]) | .60 ± .01 | .58 ± .06 | .67 ± .10 | .87 ± .08 | .85 ± .17
Training | r(x_r, Y3[x_r])   | .61 ± .01 | .62 ± .02 | .73 ± .07 | .91 ± .04 | .94 ± .03

Note: SFA1 and SFA2 indicate linear and quadratic SFA, respectively. j* is the index of the output signal component y_j* among the first three that is best correlated with s on the training data (in all cases, j* was also optimal for the test data). Y3[s] is the optimal linear combination of the first three output signal components to represent s (see equation 4.6). All figures are means over 10 simulation runs, with the standard deviation given after the ± sign. Signal length was always 4096.
strictly true only for training data, since on test data, the linear combination from the training data is used).

Figure 7 (Example 3): Hidden slowly varying feature signal discovered by three SFAs in succession. (Top left) Input signal. (Top right, middle left, and middle right) One, two, and three SFAs in succession and their corresponding output signals. The first subscript refers to the number of the SFA; the second subscript refers to its component. The slow feature signal x_s(t) is the slow up and down motion of the sine curve in the trajectory plot x2(t) versus x1(t). (Bottom left) η values of the input signal components and the various output signal components. η(x_s) and η(x_f) are shown at i, j = 0 for comparison. Only y3,1 can be considered to represent a slow feature. (Bottom right) Slow feature signal x_s(t) and its relation to some output signal components. The correlation of x_s(t) with +y1,3(t), −y2,2(t), and +y3,1(t) is 0.59, 0.61, and 0.97, respectively.

4.3 Example 4: Translation Invariance in a Visual System Model.

4.3.1 Network Architecture. Consider now a hierarchical architecture as illustrated in Figure 9. Based on a one-dimensional model retina with 65 sensors, layers of linear SFA modules with convergent connectivity alternate with layers of quadratic SFA modules with direct connectivity. This division into two types of sublayers allows us to separate the contribution of simple spatial averaging from the contribution of nonlinear processing
Figure 8 (Example 3): Hidden rarely changing feature signal discovered by three SFAs in succession.
to the generation of a translation-invariant object representation. It is clear from the architecture that the receptive eld size increases from bottom to top, and therefore the units become potentially able to respond to more complex features, two properties characteristic for the visual system (Oram & Perrett, 1994). 4.3.2 Training the Network. In order to train the network to learn translation invariance, the network must be exposed to objects moving translationally through the receptive eld. We assume that these objects create xed one-dimensional patterns moving across the retina with constant speed and same direction. This ignores scaling, rotation, and so forth, which will be investigated in the next section. The effects of additive noise are studied here. The training and test patterns were low-pass ltered gaussian white noise. The size of the patterns was randomly chosen from a uniform distribution between 15 and 30 units. The low-pass lter was a gaussian with a width s randomly chosen from a uniform distribution between 2 and 5. Each pattern was normalized to zero mean and unit variance to eliminate trivial differences between patterns, which would make object recognition easier at the end. The patterns were always moved across the retina with a constant speed of 1 spatial unit per time unit. Thus, for a given pattern and without noise, the sensory signal of a single retinal unit is an accurate image of the spatial gray-value prole of the pattern. The time between one pattern being in the center of the retina and the next one was always 150 time units. Thus, there was always a pause between the presentation of two patterns, and there were never two patterns visible simultaneously. The high symmetry of the architecture and the stimuli made it possible to compute the input-output function only for one SFA module per layer. 
The other modules in the layer would have learned the same input-output function, since they saw the same input, just shifted by a time delay determined by the spatial distance of two modules. This cut down computational costs significantly. Notice that this symmetry also results in an implicit weight-sharing constraint, although this was not explicitly implemented (see the next section for more general examples).

Figure 9 (Examples 4 and 5): A hierarchical network of SFA modules as a simple model of the visual system learning translation and other invariances. Different layers correspond to different cortical areas. For instance, one could associate layers 1, 2, 3, and 4 with areas V1, V2, V4, and PIT, respectively. The retinal layer has 65 input units, representing a receptive field that is only a part of the total retina. The receptive field size of SFA modules is indicated at the left of the left-most modules. Each layer is split into an a and a b sublayer for computational efficiency and to permit a clearer functional analysis. The a modules are linear and receive convergent input, from either nine units in the retina or three neighboring SFA modules in the preceding layer. The b modules are quadratic and receive input from only one SFA module. Thus, receptive field size increases only between a b module and an a module but not vice versa. The number of units per SFA module is variable and the same for all modules. Only the layer 4b SFA module always has nine units to make output signals more comparable. Notice that each module is independent of the neighboring and the succeeding ones; there is no weight-sharing constraint and no backpropagation of error.

The sensory signals for some training and test patterns are shown in Figure 10. In the standard parameter setting, each SFA module had nine units; a stimulus with 20 patterns was used for training, and a stimulus with 50 patterns was used for testing; no noise was added. To improve generalization to test patterns, all signals transferred from one SFA module to the next were clipped at ±3.7 to eliminate extreme negative and positive values. The limit of ±3.7 was not optimized for best generalization but chosen such that clipping became visible in the standard display with a range of ±4, which is not relevant for the figures presented here.
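The clipping of transferred signals is a one-liner; a sketch with numpy (the function name is ours):

```python
import numpy as np

def clip_signal(y, limit=3.7):
    # clip signals transferred between SFA modules at ±limit (±3.7 in the text)
    # to suppress extreme values that would otherwise be amplified downstream
    return np.clip(y, -limit, limit)

out = clip_signal(np.array([-5.0, 0.2, 4.1]))
```

The fixed limit acts as a crude saturation nonlinearity; signals inside the usual working range pass through unchanged.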
740
L. Wiskott and T. Sejnowski
t
77.49
x,1 t
36.85
x,1 t
x1 t
34.68
t
t
Figure 10 (Example 4): Sensory signal of the central retinal unit in response to three patterns. Since all patterns move with the same constant speed across the retina, the sensory signals have the same shape in time as the patterns have in space, at least for the noise-free examples. (Left) First three patterns of the training sensory signal. Pattern characteristics are (29, 4.0), (25, 4.5), and (16, 3.1), where the first number indicates pattern size and the second the width of the gaussian low-pass filter used. (Middle) First three patterns of the test sensory signal without noise. Pattern characteristics are (22, 3.1), (26, 3.8), and (24, 2.5). (Right) Same test sensory signal but with a noise level of 1. With noise, the sensory signals of different retinal units differ in shape and not only by a time delay.
4.3.3 What Does the Network Learn? Figure 11 shows the first four components of the output signals generated by the trained network in response to three patterns of the training and test stimulus:

Instantaneous feedforward processing: Once the network is trained, processing is instantaneous. The output signal can be calculated moment by moment and does not require a continuously changing stimulus. This is important, because it permits processing also of briefly flashed patterns. However, it is convenient for display purposes to show the response to moving patterns.

No overfitting: The test output signal does not differ qualitatively from the training output signal, although it is a bit noisier. This indicates that 20 patterns are sufficient for training to avoid overfitting. However, clipping the signals as indicated above is crucial here. Otherwise, signals that get slightly out of the usual working range are quickly amplified to extreme values.

Longer responses: The response of a layer 4b unit to a moving pattern is longer than the response of a retinal sensor (compare Figures 10 and 11). This is due to the larger receptive field size. For a pattern of size 20, a retinal sensor response has a length of 20, given a stimulus speed of 1 spatial unit per time unit. A layer 4b unit responds to this pattern as soon as it moves into the receptive field and until it leaves
Figure 11 (Example 4): First four output signal components in response to three patterns of (A) the training signal and (B) the test signal. (C) All respective trajectory plots for the test signal. Time axes range from 0 to 3 × 150; all other axes range from −4 to +4. All signals are approximately normalized to zero mean and unit variance.
it, resulting in a response of length 84 = 65 (receptive field size) + 20 (pattern size) − 1.

Translation invariance: The responses to individual patterns have a shape similar to a half or a full sine wave. It may be somewhat surprising that the most invariant response should be half a sine wave and not a constant, as suggested in Figure 1. But the problem with a constant response would be that it would require a sharp onset and offset as the pattern moves into and out of the receptive field. Half a sine wave is a better compromise between signal constancy in the center and smooth onsets and offsets. Thus the output signal tends to be as translation invariant as possible under the given constraints, invariance being defined by equation 2.1.

Where-information: Some components, such as the first and the third one, are insensitive to pattern identity. Component 1 can thus be used to determine whether a pattern is in the center or periphery of the receptive field. Similarly, component 3 can be used to determine whether a pattern is more on the left or the right side. Taken together, components 1 and 3 represent pattern location, regardless of other aspects of the pattern. This becomes particularly evident from the trajectory plot of y′3(t) versus y′1(t). These two components describe a loop in phase space, each point on this loop corresponding to a unique location of the pattern in the receptive field. y1(t) and y3(t) therefore represent where-information.

What-information: Some components, such as the second and fourth one, do distinguish well among different patterns, despite the translation invariance. The response to a certain pattern can be positive or negative, strong or weak. A reasonable representation for pattern identity can therefore be constructed as follows. Take components 1, 2, and 4, and subtract the baseline response. This yields a three-dimensional response vector for each moment in time, which is zero if no pattern is present in the receptive field.
As a pattern moves through the receptive field, the amplitude of the response vector increases and decreases, but its direction tends to change little. The direction of the response vector is therefore a reasonable representation of pattern identity. This can be seen in the three trajectory plots y′2(t) versus y′1(t), y′4(t) versus y′1(t), and y′4(t) versus y′2(t). Ideally, the response vector should describe a straight line going from the baseline origin to some extreme point and then back again to the origin. This is what was observed on training data if only a few patterns were used for training. When training was done with 20 patterns, the response to test data formed noisy loops rather than straight lines. We will later investigate quantitatively to what extent this representation permits translation-invariant pattern recognition.
Two aspects of the output signal were somewhat surprising. First, why does the network generate a representation that distinguishes among patterns, even though the only objective of slow feature analysis is to generate a slowly varying output? Second, why do where- and what-information get represented in separate components, although this distinction has not been built into the algorithm or the network architecture? This question is particularly puzzling since we know from the second example that SFA tends to distribute slow features over several components. Another apparent paradox is the fact that where-information can be extracted at all by means of the invariance objective, even though the general notion is that one wants to ignore pattern location when learning translation invariance (cf. section 1). To give an intuitive answer to the first question, assume that the optimal output components would not distinguish among patterns. The experiments suggest that the first component would then have approximately half-sine-wave responses for all patterns; the second component would have full-sine-wave responses for all patterns, since a full sine wave is uncorrelated to a half sine wave and still slowly varying. However, a component with half-sine-wave responses with different positive and negative amplitudes for different patterns can also be uncorrelated to the first component, while being more slowly varying than the component with full-sine-wave responses, which contradicts the assumption. Thus, the objective of slow variation in combination with the decorrelation constraint leads to components that differentiate among patterns. Why where- and what-information get represented in separate components is a more difficult issue.
The first where-component, y1(t), with half-sine-wave responses, is probably distinguished by its low g-value, because it is easier for the network to generate smooth responses if they do not distinguish between different patterns, at least for larger numbers of training patterns. Notice that the what-components, y2(t) and y4(t), are noisier. It is unclear why the second where-component, y3(t), emerges so reliably (although not always as the third component) even though its g-value is comparable to that of other what-components with half-sine-wave responses. For some parameter regimes, such as fewer patterns or smaller distances between patterns, no explicit where-components emerge. However, more important than the concentration of where-information in isolated components is the fact that the where-information gets extracted at all by the same mechanism as the what-information, regardless of whether explicit where-components emerged. It is interesting to compare the pattern representation of the network with the one sketched in Figure 1. We have mentioned that the sharp onsets and offsets of the sketched representation are avoided by the network, leading to typical responses in the shape of a half or full sine wave. Interestingly, however, if one divides component 2 or 4 by component 1 (all components taken minus their resting value), one obtains signals similar to
that suggested in Figure 1 for representing pattern identity: signals that are fairly constant and pattern specific if a pattern is visible and undetermined otherwise. If one divides component 3 by component 1, one obtains a signal similar to those suggested in Figure 1 for representing pattern location: one that is monotonically related to pattern location and undetermined if no pattern is visible.

4.3.4 How Translation Invariant Is the Representation? We have argued that the direction of the response vector is a good representation of the patterns. How translation invariant is this representation? To address this question, we have measured the angle between the response vector of a pattern at a reference location and at all other valid locations. The location of a pattern is defined by the location of its center (or the right center pixel if the pattern has an even number of pixels). The retina has a width of 65, ranging from −32 to +32, with the central sensor serving as the origin, that is, location 0. The largest patterns of size 30 are at least partially visible from location −46 up to +47. The standard location for comparison is −15. The response vector is defined as a subset of output components minus their resting values. For the output signal of Figure 11, for instance, the response vector may be defined as r(t) := [y1(t) − y1(0), y2(t) − y2(0), y4(t) − y4(0)]^T, if components 1, 2, and 4 are taken into account and if at time t = 0 no pattern was visible and the output was at resting level. If not stated otherwise, all components out of the nine output signal components that were useful for recognition were taken for the response vectors. In Example 4, the useful components were always determined on the training data based on recognition performance. Since at a given time t_pl the stimulus shows a unique pattern p at a unique location l, we can also parameterize the response vector by pattern index p and location l and write r_p(l) := r(t_pl).
The angle between the response vectors of pattern p at the reference location −15 and pattern p′ at a test location l is defined as

∠(r_p(−15), r_p′(l)) := arccos( (r_p(−15) · r_p′(l)) / (‖r_p(−15)‖ ‖r_p′(l)‖) ),    (4.7)

where r · r′ indicates the usual inner product and ‖r‖ indicates the Euclidean norm. Figure 12 shows percentiles for the angles of the response vectors at a test location relative to a reference location for the 50 test patterns.
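The response-vector construction and equation 4.7 can be written compactly; a minimal numpy sketch (function names are ours; the text's components 1, 2, and 4 correspond to 0-based indices 0, 1, and 3):

```python
import numpy as np

def response_vector(y_t, y_rest, components=(0, 1, 3)):
    # selected output components minus their resting values
    idx = list(components)
    return np.asarray(y_t)[idx] - np.asarray(y_rest)[idx]

def angle_deg(r, r2):
    # angle of equation 4.7 between two response vectors, in degrees
    c = np.dot(r, r2) / (np.linalg.norm(r) * np.linalg.norm(r2))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```

The `np.clip` guards against rounding pushing the cosine marginally outside [−1, 1], which would make `arccos` return NaN.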
4.3.5 What Is the Recognition Performance? If the ultimate purpose of the network is learning a translation-invariant representation for pattern recognition, we should characterize the network in terms of recognition performance. This can be done by using the angles between response vectors as a simple similarity measure. But instead of considering the raw recognition rates, we will characterize the performance in terms of the ranking of
Figure 12 (Example 4): Percentiles of the angles ∠(r_p(−15), r_p(l)) between the response vectors at a test location l and the reference location −15 for 50 test patterns p. For this graph, components 1, 2, and 4 were used (cf. Figure 11). The thick line indicates the median angle or 50th percentile. The dashed lines indicate the smallest and largest angles. The bottom and top thin lines indicate the 10th and 90th percentiles, respectively. The white area indicates the size of the retina (65 units). Gray areas indicate locations outside the retina. Patterns presented at the edge between white and gray are only half visible to the retina. The average angle over the white area is 12.6 degrees. If the vectors were drawn randomly, the average and median angle would be 90 degrees (except for the reference location, where angles are always zero). Thus, the angles are relatively stable over different pattern locations. Results are very similar for the training patterns, with an average angle of 11.3 degrees.
the patterns induced by the angles, since that is a more refined measure. Take the response vectors of all patterns at the reference location −15 as the stored models, which should be recognized. Then compare a response vector r_p′(l) of a pattern p′ at a test location l with all stored models r_p(−15) in terms of the enclosed angles ∠(r_p(−15), r_p′(l)), which induce a ranking. If ∠(r_p′(−15), r_p′(l)) is smaller than all other angles ∠(r_p(−15), r_p′(l)) with p ≠ p′, then pattern p′ can be recognized correctly at location l, and it is on rank 1. If two other patterns have a smaller angle than pattern p′, then it is on rank 3, and so on. Figure 13 illustrates how the ranking is induced, and Figure 14 shows rank percentiles for all 20 training patterns and 50 test patterns over all valid locations.
Figure 13 (Examples 4 and 5): Pattern recognition based on the angles of the response vectors. Solid arrows indicate the response vectors for objects at the reference location. Dashed arrows indicate the response vectors for objects shown at a test location. The angles between a dashed vector and the solid vectors induce a ranking of the stored objects. If the correct object is at rank one, the object is recognized correctly, as in the case of Object 1 in the figure.
For the training patterns, there is a large range within which the median rank is 1; at least 50% of the patterns would be correctly recognized. This even holds for location +15, where there is no overlap between the pattern at the test location and the pattern at the reference location. Performance degrades slightly for the test patterns. The average rank over all test locations within the retina (the white area in the graph) is 1.9 for the training patterns and 6.9 for the test patterns. Since these average ranks depend on the number of patterns, we will normalize them by subtracting 1 and dividing by the number of patterns minus 1. This gives a normalized average rank between 0 and 1, with 0 indicating perfect performance (always rank 1) and 1 indicating worst performance (always last rank). The normalized average ranks for the graphs in Figure 14 are 0.05 and 0.12 for the training and test patterns, respectively; chance levels were about 0.5. Notice that the similarity measure used here is simple and gives only a lower bound for the performance; a more sophisticated similarity measure can only do better.
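The rank-based evaluation just described can be sketched in a few lines of numpy (names are ours; the angle similarity follows equation 4.7):

```python
import numpy as np

def rank_of_correct(models, probe, true_idx):
    # rank (1 = best) of the correct stored model when all models are
    # ordered by the angle they enclose with the probe response vector
    def ang(a, b):
        c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(c, -1.0, 1.0))
    angles = np.array([ang(m, probe) for m in models])
    return 1 + int(np.sum(angles < angles[true_idx]))

def normalized_average_rank(ranks, n_patterns):
    # (mean rank - 1) / (n - 1): 0 = perfect, 1 = worst, roughly 0.5 = chance
    return (np.mean(ranks) - 1.0) / (n_patterns - 1.0)
```

Averaging `rank_of_correct` over all test locations and normalizing yields the performance measure used in the figures below.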
Figure 14 (Example 4): Rank percentiles for the 20 training patterns (top) and 50 test patterns (bottom) depending on the test location. For this graph, components 1, 2, and 4 were used (cf. Figure 11). Conventions are as for Figure 12. Performance is, of course, perfect at the reference location −15, but also relatively stable over a large range. The average rank over the white area is 1.9 for the training patterns and 6.2 for the test patterns. Perfect performance would be 1 in both cases, and chance level would be approximately 10.5 and 25.5 for training and test patterns, respectively.
Figure 15 (Example 4): Dependence of network performance on the number of training patterns. The standard number was 20. The ordinate indicates the normalized average rank; the abscissa indicates the considered parameter, in this case the number of training patterns. Gray and black curves indicate performance on training and test data, respectively. Solid curves indicate results obtained with two layer-4b output signal components only; dashed curves indicate results obtained with as many components out of the first nine components as were useful for recognition. The useful components were determined on the training data based on recognition performance. The average number of components used for the dashed curves is shown above the curves. Each point in the graph represents the mean over five different simulation runs; the standard deviation is indicated by small bars. The curves are slightly shifted horizontally so that the standard deviation bars do not overlap. Twenty training patterns were sufficient, and even five training patterns produced reasonable performance.
4.3.6 How Does Performance Depend on the Number of Training Patterns, Network Complexity, and Noise Level? Using the normalized average rank as a measure of performance for translation-invariant pattern recognition, we can now investigate the dependence on various parameters. Figure 15 shows the dependence on the number of training patterns. One finds that 20 training patterns are enough to achieve good generalization; performance does not degrade dramatically down to five training patterns. This was surprising, since translation-invariant recognition appears to be a fairly complex task. Figure 16 shows the dependence of network performance on the number of components used in the SFA modules. The standard was nine for all modules. In this experiment, the number of components propagated from
Figure 16 (Example 4): Dependence of network performance on the number of components propagated from one module to the next. The standard value was 9. Conventions are as for Figure 15. Performance degraded slowly down to four components per module and more quickly for fewer components.
one module to the next one was reduced stepwise to just one. At the top level, however, usually nine components were used to make results more comparable. In the case of one and two components per module, however, only two and five components were available in the layer 4b module, respectively. Performance degraded slowly down to four components per module; below that, performance degraded quickly. With one component, the normalized average rank was at chance level. One can expect that performance would improve slightly with more than nine components per module. With one and two components per module, there were always two clear components in the output signal useful for recognition. With three components per module, the useful information began to mix with components that were not useful, so that performance actually degraded from two to three components per module. SFA should not be too sensitive to noise, because noise would yield quickly varying components that are ignored. To test this, we added gaussian white noise with different variances to the training and test signals. A sensory signal with a noise level of 1 is shown in Figure 10. The noise was independent only for the nine adjacent sensors feeding into one SFA module. Because in the training procedure only one module per layer was being trained, all modules in one layer effectively saw the same noise, but delayed by 4, 8, 12, and so on time units. However, since the network processed the
Figure 17 (Example 4): Dependence of network performance on the added noise level. The standard value was 0. Conventions are as for Figure 15. Performance degrades gracefully until it approaches chance level at a noise level of 2. A sensory signal with noise level 1 is shown in Figure 10.
input instantaneously, it is unlikely that the delayed reoccurrence of the noise could be used to improve robustness. Figure 17 shows the dependence of performance on the noise level. The network performance degraded gracefully.

4.3.7 What Is the Computational Strategy for Achieving Translation Invariance? How does the network achieve translation invariance? This question can be answered at least for a layer 1b SFA module by visualizing its input-output function (not including the clipping operation from the 1a to the 1b module). At that level, the input-output function is a polynomial of nine input components, for example, x−4, x−3, ..., x4, and degree 2; it is a weighted sum over all first- and second-degree monomials. Let us now define a center location and a spread for each monomial. For second-degree monomials, the spread is the difference between the indices of the two input components; the center location is the average over the two indices. For first-degree monomials, the spread is 0, and the center location is the index. Thus, x−3 x0 has a spread of 3 and center location −1.5, x3 x3 has a spread of 0 and center location 3, and x−2 has a spread of 0 and center location −2. Figure 18 shows the weights or coefficient values of the monomials for the first two components of the input-output function learned by the central layer 1b SFA module, including the contribution of
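The center-location and spread definitions above are simple index arithmetic; a small sketch (the function name is ours):

```python
def monomial_geometry(i, j=None):
    # center location and spread of a monomial over retinal indices:
    # x_i (first degree, j=None) or x_i * x_j (second degree)
    if j is None:
        return float(i), 0           # first degree: center = index, spread = 0
    return (i + j) / 2.0, abs(i - j)  # second degree: mean index, index difference
```

With these definitions, `monomial_geometry(-3, 0)` reproduces the text's example of spread 3 and center location −1.5 for x−3 x0.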
Figure 18 (Example 4): Visualization of the first two components, g1(x) (top) and g2(x) (bottom), of the input-output function realized by the central SFA module of layer 1b. Black dots indicate the coefficient values of the second-degree monomials at their center location. Dots related to monomials of the same spread are connected by dashed lines, dense dashing indicating a small spread and wide dashing indicating a large spread. The spread can also be inferred from the number of dots connected, since there are nine monomials with spread 0, eight monomials with spread 1, and so on. The thick line indicates the coefficient values of the first-degree monomials. Second-order features get extracted, and coefficient values related to the same spread vary smoothly with center location to achieve a slowly varying output signal (i.e., approximate translation invariance).
the preceding 1a SFA module. All other modules in the same layer learn the same input-output function, just shifted by a multiple of four units. The graphs show that second-degree monomials contributed significantly (signals generated by first- and second-degree monomials have similar variance, so that coefficient values of first- and second-degree monomials can be compared), which indicates that second-order features were extracted. In general, coefficient values related to monomials of the same degree and spread increase smoothly from periphery to center, forming bell-shaped or half-sine-wave-shaped curves. This results in a fade-in/fade-out mechanism applied to the extracted features as patterns move through the receptive field. The general mechanism by which the network learns approximate translation invariance is closely related to higher-order networks in which translation invariance is achieved by weight sharing among monomials of the same degree and spread (see Bishop, 1995). The difference is that in those networks, the invariance is built in by hand and without any fade-in/fade-out. What is the relative contribution of the linear a-sublayers and the quadratic b-sublayers to the translation invariance? This can be inferred from the g values plotted across layers. In Figure 19, a and b sublayers both contribute to the slow variation of the output signals, although the linear a sublayers seem to be slightly more important, at least beyond layer 1b. This holds particularly for the first component, which is usually the half-sine-wave-responding where-component, for which simple linear averaging seems to be a good strategy at higher levels.

4.3.8 Is There an Implicit Similarity Measure Between Patterns? The representation of a given pattern is similar at different locations: this is the learned translation invariance. But how does the network compare different patterns at the same location?
What is the implicit similarity measure between patterns, given that we compare the representations generated by the network in terms of the angles between the response vectors? Visual inspection did not give an obvious insight into what the implicit similarity measure might be, and comparisons with several other similarity measures were not conclusive either. For example, the correlation coefficient between the maximum correlation between pairs of patterns over all possible relative shifts on the one hand and the angle between the response vectors (presenting the patterns in the center of the retina and using components 1, 2, and 4) on the other hand was only −0.47, and the respective scatter plot was quite diffuse. There was no obvious similarity measure implicit in the network.

4.3.9 How Well Does the System Generalize to Other Types of Patterns? We have addressed this question with an experiment using two different types of patterns: high-frequency patterns generated with a gaussian low-pass filter with σ = 2 and low-frequency patterns generated with σ = 10 (see
Figure 19 (Example 4): g values of the nine signal components in the different layers, averaged over five simulation runs. Dashed and solid curves indicate the training and test cases, respectively. The bottom curve in each of the two bunches indicates the first component with the lowest g value, the top one the ninth component. A difference of 0.397 = log10(2.5) between the dashed and solid curves can be explained by the fact that there were 2.5 times more test than training patterns. The remaining difference and the fact that the solid curves are not monotonic and cross each other reflect limited generalization. The step from retina to layer 1a is linear and performs only decorrelation; no information is lost. Linear as well as nonlinear processing contribute to the translation invariance. For the first where-component, the linear convergent processing seems to be more important.
Figure 10 for patterns with different values of σ). Otherwise, the training and testing procedures were unchanged. Average angles and normalized average ranks are given in Table 3 and indicate that (1) translation invariance generalized well to the pattern type not used for training, since the average angles did not differ significantly for the two different testing conditions given the same training condition; (2) the invariance was in general better learned with low-frequency patterns than with high-frequency patterns (although at the cost of high variance), since the average angles were smaller for all testing conditions if the network was trained with low-frequency patterns; (3) low-frequency patterns were more easily discriminated than high-frequency patterns, since the normalized average rank was on average lower by 0.04 for the test patterns with σ = 10 than for those with σ = 2; (4) pattern recognition generalized only to a limited degree, since
Table 3 (Example 4): Average Angles and Normalized Average Ranks for the Networks Trained on One Pattern Type (σ = 2 or σ = 10) and Tested on the Training Data or Test Data of the Same or the Other Pattern Type.

                               σ = 2                        σ = 10
  Testing →            Training      Testing       Training      Testing
  ↓ Training
  Average angles
    σ = 2              15.2 ± 1.5*   18.1 ± 2.2*   —             19.0 ± 1.6
    σ = 10             —             11.5 ± 7.0    6.4 ± 4.3*    8.9 ± 5.3*
  Normalized average ranks
    σ = 2              .08 ± .02*    .13 ± .03*    —             .15 ± .01
    σ = 10             —             .19 ± .01     .06 ± .04*    .09 ± .02*

Note: All figures are means over five simulation runs, with the standard deviation given after ±. Entries marked with an asterisk indicate performances if a network was tested on the same pattern type for which it was trained. The components useful for recognition were determined on the training data. Training (testing) was done with identical patterns for all networks within a row (column).
the normalized average rank increased on average by 0.06 if training was not done on the pattern type used for testing. Notice that recognition performance dropped (point 4) even though the invariance generalized well (point 1). Response vectors for the pattern type not used for training seemed to be invariant but similar for different patterns, so that discrimination degraded.

4.4 Example 5: Other Invariances in the Visual System Model.

4.4.1 Training the Network. In this section, we extend our analysis to other invariances. To be able to consider not only geometrical transformations but also illumination variations, patterns are now derived from object profiles by assuming a Lambertian surface reflectance (Brooks & Horn, 1989). Let the depth profile of a one-dimensional object be given by q(u) and its surface normal vector by n(u) := [−q′(u), 1]^T / sqrt(q′(u)² + 1²), with q′(u) := dq(u)/du. If a light source shines with unit intensity from direction l = (sin(α), cos(α)) (where the unit vector (0, 1) would point away from the retina and (1, 0) would point toward the right of the receptive field), the pattern of reflected light is i(u) = max(0, n(u) · l). To create an object's depth profile, take gaussian white noise of length 30 pixels, low-pass filter it with cyclic boundary conditions by a gaussian with a width σ randomly drawn from the interval [2, 5], and then normalize it to zero mean and unit variance. This is then taken to be the depth derivative q′(u) of the object. The
Slow Feature Analysis
755
depth profile itself could be derived by integration but is not needed here. Applying the formula for the Lambertian reflectance given above yields the gray-value pattern i(u). Varying α (here between −60° and +60°) yields different patterns for the same object. In addition, the object can be placed at different locations (centered anywhere in the receptive field), scaled (to a size between 2 and 60 pixels), rotated (which results in a cyclic shift up to a full rotation), and varied in contrast (or light intensity, between 0 and 2). Thus, besides the fact that each pattern is derived from a randomly drawn object profile, it has specific values for the five parameters location, size, cyclic shift, contrast, and illumination angle. If a certain parameter is not varied, it usually takes a standard value, which is the center of the receptive field for the location, a size of 30 pixels, no cyclic shift for rotation, a contrast of 1, and −30° for the light source. Notice that location, size, and contrast have values that can be uniquely inferred from the stimulus if the object is entirely within the receptive field. Cyclic shift and illumination angle are not unique, since identical patterns could be derived for different cyclic shift and illumination angle if the randomly drawn object profiles happened to be different but related in a specific way, for example, if two objects are identical except for a cyclic shift. Between the presentation of two objects, there is usually a pause of 60 time units with no stimulus presentation. Some examples of stimulus patterns changing with respect to one of the parameters are given in Figure 20 (top). First, the network was trained for only a single invariance at a time. As in the previous section, 20 patterns were usually used for training and 50 patterns for testing. The patterns varied in only one parameter and had standard values for all other parameters. In a second series, the network was trained for two or three invariances simultaneously.
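The pattern-generation recipe above can be sketched in a few lines of Python. This is our own illustration; function and parameter names are not from the paper, and the filtering is done by cyclic FFT convolution as one way to realize the cyclic boundary conditions:

```python
import numpy as np

def lambertian_pattern(rng, n_pixels=30, alpha_deg=-30.0):
    # Depth derivative q'(u): gaussian white noise, cyclically low-pass
    # filtered by a gaussian of width sigma drawn from [2, 5], then
    # normalized to zero mean and unit variance.
    sigma = rng.uniform(2.0, 5.0)
    noise = rng.standard_normal(n_pixels)
    k = np.arange(n_pixels)
    dist = np.minimum(k, n_pixels - k)              # cyclic distance to 0
    kernel = np.exp(-0.5 * (dist / sigma) ** 2)
    kernel /= kernel.sum()
    dq = np.real(np.fft.ifft(np.fft.fft(noise) * np.fft.fft(kernel)))
    dq = (dq - dq.mean()) / dq.std()
    # Surface normal n(u) = [-q'(u), 1]^T / sqrt(q'(u)^2 + 1)
    n = np.stack([-dq, np.ones(n_pixels)]) / np.sqrt(dq ** 2 + 1.0)
    # Light direction l = (sin(alpha), cos(alpha)); reflected light
    # i(u) = max(0, n(u) . l)
    a = np.deg2rad(alpha_deg)
    light = np.array([np.sin(a), np.cos(a)])
    return np.maximum(0.0, light @ n)

rng = np.random.default_rng(0)
pattern = lambertian_pattern(rng)
print(pattern.shape)
```

Varying `alpha_deg` while keeping the same random object then yields the illumination series described in the text.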
Again, the patterns of the training stimulus varied in just one parameter at a time. The other parameters either had their standard values, if they belonged to an invariance not trained for, or they had a random value within the valid range, if they belonged to an invariance trained for. For example, if the network was to be trained for translation and size invariance, each pattern varied in either location or size while having a randomly chosen but constant value for size or location, respectively (cf. Figure 20, bottom). All patterns had standard parameter values for rotation, contrast, and illumination angle. This training procedure comes close to the natural situation in which each pattern occurs many times with different sets of parameters, is computationally convenient, and permits a direct comparison between different experiments, because identical stimuli are used in different combinations. In some experiments, all nonvaried parameters had their standard value; in the example given above, the patterns varying in location had standard size, and the patterns varying in size had standard location. In that case, however, the network can learn translation invariance only for patterns of standard size and, simultaneously, size invariance for patterns of standard
L. Wiskott and T. Sejnowski
Figure 20 (Example 5): (Top) Stimulus patterns of the same object generated by changing object location from the extreme left to the extreme right in steps of 1, object size from 2 to 30 and back to 2 in steps of 2, cyclic shift from 0 to twice the object size in steps of 1 (resulting in two full rotations), contrast (or light intensity) from 1/15 to 2 and back again in steps of 1/15, and illumination angle from −60° to +60° in steps of 2°. (Bottom) Stimulus for training translation and size invariance. Patterns derived from two different objects are shown. The first two patterns have constant but randomly chosen size and vary in location. The second two patterns have constant but randomly chosen location and vary in size. The first and third patterns were derived from the same object, and so were the second and fourth patterns. Cyclic shift (rotation), contrast, and illumination angle have their standard values. For display purposes, the pause between two presentations of an object is only 14 or 24 time units in the graphs instead of 60.
location. No invariances can be expected for patterns of nonstandard size and location.

4.4.2 What Does the Network Learn? Figure 21 shows the first four components of the output signal generated by a network trained for size invariance in response to 10 objects of the training stimulus. In this case, the stimuli consisted of patterns of increasing and decreasing size, like the second example in Figure 20 (top). The response in this case, and also for the other invariances, is similar to that in the case of translation invariance (see Figure 11), with the difference that no explicit where-information is visible. Thus, object recognition should also be possible for these other invariances based on the response vectors.

4.4.3 How Invariant Are the Representations? Figure 22 shows rank percentiles for the five invariances. For each but the top right graph, a network
Figure 21 (Example 5): First four output signal components of the network trained for size invariance in response to 10 objects of the training stimulus, with patterns changing in size only. Some trajectory plots are also shown at the bottom. Time axes range from 0 to 10 × 119, where 119 corresponds to the presentation of a single pattern including a pause of 60 time units with no stimulation. All other axes range from −4 to +4. All signals are approximately normalized to zero mean and unit variance.
was trained with 20 patterns varying in one viewing parameter and having standard values for the remaining four. Fifty test patterns were used to determine which of the output components were useful for recognition. Fifty different test patterns were then used to determine the rank percentiles shown in the graphs. Notice that the graph shown for translation invariance differs from the bottom graph in Figure 14 because of differences in the reference location, the number of components used for recognition, and the characteristics of the patterns. The recognition performance was fairly good over a wide range of parameter variations, indicating that the networks have learned the invariances well. For varying illumination angle, however, the network did not generalize if the angle had a different sign, that is, if the light came from the other side. The top right graph shows the performance of a network trained for translation and size invariance simultaneously and tested for translation invariance.

4.4.4 What Are the Recognition Performances? The normalized average ranks were used as a measure of invariant recognition performance, as for translation invariance in section 4.3. Normalized average ranks are listed in Table 4 for single and multiple invariances. The numbers of output signal components used in each experiment are shown in Table 5. Although normalized average ranks for different invariances cannot be directly compared, the figures suggest that contrast and size invariance are
[Figure 22 panels: Networks trained for translation invariance; networks trained for translation and size invariance; size invariance; rotation invariance; contrast invariance; illumination invariance. Each panel plots rank (0 to 50) against the varied parameter: pattern center location, pattern size, cyclic shift, pattern contrast, or illumination angle.]
Figure 22 (Example 5): Rank percentiles for different networks and invariances and for 50 test patterns. The top right graph shows percentiles for a network that has been trained for translation and size invariance simultaneously and was tested for translation invariance; all other networks have been trained for only one invariance and tested on the same invariance. The thick lines indicate the median ranks or 50th percentiles. The dashed lines indicate the smallest and largest ranks. The bottom and top thin lines indicate the 10th and 90th percentiles, respectively. Chance level would be a rank of 25. Performance is perfect for the standard view, but also relatively stable over a wide range.
easier to learn than translation invariance, which is easier to learn than rotation and illumination invariance. Contrast invariance can be learned perfectly if enough training patterns are given. This is not surprising, since the input patterns themselves already form vectors that simply move away from the origin in a straight line and back again, as required for invariant object recognition. That contrast requires many training patterns for good generalization is probably because a single pattern changing in contrast spans only one dimension. Thus, at least 30 patterns are required to span the relevant space (since patterns have a size of 30). The performance is better for size invariance than for translation invariance, although translation is mathematically simpler, probably because individual components of the input signal change more drastically with translation than with scaling, at least around the point with respect to which objects are being rescaled. Illumination invariance is by far the most difficult of the considered invariances to learn, in part because there is no unique relationship between illumination angle, the object's depth profile, and the light intensity pattern. This also holds to some degree for rotation invariance, on which the network performs second worst. Illumination invariance seems to be particularly difficult to learn if the illumination angle changes sign, that is, if the light comes from the other side (cf. Figure 22). It is also interesting to look at how well invariances have been implicitly learned for which the network has not been trained. If trained for translation invariance, for instance, the network also learns size invariance fairly well. Notice that this is not symmetric: if trained for size invariance, the network does not learn translation invariance well. A similar relationship holds for contrast and illumination invariances. Learning illumination invariance teaches the network some contrast invariance, but not vice versa.
A comparison of the results listed in the bottom of Table 4 with those listed in the top shows that performance degrades when the network is trained on multiple invariances simultaneously. Closer inspection shows that this is due to at least two effects. First, if trained on translation invariance alone, patterns vary only in location and have standard size, whereas if trained on translation and size invariance, training patterns that vary in location have a random rather than a standard size. The network then has to learn translation invariance not only for standard-size patterns but also for nonstandard-size patterns; size becomes a parameter of the patterns the network can represent, and the space of patterns is much larger. However, since testing is done with patterns of standard size (this is important to prevent patterns from being recognized based on their size), pattern size cannot be used during testing. This effect can be found in a network trained only for translation invariance, but with patterns of random size (compare the corresponding entries in Table 6).
Table 4 (Example 5): Normalized Average Ranks for Networks Trained for One, Two, or Three of the Five Invariances and Tested with Patterns Varying in One of the Five Parameters.

Rows (training): Location; Size; Rotation; Contrast; Illumination; Location and size; Location and rotation; Size and illumination; Rotation and illumination; Location, size, and rotation; Rotation, contrast, and illumination. Columns (testing): Location, Size, Rotation, Contrast, and Illumination, each with a Training and a Testing subcolumn.

Note: All figures are means over three simulation runs with the standard deviation given in small numerals. Boldface numbers indicate performances if the network was tested on an invariance for which it was trained. Training performance was based on the same output signal components as used for testing. Training (testing) was done with identical patterns for all networks within a row (column).
Table 5 (Example 5): Numbers of Output Signal Components Useful for Recognition.

Training (rows) / Testing (columns):    Location   Size       Rotation   Contrast   Illumination
Location                                6.0 (0.0)  6.0 (1.0)  4.3 (0.6)  5.3 (0.6)  5.0 (1.7)
Size                                    3.7 (1.2)  6.7 (0.6)  3.3 (0.6)  5.3 (2.1)  4.0 (1.0)
Rotation                                6.7 (0.6)  5.7 (2.3)  8.0 (1.0)  4.3 (1.5)  4.7 (1.5)
Contrast                                3.7 (2.9)  6.0 (2.6)  2.3 (0.6)  8.7 (0.6)  5.3 (2.5)
Illumination                            4.7 (1.2)  5.7 (3.1)  3.7 (1.2)  8.7 (0.6)  6.7 (0.6)
Location and size                       8.0 (1.0)  7.3 (0.6)  4.3 (0.6)  4.3 (0.6)  4.3 (1.2)
Location and rotation                   6.3 (1.2)  6.3 (0.6)  5.0 (1.7)  4.3 (1.5)  4.3 (0.6)
Size and illumination                   3.3 (0.6)  7.3 (1.5)  4.0 (1.0)  7.0 (1.7)  5.7 (0.6)
Rotation and illumination               7.0 (0.0)  6.0 (1.0)  6.0 (0.0)  6.0 (1.0)  5.0 (2.0)
Location, size, and rotation            6.3 (2.3)  6.7 (0.6)  6.7 (1.5)  4.0 (1.0)  4.3 (0.6)
Rotation, contrast, and illumination    5.0 (1.0)  5.3 (0.6)  4.0 (1.0)  5.7 (0.6)  4.7 (1.5)

Note: All figures are means over three simulation runs, with the standard deviation given in parentheses. Networks were trained for one, two, or three of the five invariances and tested with patterns varying in one of the five parameters. Boldface numbers indicate experiments where the network was tested on an invariance for which it was trained.
Second, the computational mechanisms by which different invariances are achieved may not be compatible and so would interfere with each other (compare the corresponding entries in Table 6). Further research is required to investigate whether this interference between different invariances is an essential problem or one that can be overcome by using more training patterns and networks of higher complexity and more computational power, and by propagating more components from one SFA module to the next. Table 6 shows that the amount of degradation due to each of the two effects discussed above can vary considerably. Translation invariance does not degrade significantly if training patterns do not have standard size, but it does if size invariance has to be learned simultaneously. Size invariance, on the other hand, does not degrade if translation invariance has to be learned simultaneously, but it degrades if the patterns varying in size are not at the same (standard) location. Some of these differences can be understood intuitively. For instance, it is clear that the learned translation invariance is largely insensitive to object size but that the learned size invariance is specific to the location at which it has been learned. Evaluating the interdependencies among invariances in detail could be the subject of further research. 4.4.5 How Does Performance Depend on the Number of Training Patterns and Network Complexity? The dependencies on the number of training patterns
Table 6 (Example 5): Normalized Average Ranks for the Network Trained on One or Two Invariances and Tested on One Invariance. The top part of the table covers the invariance pair Location and Size; the bottom part covers Rotation and Illumination.

Note: The icons represent the two-dimensional parameter space of the two invariances considered; all other parameters have standard values. In the top part of the table, for instance, the vertical dimension refers to location and the horizontal dimension refers to size. The icons distinguish, for each pair: 20 training (or 50 test) input patterns with standard size and varying location; 20 (or 50) input patterns with standard location and varying size; 20 (or 50) input patterns with random but fixed size and varying location; 20 (or 50) input patterns with random but fixed location and varying size; and the unions of these sets, 40 (or 100) input patterns in total. The same scheme also holds for any other pair of two invariances. Two icons separated by a colon indicate a training and a testing stimulus. The experiments of the top half of Table 4 correspond to training and testing on the same invariance (boldface figures) or on different invariances otherwise. The experiments of the bottom half of Table 4, where multiple invariances were learned, correspond to testing on one of the trained invariances (boldface figures). The remaining experiments in the bottom half cannot be indicated by the icons used here, because the networks are trained on two invariances and tested on a third one.
and network complexity are illustrated in Figure 23. It shows normalized average ranks on training and test stimuli for the five different invariances and numbers of training patterns, as well as different numbers of components propagated from one SFA module to the next. As can be expected, training performance degrades and testing performance improves with the number of training patterns. One can extrapolate that for many training patterns, the performance would be about .050 for location, .015 for size, .055 for rotation, .000 for contrast, and .160 for illumination angle, which confirms the general results found in Table 4 as to how well invariances can be learned, with the addition that contrast can be learned perfectly given enough training patterns. Testing performance also improves with network complexity, that is, with the number of propagated components, except when overfitting occurs, which is apparently the case at the bottom ends of the dashed curves. Contrast invariance generally degrades with network complexity. This is due to the fact that the input vectors already form a perfect representation for invariant recognition (see above),
Figure 23 (Example 5): Normalized average ranks on training and test stimuli for the five invariances. Solid lines and filled symbols indicate the dependency on the number of training patterns, which was 10, 20, 40, and 80 from the lower right to the upper left of each curve. Dashed lines and empty symbols indicate the dependency on the number of components propagated from one SFA module to the next one, which was 5, 7, 9, 11, 13, and 15 from the upper end of each curve to the lower end. The direction of increasing number of components can also be inferred from the polarity of the dashing (long dashes point toward large numbers of propagated components). If not varied, the number of training patterns was 20 and the number of propagated components was 9. Notice that the curves cross for these standard values. The solid curve for contrast invariance is shifted slightly downward to make it distinguishable from the dashed curve. Each point is an average over three simulation runs; standard deviations for points with standard parameter sets can be taken from Table 4.
which can only be degraded by further processing, and the fact that each pattern spans only one dimension, which means that overfitting occurs easily.

5 Discussion

The new unsupervised learning algorithm, slow feature analysis (SFA), presented here yields a high-dimensional, nonlinear input-output function that extracts slowly varying components from a vectorial input signal. Since the learned input-output functions are nonlinear, the algorithm can be applied repeatedly, so that complex input-output functions can be learned in a hierarchical network of SFA modules with limited computational effort.
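For the linear case, the core of the algorithm fits in a few lines: sphere the input to unit covariance, then take the directions in which the time derivative has the least variance. The following is our own minimal sketch, not the authors' implementation:

```python
import numpy as np

def linear_sfa(x):
    """x: (T, N) input signal. Returns (T, N) output signals ordered
    from slowest to fastest, each zero mean, unit variance, decorrelated."""
    x = x - x.mean(axis=0)
    # Sphere (whiten) the input so that its covariance is the identity.
    d, u = np.linalg.eigh(np.cov(x.T))
    z = x @ (u @ np.diag(1.0 / np.sqrt(d)) @ u.T)
    # PCA on the covariance of the time derivative; the smallest-variance
    # directions (eigh returns ascending eigenvalues) are the slowest.
    zdot = np.diff(z, axis=0)
    _, w = np.linalg.eigh(np.cov(zdot.T))
    return z @ w   # first column = slowest feature

# Toy signal: a slow sine and a fast sine, rotated into a mixture.
t = np.linspace(0.0, 4.0 * np.pi, 2000)
sources = np.stack([np.sin(t), np.sin(37.0 * t)], axis=1)
mix = sources @ np.array([[0.6, 0.8], [-0.8, 0.6]])
y = linear_sfa(mix)
# The slowest extracted component matches the slow source up to sign.
print(abs(np.corrcoef(y[:, 0], np.sin(t))[0, 1]))
```

Nonlinear SFA, as used in the paper, would apply the same procedure after a nonlinear expansion of the input.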
SFA is somewhat unusual in that directions of minimal variance rather than maximal variance are extracted. One might expect this to be particularly sensitive to noise. Noise can enter the system in two ways. First, there may be an additional input component carrying noise. Assume an input signal as in Example 1 plus a component x4(t) that has a constant value plus some small noise. x4(t) would seem to be most invariant. However, normalizing it to zero mean and unit variance would amplify the noise such that it becomes a highly fluctuating signal, which would be discarded by the algorithm. Second, noise could be superimposed on the input signal that carries the slowly varying features. This could change the Δ values of the extracted signals significantly. However, a slow signal corrupted by noise will usually be slower than a fast signal corrupted by the same amount of noise, so that the slow signal will still be the first one extracted. Only if the noise is unevenly distributed over the potential output signal components can one expect the slow components not to be correctly discovered. Temporal low-pass filtering might help in dealing with noise of this kind to some extent. The apparent lack of an obvious similarity measure in the network of section 4.3 might reflect an inability to find it. For instance, the network might focus on features that are not well captured by correlation and not easily detectable by visual inspection. Alternatively, no particular similarity measure may be realized, and similarity may be more a matter of chance. On the negative side, this contradicts the general goal that similar inputs should generate similar representations. On the positive side, however, the network might have enough degrees of freedom to learn any kind of similarity measure in addition to translation invariance if trained appropriately. This could be valuable, because simple correlation is actually not the best similarity measure in the real world.
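The two noise scenarios above can be checked numerically. This toy example is ours, not from the paper; the Δ value is approximated by the mean squared temporal difference of the normalized signal:

```python
import numpy as np

def delta(y):
    """Discrete slowness measure: mean squared temporal difference."""
    return np.mean(np.diff(y) ** 2)

def normalize(y):
    return (y - y.mean()) / y.std()

rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0 * np.pi, 1000)
slow, fast = np.sin(t), np.sin(50.0 * t)
noise = 0.1 * rng.standard_normal(t.size)

# Scenario 1: a constant component with tiny noise looks invariant, but
# normalization to unit variance amplifies the noise into a fast signal.
const = 5.0 + 0.001 * rng.standard_normal(t.size)
print(delta(normalize(slow)) < delta(normalize(const)))

# Scenario 2: additive noise raises both Delta values, but the slow
# signal remains slower than the fast one.
print(delta(normalize(slow + noise)) < delta(normalize(fast + noise)))
```

Both comparisons come out in favor of the genuinely slow signal, as the text argues.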
5.1 Models for Unsupervised Learning of Invariances. Földiák (1991) presented an on-line learning rule for extracting slowly varying features based on models of classical conditioning. This system associated the current stimulus input with past response activity by means of memory traces, which implemented a tendency to respond similarly to stimuli presented successively in time, that is, a tendency to generate slow feature responses. A weight growth rule and normalization kept the weights finite and allowed some useful information to be extracted. A simple form of decorrelation of output units was achieved by winner-take-all competition. This, however, did not lead to the extraction of different slow features but rather to a one-of-n code, where each output unit codes for a certain value of the slow feature. The activity of a unit was linear in the inputs, and the output was a nonlinear function of the total input and the activities of neighboring units. Closely related models were also presented by Barrow and Bray (1992), O'Reilly and Johnson (1994), and Fukushima (1999) and were applied to learning lighting and orientation invariances in face recognition by Stewart Bartlett and Sejnowski (1998). These kinds of learning rules have also been applied to
hierarchical networks with several layers, most clearly in Wallis and Rolls (1997). A more detailed biological model of learning invariances, also based on memory traces, was presented by Eisele (1997). He introduced an additional memory trace to permit association between patterns not only in the backward direction but also in the forward direction. Another family of on-line learning rules for extracting slowly varying features has been derived from information-theoretic principles. Becker and Hinton (1992) trained a network to discover disparity as an invariant variable of random-dot stereograms. Two local networks, called modules, received input from neighboring but nonoverlapping parts of the stereogram, where each module received input from both images. By maximizing the mutual information between the two modules, the system learned to extract the only common aspect of their input, the disparity, given that disparity changed slowly. This is an example of spatial invariance, in contrast to the temporal invariance considered so far. Spatial and temporal invariance are closely related concepts, and algorithms can usually be applied interchangeably to one or the other domain. An application to the temporal domain, for instance, can be found in Becker (1993); for an overview, see Becker (1996). The information-theoretic approach is appealing because the objective function is well motivated. This becomes particularly obvious in the binary case, where the consequent application of the principles is computationally feasible. However, in the case of continuous variables, several approximations need to be made in order to make the problem tractable. The resulting objective function to be maximized is

I := 0.5 log [ V(a + b) / V(a − b) ],    (5.1)
where a and b are the outputs of the two modules and V(·) indicates the variance of its argument. If we set a := y(t) and b := y(t − 1), assuming discretized time, this objective function is almost identical to the objectives formalized in equations 2.1 through 2.4. Thus, the information-theoretic approach provides a systematic motivation but not a different optimization problem. Zemel and Hinton (1992) have generalized this approach to extract several invariant variables: a and b became vectors, and V(·) became the determinant of the covariance matrix. The output units then preferentially produced decorrelated responses. Stone and Bray (1995) have presented a learning rule that is based on an objective function similar to that of Becker and Hinton (1992) or equation 2.1 and which includes a memory trace mechanism as in Földiák (1991). They define two memory traces,

ỹ(t) := [ Σ_{t′=1}^∞ exp(−t′/τ_s) y(t − t′) ] / [ Σ_{t″=1}^∞ exp(−t″/τ_s) ],
ȳ(t) := [ Σ_{t′=1}^∞ exp(−t′/τ_l) y(t − t′) ] / [ Σ_{t″=1}^∞ exp(−t″/τ_l) ],    (5.2)
one on a short timescale (ỹ, with a small τ_s) and one on a long timescale (ȳ, with a large τ_l). The objective function is defined as

F := 0.5 log [ ⟨(y − ȳ)²⟩ / ⟨(y − ỹ)²⟩ ],    (5.3)
which is equivalent to equation 5.1 in the limit τ_s → 0 and τ_l → ∞. The derived learning rule performs gradient ascent on this objective function. The examples in Stone and Bray (1995) are linearly solvable, so that only linear units are used. They include an example where two output units are trained; inhibitory input from the first to the second output unit enforces decorrelation. The examples in Stone (1996) are concerned with disparity estimation, which is not linearly solvable and requires a multilayer network. Backpropagation was used for training, with the error signal given by −F. A similar system derived from an objective function was presented by Peng et al. (1998). Mitchison (1991) presented a learning rule for linear units that is also derived from an objective function like that of equation 2.1. He pointed out that the optimal weight vector is given by the last eigenvector of the matrix ⟨ẋẋᵀ⟩. In the on-line learning rule, weights were prevented from decaying to zero by an explicit normalization, such as Σ_i w_i² = 1. The extracted output signal would therefore depend strongly on the range of the individual input components, which may be arbitrarily manipulated by rescaling the input components. For instance, if there were one zero input component x_i = 0, an optimal solution would be w_i = 1 and w_{i′} = 0 for all i′ ≠ i, which would be undesirable. Therefore, it seems preferable to prevent weight decay by controlling the variance of the output signal rather than the sum over the weights directly. The issue of extracting several output components was addressed by introducing a different bias for each output component, which would break the symmetry if the weight space for the slowest output components were more than one-dimensional.
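Equations 5.1 through 5.3 can be evaluated directly on toy signals. This is our own illustration (the infinite-sum traces are computed recursively as exponential moving averages); a slowly varying signal scores higher than a quickly varying one under both objectives:

```python
import numpy as np

def info_objective(y):
    # Eq. 5.1 with a = y(t), b = y(t - 1):  I = 0.5 log(V(a+b) / V(a-b))
    a, b = y[1:], y[:-1]
    return 0.5 * np.log(np.var(a + b) / np.var(a - b))

def trace(y, tau):
    """Exponentially weighted average of the signal's past (a memory trace)."""
    w = np.exp(-1.0 / tau)
    out, acc = np.empty_like(y), y[0]
    for i, v in enumerate(y):
        acc = w * acc + (1.0 - w) * v   # recursive form of the weighted sum
        out[i] = acc
    return out

def stone_bray_F(y, tau_s=2.0, tau_l=100.0):
    # Eq. 5.3:  F = 0.5 log( <(y - ybar)^2> / <(y - ytilde)^2> )
    y_tilde, y_bar = trace(y, tau_s), trace(y, tau_l)
    return 0.5 * np.log(np.mean((y - y_bar) ** 2) /
                        np.mean((y - y_tilde) ** 2))

t = np.linspace(0.0, 2.0 * np.pi, 1000)
slow, fast = np.sin(t), np.sin(50.0 * t)
print(info_objective(slow) > info_objective(fast))
print(stone_bray_F(slow) > stone_bray_F(fast))
```

Both comparisons favor the slow signal, consistent with the near-equivalence of the two objectives noted in the text.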
However, redundancy can probably be better reduced by the decorrelation constraint of equation 2.4 also used in other systems, with the additional advantage that suboptimal output-signal components can also be extracted in an ordered fashion. Many aspects of the learning algorithm presented here can be found in these previous models. In particular, the optimization problem considered by Becker and colleagues is almost identical to the objective function and constraints used here. The novel aspect of the system presented here is its formulation as a closed-form learning algorithm rather than an on-line learning rule. This is possible because the input signal is expanded nonlinearly, which makes the problem linear in the expanded signal. The solution can therefore be found by sphering and applying principal component analysis to the time-derivative covariance matrix. This has several consequences that distinguish this algorithm from others: The algorithm is simple and guaranteed to find the optimal solution in one shot. Becker and Hinton (1995) have reported problems in finding
the global maxima, and they propose several extensions to avoid this problem, such as switching between different learning rules during training. This is not necessary here. Several slow features can easily be extracted simultaneously. The learning algorithm automatically yields a large set of decorrelated output signals, extracting decorrelated slow features. (This is different from having several output units representing only one slow feature at a time by a one-of-n code.) This makes it particularly easy to consider hierarchical networks of SFA modules, since enough information can be propagated from one layer to the next. The learning algorithm presented here suffers from the curse of dimensionality. The nonlinear expansion makes it necessary to compute large covariance matrices, which soon becomes computationally prohibitive. The system is therefore limited to input signals of moderate dimensionality. This is a serious limitation compared to the on-line learning rules. However, this problem might be alleviated in many cases by hierarchical networks of SFA modules, where each module has only a low-dimensional input, such as up to 12 dimensions. Example 4 shows such a hierarchical system in which a 65-dimensional input signal is broken down by several small SFA modules.

5.2 Future Perspectives. There are several directions in which slow feature analysis as presented here can be investigated and developed further. Comparisons with other learning rules should be extended; in particular, the scaling with input and output dimensionality needs to be quantified and compared. The objective function and the learning algorithm presented here are amenable to analysis: it would be interesting to investigate the optimal responses of an SFA module and the consequences and limitations of using SFA modules in hierarchical networks. Example 2 demonstrates how several slowly varying features can be extracted by SFA.
It also shows that the decorrelation constraint is insufficient to extract slow features in a pure form. It would be interesting to investigate whether SFA can be combined with independent component analysis (Bell & Sejnowski, 1995) to extract the truly independent slow features.

Example 5 demonstrates how SFA modules can be used in a hierarchical network for learning various invariances. It seems, however, that learning multiple invariances simultaneously leads to a significant degradation of performance. It needs to be investigated whether this is a principal limitation or can be compensated for by more training patterns and more complex networks.
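The closed-form procedure summarized above (nonlinear expansion, sphering, then principal component analysis of the time-derivative covariance matrix) can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code; the function name, the default quadratic expansion, and all parameter choices are assumptions, and a near-singular expanded covariance would need regularization in practice:

```python
import numpy as np

def sfa(x, expand=lambda u: np.hstack([u, u**2])):
    """Closed-form slow feature analysis (sketch).

    x: (T, n) input signal. Returns output components of unit variance,
    mutually decorrelated, ordered from slowest to fastest."""
    z = expand(x)
    z = z - z.mean(axis=0)                        # zero mean
    d, U = np.linalg.eigh(np.cov(z, rowvar=False))
    zs = z @ (U / np.sqrt(d))                     # sphering (whitening)
    dz = np.diff(zs, axis=0)                      # discrete time derivative
    _, W = np.linalg.eigh(np.cov(dz, rowvar=False))
    return zs @ W                                 # smallest eigenvalue first
```

For example, applying the linear version (`expand=lambda u: u`) to a two-dimensional mixture of a slow and a fast sinusoid returns the slow sinusoid, up to sign, as the first output component.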
L. Wiskott and T. Sejnowski
In the network of Example 4, there was no obvious implicit measure of pattern similarity. More research is necessary to clarify whether there is a hidden implicit measure or whether the network has enough capacity to learn a specific measure of pattern similarity in addition to the invariance.

Finally, it has to be investigated whether invariances can also be learned from real-world image sequences, such as natural scenes, within a reasonable amount of time. Some work in this direction has been done by Wallis and Rolls (1997). They trained a network for translation and other invariances on gray-value images of faces, but they did not test for generalization to new images. The number of images was so small that good generalization could not be expected.

It is clear that SFA can be only one component in a more complex self-organizational model. Aspects such as attention, memory, and recognition (more sophisticated than implemented here) need to be integrated to form a more complete system.

Acknowledgments

We are grateful to James Stone for very fruitful discussions about this project. Many thanks go also to Michael Lewicki and Jan Benda for useful comments on the manuscript. At the Salk Institute, L. W. was partially supported by a Feodor-Lynen fellowship of the Alexander von Humboldt Foundation, Bonn, Germany; at the Innovationskolleg, L. W. has been supported by HFSP (RG 35-97) and the Deutsche Forschungsgemeinschaft (DFG).

References

Barrow, H. G., & Bray, A. J. (1992). A model of adaptive development of complex cortical cells. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks II: Proc. of the Intl. Conf. on Artificial Neural Networks (pp. 881–884). Amsterdam: Elsevier.
Becker, S. (1993). Learning to categorize objects using temporal coherence. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 361–368). San Mateo, CA: Morgan Kaufmann.
Becker, S. (1996).
Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, 7(1), 7–31.
Becker, S., & Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356), 161–163.
Becker, S., & Hinton, G. E. (1995). Spatial coherence as an internal teacher for a neural network. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architecture and applications (pp. 313–349). Hillsdale, NJ: Erlbaum.
Becker, S., & Plumbley, M. (1996). Unsupervised neural network learning procedures for feature extraction and classification. J. of Applied Intelligence, 6(3), 1–21.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Brooks, M. J., & Horn, B. K. P. (Eds.). (1989). Shape from shading. Cambridge, MA: MIT Press.
Eisele, M. (1997). Unsupervised learning of temporal constancies by pyramidal-type neurons. In S. W. Ellacott, J. C. Mason, & I. J. Anderson (Eds.), Mathematics of neural networks (pp. 171–175). Norwell, MA: Kluwer.
Fleet, D. J., & Jepson, A. D. (1990). Computation of component image velocity from local phase information. Intl. J. of Computer Vision, 5(1), 77–104.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Fukushima, K. (1999). Self-organization of shift-invariant receptive fields. Neural Networks, 12(6), 791–801.
Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. on Systems, Man, and Cybernetics, 13, 826–834.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. of Neurophysiology, 58, 1233–1258.
Konen, W., Maurer, T., & von der Malsburg, C. (1994). A fast dynamic link matching algorithm for invariant pattern recognition. Neural Networks, 7(6/7), 1019–1030.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
Mitchison, G. (1991). Removing time variation with the anti-Hebbian differential synapse. Neural Computation, 3(3), 312–320.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. of Neuroscience, 13(11), 4700–4719.
Oram, M. W., & Perrett, D. I. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7(6/7), 945–972.
O'Reilly, R. C., & Johnson, M. H. (1994). Object recognition and sensitive periods: A computational analysis of visual imprinting. Neural Computation, 6(3), 357–389.
Peng, H. C., Sha, L. F., Gan, Q., & Wei, Y. (1998). Energy function for learning invariance in multilayer perceptron. Electronics Letters, 34(3), 292–294.
Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 45–76). Cambridge, MA: MIT Press.
Stewart Bartlett, M., & Sejnowski, T. J. (1998). Learning viewpoint invariant face representations from visual experience in an attractor network. Network: Computation in Neural Systems, 9(3), 399–417.
Stone, J. V. (1996). Learning perceptually salient visual parameters using spatiotemporal smoothness constraints. Neural Computation, 8(7), 1463–1492.
Stone, J. V., & Bray, A. J. (1995). A learning rule for extracting spatio-temporal invariances. Network: Computation in Neural Systems, 6(3), 429–436.
Theimer, W. M., & Mallot, H. A. (1994). Phase-based binocular vergence control and depth reconstruction using active vision. CVGIP: Image Understanding, 60(3), 343–358.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wallis, G., & Rolls, E. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Zemel, R. S., & Hinton, G. E. (1992). Discovering viewpoint-invariant relationships that characterize objects. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 299–305). San Mateo, CA: Morgan Kaufmann.

Received January 29, 1998; accepted June 1, 2001.
NOTE
Communicated by Jonathan Victor
Information Loss in an Optimal Maximum Likelihood Decoding
Inés Samengo
[email protected]
Centro Atómico Bariloche, (8400) San Carlos de Bariloche, Río Negro, Argentina

The mutual information between a set of stimuli and the elicited neural responses is compared to the corresponding decoded information. The decoding procedure is presented as an artificial distortion of the joint probabilities between stimuli and responses. The information loss is quantified. Whenever the probabilities are only slightly distorted, the information loss is shown to be quadratic in the distortion.
Understanding the way external stimuli are represented at the neuronal level is one central challenge in neuroscience. An experimental approach to this end (Optican & Richmond, 1987; Eskandar, Richmond, & Optican, 1992; Tovée, Rolls, Treves, & Bellis, 1993; Kjaer, Hertz, & Richmond, 1994; Heller, Hertz, Kjaer, & Richmond, 1995; Rolls, Critchley, & Treves, 1996; Treves, Skaggs, & Barnes, 1996; Rolls, Treves, & Tovée, 1997; Treves, 1997; Rolls & Treves, 1998; Rolls, Treves, Robertson, Georges-François, & Panzeri, 1998) consists of choosing a particular set of stimuli s ∈ S that can be controlled by the experimentalist and presenting these stimuli to a subject whose neural activity is being recorded. The set of neural responses r ∈ R is then defined as the whole collection of recorded events. It is up to the researcher to decide which entities in the recorded signal are considered as events r. For example, r can be defined as the firing rate in a fixed time window, the time difference between two consecutive spikes, the first k principal components of the time variation of the recorded potentials in a given interval, and so forth. Once the stimulus set S and the response set R have been settled on, the joint probabilities P(r, s) may be estimated from the experimental data. This is usually done by measuring the frequency of the joint occurrence of stimulus s and response r for all s ∈ S and r ∈ R. The mutual information between stimuli and responses reads (Shannon, 1948)
I = \sum_{s} \sum_{r} P(r, s) \log_2 \left[ \frac{P(r, s)}{P(r) P(s)} \right],   (1)

Neural Computation 14, 771–779 (2002)   © 2002 Massachusetts Institute of Technology
where

P(r) = \sum_{s} P(r, s),   (2)

P(s) = \sum_{r} P(r, s).   (3)
The mutual information quantifies how much can be learned about the identity of the stimulus just by looking at the responses. Accordingly, and since I is symmetric in r and s, its value is also a measure of the amount of information that the stimuli give about the responses. From a theoretical point of view, I is the most appealing quantity characterizing the degree of correlation between stimuli and responses that can be defined. This stems from the fact that I is the only additive functional of P(r, s) ranging from zero (for uncorrelated variables) up to the entropy of stimuli or responses (for a deterministic one-to-one mapping) (Fano, 1961; Cover & Thomas, 1991). However, even if formally sound, the mutual information has a severe drawback when dealing with experimental data. Often, and specifically when analyzing data from multiunit recordings, the response set R is quite large, its size increasing exponentially with the number of neurons sampled. The estimation of P(r, s) from the experimental frequencies may therefore be far from accurate, especially when recording from the vertebrate cortex, where there are long timescales in the variability and statistical structure of the responses. The mutual information I, being a nonlinear function of the joint probabilities, is extremely sensitive to the errors that may be involved in their measured values. As derived in Treves and Panzeri (1995), Panzeri and Treves (1996), and Golomb, Hertz, Panzeri, Treves, and Richmond (1997), the mean error in calculating I from the frequency table of events r and s is linear in the size of the response set. This analytical result has been obtained under the assumption that different responses are classified independently. Although there are situations where such a condition does not hold (Victor & Purpura, 1997), it is widely accepted that the bias grows rapidly with the size of the response set.
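Equations 1 through 3 translate directly into array code. The sketch below (function name illustrative) computes the plug-in mutual information from a joint probability table; as just discussed, when the table is estimated from limited data, this plug-in value is biased, with a mean error that grows with the size of the response set:

```python
import numpy as np

def mutual_information(P):
    """Mutual information (bits) of a joint table P[s, r], equation 1."""
    P = np.asarray(P, dtype=float)
    Ps = P.sum(axis=1, keepdims=True)    # P(s), equation 3
    Pr = P.sum(axis=0, keepdims=True)    # P(r), equation 2
    nz = P > 0                           # convention: 0 log 0 = 0
    return float((P[nz] * np.log2(P[nz] / (Ps @ Pr)[nz])).sum())
```

A deterministic one-to-one mapping between two equiprobable stimuli and two responses gives 1 bit, and an independent joint table gives 0, the two extremes mentioned above.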
Therefore, a common practice when dealing with large response sets is to calculate the mutual information not between S and R but between the stimuli and another set T, each of whose elements t is a function of the true response r, that is, t = t(r) (Treves, 1997; Rolls & Treves, 1998). It is easy to show that if the mapping between r and t is one to one, then the mutual information between S and R is the same as the one between S and T. However, for one-to-one mappings, the number of elements in T is the same as in R. A wiser procedure is to choose a set T that is large enough not to lose the relevant information, but sufficiently small to avoid significant limited-sampling errors. One possibility is to perform a decoding procedure (Gochin, Colombo, Dorfman, Gerstein, & Gross, 1994;
Rolls et al., 1996; Victor & Purpura, 1996; Rolls & Treves, 1998). In this case, T is taken to coincide with S. To make this correspondence explicit, the set T will be denoted by S' and its elements t by s'. Each s' in S' is taken to be a function of r and is called the predicted stimulus of response r. As stated in Panzeri, Treves, Schultz, and Rolls (1999), this choice for T is the smallest that could potentially preserve the information about the identity of the stimulus. The data processing theorem (Cover & Thomas, 1991) states that since s' is a function of r alone, and not of the true stimulus s eliciting response r, information about the real stimulus can only be lost, not created, by the transformation r → s'. Therefore, the true information I is always at least as large as the decoded information I_D, the latter being the mutual information between S and S'.¹ In order to have I and I_D as close as possible, it is necessary to choose the best s' for every r. The procedure consists of identifying, for every elicited response, which of the stimuli was most probably shown. The conditional probability of having shown stimulus s given that the response was r reads

P(s|r) = P(r, s) / P(r).   (4)

Therefore, the stimulus that has most likely elicited response r is

s'(r) = \arg\max_{s} P(s|r) = \arg\max_{s} P(r, s).   (5)
By means of equation 5, a mapping r → s' is established: each response has its associated maximum likelihood stimulus. Equation 4 provides the only definition of P(s|r) that strictly follows Bayes' rule, so in this case the decoding is called optimal. There are alternative ways of defining P(s|r) (Georgopoulos, Schwartz, & Kettner, 1986; Wilson & McNaughton, 1993; Seung & Sompolinsky, 1993; Rolls et al., 1996), some of which have the appealing property of being simple enough to be plausibly carried out by downstream neurons themselves. The purpose here, however, is to quantify how much information is lost when passing from r to s' using an optimal maximum likelihood decoding procedure. In general, there are several r associated with a given s'. One may therefore partition the response space R into separate classes C(s) = {r : s'(r) = s}, one class for every stimulus. The number of responses in class s' is N_{s'}. Of course, some classes may be empty. Here, the assumption is made that each r belongs to one and only one class (that is, equation 5 has a unique solution).
¹ It should be kept in mind, however, that when I_D is calculated from actual recordings, its value is typically overestimated because of limited sampling. Therefore, when dealing with real data sets, one may eventually obtain a value for I_D that surpasses the true mutual information I. Nevertheless, whenever the number of elements in S' is significantly smaller than the number of responses r, the sampling bias in I_D will be bounded by the one obtained in the estimation of I.
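The inequality I ≥ I_D can be illustrated numerically with the maximum likelihood decoder of equation 5 and the joint table P(s', s) made precise in equations 6 through 8 below. All numbers in this sketch are illustrative:

```python
import numpy as np

def info(M):
    """Mutual information (bits) of a joint probability table."""
    a = M.sum(axis=1, keepdims=True)
    b = M.sum(axis=0, keepdims=True)
    nz = M > 0
    return float((M[nz] * np.log2(M[nz] / (a @ b)[nz])).sum())

# illustrative joint distribution P[s, r]: 2 stimuli, 4 responses
P = np.array([[0.30, 0.10, 0.05, 0.05],
              [0.05, 0.05, 0.10, 0.30]])
sp = P.argmax(axis=0)        # maximum likelihood decoder s'(r), equation 5
PD = np.zeros((2, 2))        # PD[s, s'] = P(s', s): lump responses by class
for r in range(P.shape[1]):
    PD[:, sp[r]] += P[:, r]
I, ID = info(P), info(PD)
```

Here responses r_1 and r_2 are decoded as the first stimulus and r_3 and r_4 as the second; lumping the responses within each class costs about 0.03 bits (I ≈ 0.31, I_D ≈ 0.28), consistent with the data processing theorem.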
The joint probability of showing stimulus s and decoding stimulus s'(r) reads

P(s', s) = \sum_{r \in C(s')} P(r, s),   (6)

and the overall probability of decoding s',

P(s') = \sum_{s} P(s', s) = \sum_{r \in C(s')} P(r).   (7)

Clearly, with these definitions, the decoded information,

I_D = \sum_{s} \sum_{s'} P(s', s) \log_2 \left[ \frac{P(s', s)}{P(s') P(s)} \right],   (8)

may be calculated, and it has in fact been used in several experimental analyses (Rolls et al., 1996; Treves, 1997; Rolls & Treves, 1998; Panzeri et al., 1999). However, to date, no rigorous relationship between I and I_D has been established. The derivation of such a relationship is the main purpose here. When a decoding procedure is performed, r is replaced by s'. Such a mapping allows the calculation of P(s', s), after which any additional structure that may have been present in P(r, s) is neglected. For example, if two responses r_1 and r_2 encode the same stimulus s', it becomes irrelevant whether, for a given s, P(r_1, s) is much bigger than P(r_2, s) or, on the contrary, P(r_1, s) ≈ P(r_2, s). The only thing that matters is the value of the sum of the two: their global contribution to P(s', s). As a consequence, it seems natural to consider the detailed variation of P(r, s) within each class when estimating the information lost in the decoding. In this spirit, and aiming at quantifying such a loss of information, P(r, s) is written as

P(r, s) = \frac{P[s'(r), s]}{N_{s'(r)}} + \Delta(r, s),   (9)

where \Delta(r, s) = P(r, s) − P[s'(r), s] / N_{s'(r)}. Thus, the joint probability P(r, s), which in principle may have quite a complicated shape in R space, is separated into two terms. The first is flat inside every single class C(s'), and the second is whatever is needed to recover P(r, s). It should be noticed that

\sum_{r \in C(s')} \Delta(r, s) = 0,   (10)
for all s. Summing equation 9 over s,

P(r) = \frac{P[s'(r)]}{N_{s'(r)}} + \Delta(r),   (11)

where

\Delta(r) = \sum_{s} \Delta(r, s),   (12)

and

\sum_{r \in C(s')} \Delta(r) = 0.   (13)
Replacing equations 9 and 11 in equation 1, one arrives at

I = I_D + \sum_{r} \sum_{s} P(r, s) \log_2 \left[ \frac{P(r, s)}{Q(r, s)} \right],   (14)

where

Q(r, s) = \frac{P[s'(r), s]}{N_{s'}} + \Delta(r) \frac{P[s'(r), s]}{P(s')}   (15)

is a properly defined distribution, since it can be shown to be normalized and nonnegative. The term on the right of equation 14 is the Kullback-Leibler divergence (Kullback, 1968) between the distributions P and Q, which is guaranteed to be nonnegative. This confirms the intuitive result I_D ≤ I, the equality being valid only when

\Delta(r) P[s'(r), s] = \Delta(r, s) P[s'(r)],   (16)
for all r and s. Equation 14 states the quantitative difference between the full and the decoded information and is the main result here. The amount of lost information is therefore equal to the informational distance between the original probability distribution P(r, s) and a new function Q(r, s). It can be easily verified that

I_D = \sum_{s} \sum_{r} Q(r, s) \log_2 \left[ \frac{Q(r, s)}{Q(r) Q(s)} \right],   (17)

where

Q(r) = \sum_{s} Q(r, s) = P(r),
Q(s) = \sum_{r} Q(r, s) = P(s).   (18)

Therefore, the decoded information can be interpreted as a full mutual information between the stimuli and the responses, but with a distorted probability distribution Q(r, s). In this context, the difference I − I_D is no more
than the distance between the true distribution P(r, s) and the distorted one Q(r, s). When is equation 16 fulfilled? Surely, if there is at most one response in each class, Δ is always zero, and I = I_D. Also, if P(r, s) is already flat in each class, there is no information loss. However, if P(r, s) is not flat inside every class but obeys the condition P(r, s) = P_{s'}(r) P(s', s) for a suitable P(s', s) and some function P_{s'}(r) that sums to unity within C(s'), one can easily show that equation 16 holds. Notice that this case implies that if r_1 and r_2 belong to C(s'), then the ratio P(r_1, s) / P(r_2, s) is the same for all s. In other words, within each class C(s'), the different functions P(r|s) obtained by varying s differ from one another by a multiplicative constant. These conditions coincide with the ones given by Panzeri et al. (1999) for having an exact decoding within the short time limit. However, in the derivation here, there are no assumptions about the interval in which responses are measured. Therefore, the decoding being exact whenever equation 16 is fulfilled is not a consequence of the short time limit considered by Panzeri et al. (1999), but rather a general property of maximum likelihood decoding. Next, by making a second-order Taylor expansion of equation 14 in the distortions Δ(r, s) and Δ(r), one may show that

I = I_D + \sum_{s} \sum_{s'} P(s', s) \frac{E(s', s)}{2 \ln 2} + O(\Delta^2),   (19)

where

E(s', s) = \frac{1}{N_{s'}} \sum_{r \in C(s')} \left[ \left( \frac{\Delta(r, s)}{P(s', s)/N_{s'}} \right)^2 − \left( \frac{\Delta(r)}{P(s')/N_{s'}} \right)^2 \right].   (20)
Therefore, in the small-Δ limit, the difference between I and I_D is quadratic in the distortions Δ(r, s) and Δ(r). This means that if in a given situation these quantities are guaranteed to be small, then the decoded information will be a good estimate of the full information. Equation 20 is equivalent to

E(s', s) = \left\langle \left( \frac{P(r, s)}{P(s', s)/N_{s'}} \right)^2 − \left( \frac{P(r)}{P(s')/N_{s'}} \right)^2 \right\rangle_{C(s')},   (21)
where

\langle f(r) \rangle_{C(s')} = \frac{1}{N_{s'}} \sum_{r \in C(s')} f(r).
As a consequence, the relevant parameter in determining the size of E(s', s) is the mean value, within C(s'), of a function that essentially measures how different the true probability distributions P(r, s) and P(r) are from their flattened versions, P(s', s)/N_{s'} and P(s')/N_{s'}.
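Equations 14 and 15 are exact and can be verified numerically. The sketch below (joint table illustrative, chosen strictly positive so the Kullback-Leibler sum needs no zero-handling) builds Q(r, s) from equation 15 and checks that the information loss I − I_D equals the divergence between P and Q:

```python
import numpy as np

def info(M):
    """Mutual information (bits) of a joint probability table."""
    a = M.sum(axis=1, keepdims=True)
    b = M.sum(axis=0, keepdims=True)
    nz = M > 0
    return float((M[nz] * np.log2(M[nz] / (a @ b)[nz])).sum())

P = np.array([[0.30, 0.10, 0.05, 0.05],    # P[s, r], illustrative
              [0.05, 0.05, 0.10, 0.30]])
S, R = P.shape
Pr = P.sum(axis=0)                          # P(r)
sp = P.argmax(axis=0)                       # s'(r), equation 5
Ns = np.bincount(sp, minlength=S)           # class sizes N_{s'}
PD = np.zeros((S, S))                       # PD[s, s'] = P(s', s), equation 6
for r in range(R):
    PD[:, sp[r]] += P[:, r]
Psp = PD.sum(axis=0)                        # P(s'), equation 7
Dr = Pr - Psp[sp] / Ns[sp]                  # Delta(r), equation 11
Q = PD[:, sp] / Ns[sp] + Dr * PD[:, sp] / Psp[sp]   # equation 15
kl = float((P * np.log2(P / Q)).sum())      # right side of equation 14
```

For this table the loss is about 0.032 bits, and the identity of equation 14 holds to machine precision.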
To summarize, this note presents maximum likelihood decoding as an artificial (but useful) distortion of the distribution P(r, s) within each class C(s'). The decoded information is also shown to be a mutual information, calculated with the distorted probability distribution. The difference between I and I_D is the Kullback-Leibler distance between the true and distorted distributions. As such, it is always nonnegative, and it is easy to identify the conditions under which the two information measures are equal. Finally, for small distortions Δ, the amount of lost information is expressed as a quadratic function of Δ. In short, the aim of this work is to present a formal way of quantifying the effect of an optimal maximum likelihood decoding. It should be kept in mind that in real situations, where only a limited amount of data is available, the estimation of P(r|s) may well involve a careful analysis in itself. Some kind of assumption (for example, gaussian-shaped response variability) is usually required. The validity of the assumptions made depends on the particular data at hand. An inadequate choice of P(r|s) may of course lead to a distorted value of I, and in fact the bias may be in either direction. If the choice of P(r|s) does not even allow the correct identification of the maximum likelihood stimulus (see equation 5), then the calculated value of I_D will also be distorted. The purpose of this note, however, is to quantify how much information is lost when passing from r to s'(r). No attempt has been made to quantify I or I_D for different estimations of P(r|s). Sometimes P(s', s) is defined in terms of P(r, s) without actually decoding the stimulus to be associated with each response. For example, P(s', s) can be introduced as \sum_r P(r, s') P(r, s) / P^2(r) (Treves, 1997). This approach, although formally sound, is not based on an r → s' mapping and does not allow a partition of R into classes.
Therefore, it is not directly related to the analysis presented here. However, analogous derivations might allow the information loss to be quantified in that case as well.

Acknowledgments
I thank Bill Bialek, Anna Montagnini, and Alessandro Treves for very useful discussions. This work has been partially supported by a grant of A. Treves from the Human Frontier Science Program, number RG 01101998B.

References

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Eskandar, E. N., Richmond, B. J., & Optican, L. M. (1992). Role of inferior temporal neurons in visual memory. I. Temporal encoding of information about visual images, recalled images, and behavioural context. J. Neurophysiol., 68, 1277–1295.
Fano, R. M. (1961). Transmission of information: A statistical theory of communications. Cambridge, MA: MIT Press.
Georgopoulos, A. P., Schwartz, A., & Kettner, R. E. (1986). Neural population coding of movement direction. Science, 233, 1416–1419.
Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G. L., & Gross, C. G. (1994). Neural ensemble encoding in inferior temporal cortex. J. Neurophysiol., 71, 2325–2337.
Golomb, D., Hertz, J., Panzeri, S., Treves, A., & Richmond, B. (1997). How well can we estimate the information carried in neuronal responses from limited samples? Neural Comp., 9, 649–655.
Heller, J., Hertz, J. A., Kjaer, T. W., & Richmond, B. J. (1995). Information flow and temporal coding in primate pattern vision. J. Comput. Neurosci., 2, 175–193.
Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1994). Decoding cortical neuronal signals: Network models, information estimation and spatial tuning. J. Comput. Neurosci., 1, 109–139.
Kullback, S. (1968). Information theory and statistics. New York: Dover.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex: III. Information theoretic analysis. J. Neurophysiol., 57, 162–178.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. T. (1999). On decoding the responses of a population of neurons from short time windows. Neural Comput., 11, 1553–1577.
Rolls, E. T., Critchley, H. D., & Treves, A. (1996). Representation of olfactory information in the primate orbitofrontal cortex. J. Neurophysiol., 75(5), 1982–1996.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Rolls, E. T., Treves, A., Robertson, R. G., Georges-François, P., & Panzeri, S. (1998). Information about spatial view in an ensemble of primate hippocampal cells. J.
Neurophysiol., 79, 1797–1813.
Rolls, E. T., Treves, A., & Tovée, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in primate temporal visual area. Exp. Brain Res., 114, 149–162.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neural population codes. Proc. Nat. Ac. Sci. USA, 90, 10749–10753.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. J. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Treves, A. (1997). On the perceptual structure of face space. BioSyst., 40, 189–196.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Comp., 7, 399–407.
Treves, A., Skaggs, W. E., & Barnes, C. A. (1996). How much of the hippocampus can be explained by functional constraints? Hippocampus, 6, 666–674.
Victor, J. D., & Purpura, K. P. (1996). Nature and precision of temporal coding in visual cortex: A metric space analysis. J. Neurophysiol., 76, 1310–1326.
Victor, J. D., & Purpura, K. P. (1997). Metric-space analysis of spike trains: Theory, algorithms and application. Network, 8, 127–164.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.

Received March 5, 2001; accepted June 1, 2001.
NOTE
Communicated by André Longtin
The Reliability of the Stochastic Active Rotator
K. Pakdaman
[email protected]
Inserm U444, Faculté de Médecine Saint-Antoine, 75571 Paris Cedex 12, France

The reliability of firing of excitable-oscillating systems is studied through the response of the active rotator, a neuronal model evolving on the unit circle, to white gaussian noise. A stochastic return map is introduced that captures the behavior of the model. This map has two fixed points: one stable and the other unstable. Iterates of all initial conditions except the unstable point tend to the stable fixed point for almost all input realizations. This means that to a given input realization, there corresponds a unique asymptotic response. In this way, repetitive stimulation with the same segment of noise realization evokes, possibly after a transient time, the same response in the active rotator. In other words, this model responds reliably to such inputs. It is argued that this results from the nonuniform motion of the active rotator around the unit circle and that similar results hold for other neuronal models whose dynamics can be approximated by phase dynamics similar to the active rotator.
Neurons are subject to various forms of internal and external noise (for a review, see Holden, 1976). Such perturbations can alter neuronal responses by reducing nonlinear distortions in the input-output relation or by enhancing sensitivity to weak signals (for a review, see Segundo, Vibert, Pakdaman, Stiber, & Diez-Martínez, 1994). The mechanisms of such noise effects have been investigated using neuronal models (Stein, French, & Holden, 1972; Longtin, 1993; Collins, Chow, & Imhoff, 1995; Bulsara, Elston, Doering, Lowen, & Lindenberg, 1996; Shimokawa, Pakdaman, & Sato, 1999a, 1999b; Shimokawa, Pakdaman, Takahata, Tanabe, & Sato, 2000; Shimokawa, Rogel, Pakdaman, & Sato, 1999; Tanabe, Sato, & Pakdaman, 1999). Noise-like inputs have also been used to reveal the firing reliability of neurons. Preparations of well-identified aplysia neurons (Bryant & Segundo, 1976), rat muscle spindles (Kröller, Grüsser, & Weiss, 1988), rat neocortical neurons (Mainen & Sejnowski, 1995), aplysia buccal pacemakers (Hunter, Milton, Thomas, & Cowan, 1998), salamander and rabbit retinal ganglion cells (Berry, Warland, & Meister, 1997), fly ganglion cells (Haag & Borst, 1998), and cat lateral geniculate cells (Reinagel & Reid, 2000) have all exhibited a remarkable reproducibility of discharge times when stimulated repetitively by the same noise-like input. Reliable firing constitutes a prerequisite for nervous systems to operate temporal codes. In this respect,

Neural Computation 14, 781–792 (2002)
© 2002 Massachusetts Institute of Technology
determining the conditions for reliable firing also serves to clarify whether nervous systems can potentially take advantage of it. The main purpose of this work is to present a theoretical analysis of firing reliability and to identify some of the key issues involved. Previous analyses of this phenomenon have discussed the importance of the voltage slope at threshold crossing and of the frequency content of the signal for reliable firing (Hunter et al., 1998; Cecchi et al., 2000). The approach here differs from these in that it relies on random dynamical system theory (Arnold, 1998). In fact, it is shown through the presentation of an elementary model that results from this field provide the proper framework for analyzing the reliability of neuronal systems. The model used to illustrate this point is the active rotator (AR) (Shinomoto & Kuramoto, 1986). The AR consists of a system moving around the unit circle whose phase θ satisfies

dθ/dt = f(θ) + I(t),   (1)
where θ ∈ S¹ (the unit circle), f(θ) = 1 − a sin(θ), and I(t) represents the stimulation. We denote by W(t, θ) the state of the AR at time t given the initial condition θ. The AR is one of the most elementary models of excitable-oscillating systems and is similar to the θ-neuron derived by Gutkin and Ermentrout (1998) as a canonical model for type I membranes. When |a| < 1, θ rotates around the unit circle, displaying periodic oscillations. Conversely, when |a| > 1, there is a pair of equilibrium phases, θ_s and θ_u, on the unit circle, the former locally stable and the latter unstable. Small perturbations of θ_s are rapidly damped out, while larger ones, which take the system over θ_u, result in a return to the stable point along the longer arc. In this way, θ_u acts as a threshold. Firing is here defined as θ rotating counterclockwise through 2π/3. The noisy AR, in which I is white gaussian noise, has been used as a means to investigate the influence of random perturbations on coupled oscillators (Stratonovich, 1967). The AR has also been one of the first models in which two topics, the influence of noise on the transition between excitable and oscillating regimes (Sigeti & Horsthemke, 1989) and stochastic resonance (Wiesenfeld, Pierson, Pantazelou, & Moss, 1994), have been studied. Similarly, the related θ-neuron has been used to investigate the mechanisms underlying high interspike-interval variability (Gutkin & Ermentrout, 1998). Furthermore, assemblies of interacting ARs have been shown to display synchronous oscillations in the presence of some noise (Shinomoto & Kuramoto, 1986; Kurrer & Schulten, 1995), which strongly influence their response to weak periodic forcing (Tanabe, Shimokawa, Sato, & Pakdaman, 1999). These studies are essentially concerned with noise-induced oscillations and their regularity. They do not treat reliability. This work relies on
The Reliability of the Stochastic Active Rotator
783
a different standpoint to address this question. To clarify the basis of the approach, the case of periodic forcing is first described. When the stimulation I is a T-periodic signal, the dynamics of the forced AR are analyzed through the iterates of a stroboscopic map F_T, which to θ ∈ S¹ associates ψ = F_T(θ) = W(T, θ), that is, the state of the AR after a time T given the initial condition θ. The stroboscopic map is an orientation-preserving diffeomorphism on S¹, so that its iterates display mainly one of two behaviors (de Melo & van Strien, 1993). For all initial θ_0 ∈ S¹, the sequence θ_n = F_T^n(θ_0) is either (1) asymptotically periodic with a period that does not depend on the initial condition or (2) quasiperiodic with an orbit that is dense in S¹. When θ_n is periodic, the unit displays a periodic succession of n firings per m input cycles, referred to as n:m phase locking. In this case, the units display phase synchrony with respect to the input. To illustrate this, let us consider an ensemble of ARs starting at randomly distributed initial phases and all receiving the same periodic stimulation. After a transient time, discharges of the units cluster into m groups such that units within each group fire simultaneously, and the discharge times of the different groups are phase shifted with respect to one another. A simple illustration of this is the case of 1:2 locking, where units fire every other input cycle, so that depending on the initial condition, they will fire in either even or odd cycles. In such situations, there is synchronization with respect to the input, but this does not ensure that all units discharge simultaneously.
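Before developing the theory, equation 1 and the two regimes it supports are easy to explore numerically. The following sketch is ours, not the author's: the function name `simulate_ar` and all parameter values are illustrative. It integrates the noisy AR with an Euler-Maruyama scheme, lifted to the real line so that each crossing of a multiple of 2π counts as one firing (one counterclockwise rotation):

```python
import math
import numpy as np

def simulate_ar(a, sigma, theta0=0.0, T=200.0, dt=1e-3, seed=0):
    """Euler-Maruyama integration of dtheta/dt = 1 - a*sin(theta) + noise,
    on the real line; each crossing of a multiple of 2*pi is one firing."""
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    noise = sigma * math.sqrt(dt) * rng.standard_normal(steps)
    x, firings, next_cross = theta0, 0, 2 * math.pi
    for k in range(steps):
        x += (1.0 - a * math.sin(x)) * dt + noise[k]
        while x >= next_cross:      # one counterclockwise rotation completed
            firings += 1
            next_cross += 2 * math.pi

    return firings

print(simulate_ar(a=0.9, sigma=0.0))  # oscillating regime: fires periodically
print(simulate_ar(a=1.1, sigma=0.0))  # excitable regime: rests at theta_s, no firing
print(simulate_ar(a=1.1, sigma=0.5))  # excitable regime: noise-induced firing
```

With |a| < 1 the rotator fires even without noise; with |a| > 1 it is excitable and only noise can carry it over the threshold θ_u.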
To distinguish between synchronization with respect to the input and interunit synchrony, that is, simultaneous firing of the units (or, equivalently, alignment of discharge times in raster plots recorded from repetitive stimulation of a single neuron), we say that the firing is absolutely reliable when the latter is satisfied, regardless of synchrony to the input. This definition reflects the experimental protocol used to assess neuronal reliability (see the references in the second paragraph). In this sense, absolutely reliable firing is the exception rather than the rule in response to periodic stimulation, as (1) quasiperiodic firing is unreliable, and (2) in phase-locked regimes, either input parameters have to be set so that the system operates in n:1 phase locking or initial conditions have to be selected to ensure full interunit synchrony. This situation contrasts with the case of noisy forcing, where the system almost surely achieves absolutely reliable firing, irrespective of model and input parameters. This holds in both the excitable regime (|a| > 1) and the oscillating regime (|a| < 1), as long as the AR's motion around the circle is nonuniform (a ≠ 0). The remainder of this note is devoted to the clarification of this point. At first, this result is announced from the point of view of random dynamical system theory. The presentation of this part follows Arnold (1998). Then, the interpretation of this result is developed through the introduction of a stochastic map, which plays a role similar to that of the stroboscopic map.
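The central claim, that a common noisy stimulation almost surely yields absolutely reliable firing, can be illustrated by simulation. In this sketch (ours; `ensemble_phases` and the parameter values are illustrative assumptions), an ensemble of ARs started at random phases receives one shared frozen noise path, and the circular spread of the final phases is reported:

```python
import math
import numpy as np

def ensemble_phases(a, sigma, n_units=50, T=100.0, dt=1e-3, seed=1):
    """Drive an ensemble of ARs, started at random phases, with ONE shared
    (frozen) white-noise stimulation; return the circular spread of the
    final phases (0 means all units end in the same state)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2 * math.pi, n_units)   # random initial conditions
    steps = int(T / dt)
    noise = sigma * math.sqrt(dt) * rng.standard_normal(steps)  # common input
    for k in range(steps):
        theta += (1.0 - a * np.sin(theta)) * dt + noise[k]
    # 1 minus the resultant vector length of the phases on the circle
    return 1.0 - np.abs(np.mean(np.exp(1j * theta)))

print(ensemble_phases(a=1.1, sigma=0.5))  # near 0: interunit synchrony
print(ensemble_phases(a=0.0, sigma=0.5))  # a = 0: the initial spread persists
```

With a ≠ 0 the phases aggregate onto a single trajectory; with a = 0 the motion is uniform and the common noise merely shifts all units together.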
784
K. Pakdaman
When the stimulation I is white gaussian noise of intensity σ, equation 1 is a stochastic differential equation with an associated Fokker-Planck equation:

∂p/∂t (t, θ) = −(∂/∂θ){ f(θ) p(t, θ) } + (σ²/2) ∂²p/∂θ² (t, θ),    (2)
with periodic boundary conditions p(t, θ + 2π) = p(t, θ). For all initial densities, the solutions of equation 2 tend eventually to the same stationary distribution denoted p_s(θ) (Stratanovich, 1967). The Lyapunov exponent associated with equation 1 is then given by Arnold (1998):

λ = lim_{t→∞} (1/t) ∫_0^t f′(θ(s)) ds    (3)

  = ∫_0^{2π} f′(θ) p_s(θ) dθ    (4)

  = −(σ²/2) ∫_0^{2π} p_s′(θ)² / p_s(θ) dθ,    (5)
where f′ and p_s′ are, respectively, the derivatives of f and p_s. Equation 5 shows that λ < 0 unless p_s′(θ) = 0 for all θ. This means that λ < 0 for a ≠ 0, and λ = 0 when a = 0. Negative Lyapunov exponents are indicative of local stability of solutions, in the sense that when the Lyapunov exponent computed along a solution W(t, θ_0) going through the initial condition θ_0 is negative, then solutions going through θ_0 + δθ, with δθ sufficiently small, tend asymptotically to W(t, θ_0). Thus, local stability is defined in terms of asymptotic stability of solutions with respect to small variations of initial conditions. When the stochastic dynamics are ergodic and are confined to a bounded set, local stability also constrains the global stability with respect to variations in initial conditions (Le Jan, 1987; Crauel, 1990). More precisely, the fact that the Lyapunov exponent is negative implies that when a ≠ 0, there are two stationary stochastic processes, a(t) and b(t), that are solutions of equation 1 and such that all solutions of equation 1, except b(t), eventually approach a(t).¹ Conversely, when time is reversed, all solutions except a(t) tend to b(t). The two solutions a(t) and b(t) are referred to as stochastic equilibria of the system (Arnold, 1998), with a(t) being stable and b(t) unstable. In the deterministic AR, a bifurcation takes place at |a| = 1, where the two equilibria coexisting for |a| > 1 coalesce and then vanish for |a| < 1. The bifurcation at |a| = 1 separates the excitable and oscillating regimes. There are no such qualitative changes in the stochastic AR: for all a ≠ 0,
¹ In general, for a nonconstant f, there are finitely many processes a(t) and b(t).
there are exactly one stable and one unstable stochastic equilibrium, even in the oscillating range |a| < 1 (a ≠ 0), where the deterministic AR has no equilibria. The following paragraphs detail how the results on the random dynamics of the AR imply that the AR receiving a white noise stimulation is reliable. The key to this interpretation is the introduction of a stochastic map, which plays a role similar to that of the stroboscopic map, except that in contrast to the stroboscopic map, which can display various forms of phase locking and quasiperiodicity depending on input and system parameters, the iterates of the stochastic map tend almost surely to a unique fixed point, regardless of the choice of the parameters, thus ensuring absolute reliability. For the construction of the stochastic return map, we introduce the following equation,

dx/dt = f(x) + I(t),    (6)

which is the same as equation 1, except that x is on the real line, not on S¹. We remind the reader that f(x) = 1 − a sin(x) and I(t) represents white gaussian noise. The construction of the map holds in both excitable and oscillating regimes. For clarity, in the following we use Roman letters x, u, and v and Greek letters θ, ψ when the dynamics are considered on the real line and on the unit circle (identified with [0, 2π], with θ = 0 and θ = 2π representing the same point in S¹), respectively. Starting from x(t = 0) = 0, we denote by t_1 the first passage time of x(t) through 2π. In other words, x(t_1) = 2π, and x(s) < 2π for all s ∈ [0, t_1). For u ∈ [0, 2π], we define R_{t_1}(u) = x(t_1, u), where t → x(t, u) is the solution of equation 6 with the initial condition x(0, u) = u. Then R_{t_1} is a diffeomorphism from [0, 2π] onto [2π, 4π] and satisfies (1) R_{t_1}(0) = 2π, (2) R_{t_1}(2π) = 4π, and (3) R_{t_1}(u) > R_{t_1}(v) whenever u > v; that is, R_{t_1} is strictly increasing. The construction of R_{t_1} is illustrated in Figure 1. For θ = x ∈ [0, 2π], we define the stochastic return map F_{t_1}(θ) = R_{t_1}(x) − 2π. The map F_{t_1} is a monotonically increasing diffeomorphism of [0, 2π] onto itself, which satisfies F_{t_1}(0) = 0 and F_{t_1}(2π) = 2π. The map F_{t_1}, when considered from S¹ into itself, is the stochastic return map associated with the noisy active rotator. Indeed, t_1 is the time at which the state point initially at θ = 0 accomplishes one counterclockwise rotation around the circle and returns to the same position. In the same way, we denote by t_2 (> t_1) the time after which this state point accomplishes another counterclockwise rotation around the circle. This rotation is associated with a new return map F_{T_2}, with T_2 = t_2 − t_1 the duration of the second rotation. The construction can be continued, with F_{T_n} being the return map between the (n − 1)th and the nth counterclockwise rotations around the circle.
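The first-passage construction of R_{t_1} can be reproduced numerically by driving the reference solution x(t, 0) and the solutions x(t, u) with one frozen noise path. A sketch (ours; `return_map`, the grid of initial conditions, and the parameter values are illustrative):

```python
import numpy as np

def return_map(a, sigma, us, dt=1e-3, seed=2, max_steps=10**6):
    """Approximate R_{t1}: integrate dx/dt = f(x) + I(t) on the real line
    under one frozen noise path, from x = 0 and from each u in us, until
    the reference solution first crosses 2*pi (time t1)."""
    rng = np.random.default_rng(seed)
    f = lambda x: 1.0 - a * np.sin(x)
    x0 = 0.0
    xs = np.asarray(us, dtype=float)
    for _ in range(max_steps):
        dW = sigma * np.sqrt(dt) * rng.standard_normal()
        x0 += f(x0) * dt + dW          # reference solution x(t, 0)
        xs += f(xs) * dt + dW          # solutions x(t, u), same noise
        if x0 >= 2 * np.pi:            # first passage through 2*pi: t1 reached
            return xs
    raise RuntimeError("no passage through 2*pi within max_steps")

vals = return_map(a=0.9, sigma=0.3, us=[0.5, 1.5, 3.0, 4.5])
print(vals)  # values lie in (2*pi, 4*pi) and increase strictly with u
```

Because equation 6 is one-dimensional and the noise is shared, the solutions cannot cross, which is why the computed map is strictly increasing, as properties 1 to 3 above require.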
Figure 1: Schematic illustration of the construction of the return map. The abscissa is time in arbitrary units, and the ordinate is the dimensionless variable x. The curves represent solutions x(t, 0), x(t, v), x(t, u), and x(t, 2π) of equation 6 for the initial conditions 0, v, u, and 2π. Given that the right-hand side of equation 6 is 2π-periodic with respect to x, we have x(t, 2π) = 2π + x(t, 0); the upper thick line is obtained by translating upward by 2π the lower thick line. Thus, x(t, 2π) reaches 4π for the first time at the same t_1 at which x(t, 0) reaches 2π for the first time. In this way, given that equation 6 is one-dimensional, so that its solutions cannot cross one another when plotted against time, for any u in (0, 2π), x(t, u) is confined between the two thick lines, and, consequently, x(t_1, u) is within (2π, 4π). In other words, u → x(t_1, u) maps [0, 2π] onto [2π, 4π]. This map is denoted by R_{t_1}, and it satisfies R_{t_1}(v) < R_{t_1}(u) for v < u.

The F_{T_n} form a family of independent and identically distributed orientation-preserving circle diffeomorphisms. For the analysis of the dynamics, it is more convenient to consider the F_{T_n} as maps of [0, 2π] onto itself rather than as circle maps. Each F_{T_n} is strictly increasing on this interval and satisfies F_{T_n}(0) = 0 and F_{T_n}(2π) = 2π. Denoting G_n = F_{T_n} ∘ F_{T_{n−1}} ∘ · · · ∘ F_{T_1}, the dynamics of the noisy AR are determined by the iterates θ_n = G_n(θ_0). These iterates have two fixed points: G_n(0) = 0 and G_n(2π) = 2π. Usually the local stability of a fixed point of a map is determined by the slope of this map at that point. The local stability of the fixed point θ = 0 with respect to small variations in initial conditions can be measured by the following quantity, which extends that of the slope of a map at a fixed point (Arnold, 1998):

S̄ = lim_{n→∞} [ ∏_{k=1}^{n} F_{T_k}′(0) ]^{1/n}.    (7)
Given that

F_{T_k}′(0) = exp{ ∫_{t_{k−1}}^{t_k} f′(θ(s)) ds },    (8)

where θ(s) denotes the solution of equation 1 with θ(0) = 0 and exactly one counterclockwise turn on every interval [t_{k−1}, t_k], we have:

S̄ = lim_{n→∞} exp{ (1/n) ∫_0^{t_n} f′(θ(s)) ds }    (9)

  = exp[ λT̄ ],    (10)
where T̄ = lim_{n→∞} t_n / n, so that 1/T̄ is the mean rate of rotation (i.e., mean discharge rate), and λ is the Lyapunov exponent defined in equation 3. Given that λ < 0 for a ≠ 0, we have 0 ≤ S̄ < 1, which implies that the fixed point θ = 0 is locally stable. Given that θ = 0 and θ = 2π represent the same point in S¹, the slope at θ = 2π is exactly the same as that at θ = 0, so that both fixed points are locally stable under the action of the G_n. This means that starting with some θ_0 ∈ [0, 2π] close to zero (resp. 2π), the sequence θ_n = G_n(θ_0) tends to zero (resp. 2π). Furthermore, since the G_n are all increasing, if θ_n = G_n(θ_0) → 0 (resp. 2π) as n → +∞, so does θ′_n = G_n(θ′_0) for all 0 ≤ θ′_0 ≤ θ_0 (resp. θ_0 ≤ θ′_0 ≤ 2π). We take θ* = sup{θ_0 : θ_n = G_n(θ_0) → 0} and ψ* = inf{θ_0 : θ_n = G_n(θ_0) → 2π}; then [0, θ*) and (ψ*, 2π] are the basins of attraction of zero and 2π, and we have θ* ≤ ψ*. In fact, because of the results on the random dynamics of the AR, we have θ* = ψ* = b(0) and, more generally, G_n(θ*) = b(t_n). Indeed, equation 1 has the unique unstable stochastic equilibrium b(t), and all other solutions aggregate around a single one, a(t). Assuming θ* < ψ* contradicts this because it implies that solutions with initial conditions θ in the nonempty open interval (θ*, ψ*) would not converge to the others. Thus, this set must be empty, and θ* = ψ* = b(0). This implies that the iterates of all points on [0, 2π], except θ*, tend either to zero or 2π depending on whether they are smaller or larger than θ*. Figure 2 illustrates this point. The two panels represent examples of G_n for two different values of a, in the excitable and oscillating regimes. In both examples, the iterates become steplike, with plateaus at zero and 2π. The sharp increase between the two flat parts takes place around the iterates of θ*.
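The Lyapunov exponent that controls this contraction can be estimated directly from equation 3 as a time average of f′(θ(s)) along one simulated trajectory. A sketch (ours; `lyapunov_estimate` and the parameter values are illustrative):

```python
import math
import numpy as np

def lyapunov_estimate(a, sigma, T=500.0, dt=2e-3, seed=3):
    """Estimate the Lyapunov exponent of equation 1 as the time average of
    f'(theta(s)) = -a*cos(theta(s)) along one noisy trajectory (equation 3)."""
    rng = np.random.default_rng(seed)
    steps = int(T / dt)
    noise = sigma * math.sqrt(dt) * rng.standard_normal(steps)
    theta, acc = 0.0, 0.0
    for k in range(steps):
        acc += -a * math.cos(theta) * dt          # accumulate f'(theta) dt
        theta += (1.0 - a * math.sin(theta)) * dt + noise[k]
    return acc / T

print(lyapunov_estimate(a=1.1, sigma=0.5))  # clearly negative: contraction
print(lyapunov_estimate(a=0.0, sigma=0.5))  # zero: uniform rotation, no contraction
```

The negative value for a ≠ 0 is the numerical counterpart of 0 ≤ S̄ < 1; for a = 0, f′ vanishes identically and the exponent is exactly zero.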
The fact that for the interval maps the iterates of initial conditions tend to zero or 2π means that for the noisy AR, the corresponding sequence θ_n tends to the origin of phases, because zero and 2π represent the same point on S¹. In other words, this point is the unique stable fixed point of the products of stochastic return maps of the AR, and it attracts all orbits except that of θ*. Finally, the fact that iterates of most initial phases tend to zero
Figure 2: Iterates of the stochastic circle map associated with the AR, for a = 0.9 and σ = 0.2 (left) and a = 1.1 and σ = 0.5 (right). The left panel shows G_n(θ) for n = 1, 3, 5, 10, and 15 and the right panel for n = 1 and 2. In both cases, the maps G_n tend to a steplike shape, illustrating the convergence of the iterates of most initial conditions to either zero or 2π.
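The steplike limit shown in Figure 2 can be reproduced by carrying a grid of initial phases along one frozen noise path and reading off G_n once the reference solution has completed n rotations. A sketch (ours; `composed_map` and its parameters are illustrative):

```python
import numpy as np

def composed_map(a, sigma, n_turns, n_grid=41, dt=1e-3, seed=4, max_steps=10**7):
    """Approximate G_n on a grid of initial phases: drive the reference
    solution x(t, 0) through n_turns counterclockwise rotations under one
    frozen noise path, carrying the grid solutions along, and map the
    result back to [0, 2*pi] (up to discretization error)."""
    rng = np.random.default_rng(seed)
    f = lambda x: 1.0 - a * np.sin(x)
    x0 = 0.0
    xs = np.linspace(0.0, 2 * np.pi, n_grid)
    target = 2 * np.pi * n_turns
    for _ in range(max_steps):
        dW = sigma * np.sqrt(dt) * rng.standard_normal()
        x0 += f(x0) * dt + dW
        xs += f(xs) * dt + dW
        if x0 >= target:               # n_turns rotations completed
            return xs - target
    raise RuntimeError("too few steps to complete n_turns rotations")

g = composed_map(a=1.1, sigma=0.5, n_turns=3)
print(np.mean((g < 0.2) | (g > 2 * np.pi - 0.2)))  # fraction on the plateaus
```

After a few rotations, nearly every grid point sits on one of the two plateaus at 0 or 2π, with the sharp rise confined to the neighborhood of the iterates of θ*.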
means that after some transient time, AR units started with different initial conditions discharge simultaneously: the spike train evoked by the noiselike stimulation is absolutely reliable. In this sense, noiselike stimulations lead to a response that, as far as the long-term dynamics of the iterates of the return map are concerned, resembles 1:1 phase locking. In other words, while, depending on the model and stimulation parameters, the map associated with periodic forcing can have diverse dynamics, the one derived for white gaussian noise stimulation always converges to a unique fixed point. This difference is not confined to the particular example presented here. In fact, orbits of products of independent and identically distributed orientation-preserving diffeomorphisms converge to a unique fixed point under more general conditions (Kaijser, 1993). This remarkable property suggests that whenever the response of a neuron can be approximated by an orientation-preserving circle map, variable inputs evoke reliable firing, even when regular ones, such as pacemaker stimulation, do not. We have verified this point in a number of examples using neuronal models such as the FitzHugh-Nagumo (Pakdaman & Mestivier, 2001). In conclusion, this work has presented an analysis of firing reliability in a neuronal model, namely, the active rotator. It showed that for this model, the firing evoked by white gaussian noise stimulation is reliable. The key to this result was that when the parameter a is different from zero, the Lyapunov exponent of the system is negative, which indicates that trajectories converge to one another. The parameter a controls the motion of the system around the circle. Thus, the important property of the system that leads to reliability is the fact that the speed of the movement of the system around the circle is not constant; at some phases, the system moves faster than
at others. Let us provide a heuristic explanation for the relation between the nonuniform motion around the circle and the reliability through one example, that is, when 0 ≤ a < 1. When a = 0, all points on the circle rotate at the same speed, regardless of their position, even when there is a stimulation. Thus, the distance between two points remains the same over time. In contrast, when 0 < a < 1, points get close to one another on (−π/2, π/2), where f′ < 0, and move apart in (π/2, 3π/2), where f′ > 0. Thus, there is an alternation of contraction and expansion regimes. In the absence of stimulation (I = 0), these effects balance each other out, so that after one turn around the circle, the distances between two points recover their initial value. Conversely, when I is a white gaussian noise, the balance between the expansion and the contraction breaks down: the system spends more time in the contracting regions. This is the meaning of equation 4, in which p_s(θ) dθ can be interpreted as the fraction of time the AR spends in (θ, θ + dθ). Because of this difference, the distance between points decreases over time, leading to the eventual aggregation of trajectories around a single one. Interestingly, nonuniform movement around a closed trajectory is common in neuronal systems, because typically the action potential lasts a few milliseconds and its onset corresponds to the expanding regime, while the refractory period can take orders of magnitude longer and corresponds to the contracting regime. In this way, our analysis predicts that such systems respond reliably to noiselike stimuli as long as they remain in the vicinity of the closed trajectory. The analysis performed in this work establishes that for a typical noiselike stimulation, there exists a single asymptotically stable solution, which attracts the others. Thus, the firing sequence observed is, after possibly a transient time, formed by the discharge times of this specific solution.
This phenomenon accounts for absolute reliability. Neurons are subject to internal and external noise, due, for instance, to spontaneous synaptic activity and fluctuations in the opening and closing of ionic channels (White, Rubinstein, & Kay, 2000). Adding perturbations to the dynamics of the AR, for instance, through an extra white gaussian noise term representing internal noise in equation 1, is one way to account for these and evaluate their influence on neuronal reliability. Such analyses have been carried out in previous studies of the AR and other neuronal models. The following paragraphs provide a synthetic summary of their results, adapted to the case of the AR, with the notations and concepts introduced in this study. In terms of the dynamics of the aperiodically forced AR (see equation 1), small perturbations, representing noise, evoke small variations around the stable stochastic equilibrium point a(t). These fluctuations induce some variability in the discharge times. The extent of the variability is a decreasing function of the slope of the stochastic equilibrium at the onset of firing (Freidlin & Wentzell, 1998). Heuristically, this means that the steeper the slope at threshold and the shorter the time interval the system spends in the
expanding region, the lower is the perturbation-induced variability in the discharge time. The importance of the slope at threshold in firing reliability has been analyzed by Cecchi et al. (2000). The influence of large perturbations on firing reliability is the one expected when the AR is in the oscillating regime: reliability decreases with perturbation intensity. The situation is similar when the AR is excitable and the amplitude of the stimulation I(t) is large. However, for an excitable AR receiving a weak input I(t), firing reliability has a nonmonotonic dependence on the perturbation intensity: it increases up to a certain optimal perturbation level and then decreases. This surprising effect comes from the fact that for low subthreshold perturbations, discharges appear and cluster around time intervals where the stochastic equilibrium a(t) displays large subthreshold oscillations. A description of this phenomenon when the signal I(t) is (1) a single excitatory postsynaptic potential is given in Pei, Wilkens, and Moss (1996), Tanabe, Sato, and Pakdaman (1999), and Tanabe and Pakdaman (in press), (2) a periodic input in Shimokawa, Rogel, Pakdaman, and Sato (1999), and (3) a periodic input or an aperiodic one resembling the one considered in this work in Tanabe and Pakdaman (in press b). Finally, besides the analysis of the specific model, another contribution of this work is methodological. In the same way that the standard methods from the geometric theory of dynamical systems have proven to be useful in understanding the complex responses of neurons to periodic stimulation or the complex intrinsic oscillations displayed by some cells, the extension of these methods to stochastic systems, the theory of random dynamical systems, is an invaluable tool for analyzing the response of neurons to other classes of stimulation, such as repeated presentation of the same noiselike inputs.
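The degrading effect of independent perturbations on reliability, summarized above, can be sketched in the same framework by giving every unit the shared stimulation plus its own internal noise. (Ours; `final_spread` and the parameter values are illustrative, and this sketch probes only the monotonic large-perturbation trend, not the nonmonotonic weak-input effect described above.)

```python
import math
import numpy as np

def final_spread(a, s_common, s_internal, n_units=50, T=50.0, dt=1e-3, seed=6):
    """Circular spread of final phases for an ensemble receiving one shared
    noisy stimulation plus independent internal noise per unit
    (0 = perfectly reliable firing)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2 * math.pi, n_units)
    for _ in range(int(T / dt)):
        common = s_common * math.sqrt(dt) * rng.standard_normal()
        internal = s_internal * math.sqrt(dt) * rng.standard_normal(n_units)
        theta += (1.0 - a * np.sin(theta)) * dt + common + internal
    return 1.0 - np.abs(np.mean(np.exp(1j * theta)))

print(final_spread(a=1.1, s_common=0.5, s_internal=0.0))  # reliable firing
print(final_spread(a=1.1, s_common=0.5, s_internal=0.3))  # reliability degraded
```

The internal noise keeps the units jittering around the stable stochastic equilibrium a(t), so the spread settles at a nonzero level instead of collapsing to zero.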
The example treated in this work aims to illustrate this point by showing how the issue of firing reliability can be considered as that of the convergence of a random dynamical system to a stochastic fixed point. As a confirmation, we have applied this methodology to other neuronal models, such as the Hodgkin-Huxley model and the FitzHugh-Nagumo model. The detailed results will be reported elsewhere (Pakdaman & Tanabe, 2001). Furthermore, our analysis also highlighted that the key measure for reliability is the Lyapunov exponent of the system. When this quantity is negative, the firing tends to be reliable. In this way, we have at our disposal a measure that can be estimated from experimental recordings and can be used in the assessment of firing reliability.
References

Arnold, L. (1998). Random dynamical systems. Berlin: Springer-Verlag.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. U.S.A., 94, 5411–5416.
Bryant, H., & Segundo, J. P. (1976). Spike initiation by transmembrane current: A white noise analysis. J. Physiol., 260, 279–314.
Bulsara, A. R., Elston, T. C., Doering, C. R., Lowen, S. B., & Lindenberg, K. (1996). Cooperative behavior in periodically driven noisy integrate-fire models of neuronal dynamics. Phys. Rev. E, 53, 3958–3969.
Cecchi, G. A., Sigman, M., Alonso, J. M., Martinez, L., Chialvo, D. R., & Magnasco, M. O. (2000). Noise in neurons is message dependent. Proc. Natl. Acad. Sci. U.S.A., 97, 5557–5561.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995). Stochastic resonance without tuning. Nature, 376, 236–238.
Crauel, H. (1990). Extremal exponents of random dynamical systems do not vanish. J. Dynam. Differ. Equ., 2, 245–291.
de Melo, W., & van Strien, S. (1993). One dimensional dynamics. Berlin: Springer-Verlag.
Freidlin, M. I., & Wentzell, A. D. (1998). Random perturbations of dynamical systems (2nd ed.). Berlin: Springer-Verlag.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Computation, 10, 1047–1065.
Haag, J., & Borst, A. (1998). Encoding of visual motion information and reliability in spiking and graded potential neurons. J. Neurosci., 17, 4809–4819.
Holden, A. V. (1976). Models of the stochastic activity of neurones. Berlin: Springer-Verlag.
Hunter, J. D., Milton, J. G., Thomas, P. J., & Cowan, J. D. (1998). Resonance effect for neural spike time reliability. J. Neurophysiol., 80, 1427–1438.
Kaijser, T. (1993). On stochastic perturbations of iterations of circle maps. Physica D, 68, 201–231.
Kröller, J., Grüsser, O. J., & Weiss, L. R. (1988). Observations on phase-locking within the response of primary muscle spindle afferents to pseudo-random stretch. Biol. Cybern., 59, 49–54.
Kurrer, C., & Schulten, K. (1995). Noise-induced synchronous neuronal oscillations. Phys. Rev. E, 51, 6213–6218.
Le Jan, Y. (1987). Équilibre statistique pour les produits de difféomorphismes aléatoires indépendants. Ann. Inst. Henri Poincaré Probab. Statist., 23, 111–120.
Longtin, A. (1993). Stochastic resonance in neuron models. J. Stat. Phys., 70, 309–327.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Pakdaman, K., & Mestivier, D. (2001). External noise synchronizes forced oscillators. Phys. Rev. E, 64, 030901(R).
Pakdaman, K., & Tanabe, S. (2001). Random dynamics of the Hodgkin-Huxley neuron model. Phys. Rev. E, 64, 050902(R).
Pei, X., Wilkens, L., & Moss, F. (1996). Noise-mediated spike timing precision from aperiodic stimuli in an array of Hodgkin-Huxley-type neurons. Phys. Rev. Lett., 77, 4679–4682.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Segundo, J. P., Vibert, J. F., Pakdaman, K., Stiber, M., & Diez-Martínez, O. (1994). Noise and the neurosciences: A long history with a recent revival (and some theory). In K. Pribram (Ed.), Origins: Brain and self organization (pp. 299–331). Hillsdale, NJ: Erlbaum.
Shimokawa, T., Pakdaman, K., & Sato, S. (1999a). Time-scale matching in the response of a leaky integrate-and-fire neuron model to periodic stimulus with additive noise. Phys. Rev. E, 59, 3427–3443.
Shimokawa, T., Pakdaman, K., & Sato, S. (1999b). Mean discharge frequency locking in the response of a noisy neuron model to subthreshold periodic stimulation. Phys. Rev. E, 60, 33–36.
Shimokawa, T., Pakdaman, K., Takahata, T., Tanabe, S., & Sato, S. (2000). A first-passage-time analysis of the periodically forced noisy leaky integrate-and-fire model. Biol. Cybern., 83, 327–340.
Shimokawa, T., Rogel, A., Pakdaman, K., & Sato, S. (1999). Stochastic resonance and spike timing precision in an ensemble of leaky integrate-and-fire neuron models. Phys. Rev. E, 59, 3461–3470.
Shinomoto, S., & Kuramoto, Y. (1986). Phase transitions in active rotator systems. Progr. Theor. Phys., 75, 1105–1110.
Sigeti, D., & Horsthemke, W. (1989). Pseudo-regular oscillations induced by external noise. J. Stat. Phys., 54, 1217–1222.
Stein, R. B., French, A. S., & Holden, A. V. (1972). The frequency response, coherence, and information capacity of two neuronal models. Biophys. J., 12, 295–322.
Stratanovich, R. L. (1967). Topics in the theory of random noise (Vol. 2). New York: Gordon and Breach.
Tanabe, S., & Pakdaman, K. (2001). Noise induced transition in neuronal models. Biol. Cybern., 85, 269–280.
Tanabe, S., & Pakdaman, K. (in press). Noise enhanced neuronal reliability. Phys. Rev. E.
Tanabe, S., Sato, S., & Pakdaman, K. (1999). Response of an ensemble of noisy neuron models to a single input. Phys. Rev. E, 60, 7235–7238.
Tanabe, S., Shimokawa, T., Sato, S., & Pakdaman, K. (1999). Response of coupled noisy excitable systems to weak stimulation. Phys. Rev. E, 60, 2182–2185.
White, J. A., Rubinstein, J. T., & Kay, A. R. (2000). Channel noise in neurons. Trends in Neurosci., 23, 131–137.
Wiesenfeld, K., Pierson, D., Pantazelou, E., & Moss, F. (1994). Stochastic resonance on a circle. Phys. Rev. Lett., 72, 2125–2129.

Received January 23, 2001; accepted June 18, 2001.
LETTER
Communicated by Misha Tsodyks
A Proposed Function for Hippocampal Theta Rhythm: Separate Phases of Encoding and Retrieval Enhance Reversal of Prior Learning

Michael E. Hasselmo
[email protected]
Clara Bodelón
[email protected]
Bradley P. Wyble
[email protected]
Department of Psychology, Program in Neuroscience and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
The theta rhythm appears in the rat hippocampal electroencephalogram during exploration and shows phase locking to stimulus acquisition. Lesions that block theta rhythm impair performance in tasks requiring reversal of prior learning, including reversal in a T-maze, where associations between one arm location and food reward need to be extinguished in favor of associations between the opposite arm location and food reward. Here, a hippocampal model shows how theta rhythm could be important for reversal in this task by providing separate functional phases during each 100–300 msec cycle, consistent with physiological data. In the model, effective encoding of new associations occurs in the phase when synaptic input from entorhinal cortex is strong and long-term potentiation (LTP) of excitatory connections arising from hippocampal region CA3 is strong, but synaptic currents arising from region CA3 input are weak (to prevent interference from prior learned associations). Retrieval of old associations occurs in the phase when entorhinal input is weak and synaptic input from region CA3 is strong, but when depotentiation occurs at synapses from CA3 (to allow extinction of prior learned associations that do not match current input). These phasic changes require that LTP at synapses arising from region CA3 should be strongest at the phase when synaptic transmission at these synapses is weakest. Consistent with these requirements, our recent data show that synaptic transmission in stratum radiatum is weakest at the positive peak of local theta, which is when previous data show that induction of LTP is strongest in this layer.
Neural Computation 14, 793–817 (2002)
© 2002 Massachusetts Institute of Technology
794
M. E. Hasselmo, C. Bodelón, and B. P. Wyble
1 Introduction
The hippocampal theta rhythm is a large-amplitude, 3–10 Hz oscillation that appears prominently in the rat hippocampal electroencephalogram during locomotion or attention to environmental stimuli and decreases during immobility or consummatory behaviors such as eating or grooming (Green & Arduini, 1954; Buzsaki, Leung, & Vanderwolf, 1983). In this article, we link the extensive data on physiological changes during theta to the specific requirements of behavioral reversal tasks. Theta rhythm is associated with phasic changes in the magnitude of synaptic currents in different layers of the hippocampus, as shown by current source density analysis of hippocampal region CA1 (Buzsaki, Czopf, Kondakor, & Kellenyi, 1986; Brankack, Stewart, & Fox, 1993; Bragin et al., 1995). As summarized in Figure 1, at the trough of the theta rhythm recorded at the hippocampal fissure, afferent synaptic input from the entorhinal cortex is strong and synaptic input from region CA3 is weak (Brankack et al., 1993), but long-term potentiation at synapses from region CA3 is strong (Holscher, Anwyl, & Rowan, 1997; Wyble, Hyman, Goyal, & Hasselmo, 2001). In contrast, at the peak of the fissure theta rhythm, afferent input from entorhinal cortex is weak, the amount of synaptic input from region CA3 is strong (Brankack et al., 1993), but stimulation of synaptic input from region CA3 induces depotentiation (Holscher et al., 1997).
1.1 Role of Theta Rhythm in Behavior. This article relates these phasic changes in synaptic properties to data showing behavioral effects associated with blockade of theta rhythm. Physiological data suggest that rhythmic input from the medial septum paces theta-frequency oscillations recorded from the hippocampus and entorhinal cortex (Stewart & Fox, 1990; Toth, Freund, & Miles, 1997). This regulatory input from the septum enters the hippocampus via the fornix, and destruction of this fiber pathway attenuates the theta rhythm (Buzsaki et al., 1983). Numerous studies show that fornix lesions cause strong impairments in reversal of previously learned behavior (Numan, 1978; M'Harzi et al., 1987; Whishaw & Tomie, 1997), including reversal of spatial response in a T-maze. The T-maze task is shown schematically in Figure 2. During each trial in a T-maze reversal, a rat starts in the stem of the maze and finds food reward when it runs down the stem and into one arm of the maze. After extensive training of this initial association, the food reward is moved to the opposite arm of the maze, and the rat must extinguish the association between left arm and food and learn the new association between right arm and food. Rats with fornix lesions make more errors after the reversal, continuing to visit the old location, which is no longer rewarded. In the analysis and simulations presented here, we test the hypothesis that phasic changes in the amount of synaptic input and synaptic modification during cycles of the hippocampal theta rhythm could enhance reversal of prior learning. Specifically, these phasic changes allow new afferent input (from entorhinal cortex) to be strong when synaptic modification is strong, to encode new associations between place and food reward, without interference caused by synaptic input (from region CA3) representing retrieval of old associations. Synaptic input from region CA3 is strong on a separate phase, when depotentiation at these synapses can allow extinction of old associations.

Figure 1: Schematic representation of the change in dynamics during hippocampal theta rhythm oscillations. (Left) Appropriate dynamics for encoding. At the trough of the theta rhythm in the EEG recorded at the hippocampal fissure, synaptic transmission arising from entorhinal cortex is strong. Synaptic transmission arising from region CA3 is weak, but these same synapses show a strong capacity for long-term potentiation. This allows the afferent input from entorhinal cortex to set patterns to be encoded, while preventing interference from previously encoded patterns on the excitatory synapses arising from region CA3. (Right) Appropriate dynamics for retrieval. At the peak of the theta rhythm, synaptic transmission arising from entorhinal cortex is relatively weak (though strong enough to provide retrieval cues). In contrast, the synaptic transmission arising from region CA3 is strong, allowing effective retrieval of previously encoded sequences. At this phase, synapses undergo depotentiation rather than LTP, preventing further encoding of retrieval activity and allowing forgetting of incorrect retrieval.
M. E. Hasselmo, C. Bodelón, and B. P. Wyble
Figure 2: (A) Schematic T-maze illustrating behavioral response during different stages in the task. Location of food reward is marked with the letter F. (A1) Initial learning of left turn response. (A2) Error after reversal. No food reward is found in the left arm of the maze. (A3) Correct after reversal. After a correct response, food reward is found in the right arm of the maze. (A4) Retrieval choice. Analysis focuses on performance at the choice point in a retrieval trial (r), when network activity reflects memory for food location. (B) Functional components of the network during different trials in a behavioral task. (B1) Initial encoding trials (n). When in the left arm, the rat encodes associations between the left arm CA3 place cell activity (L) and left arm food reward activity (F). Thick lines represent strengthened synapses. (B2) Performance of an error after reversal (e). When in the left arm (L), no food reward is found. This results in weakening of the previous association (thinner line) between left arm place and left arm food reward. (No food reward activity appears because the rat does not find reward in error trials.) (B3) Encoding of correct reversal trial (c). When in the right arm, food reward is found. This results in encoding of an association between right arm CA3 place cell activity (R) and right arm food reward (F) activated by entorhinal input. (B4) Retrieval of correct reversal trial (r). This test focuses on activity observed when the rat is at the choice point (Ch), allowing retrieval of both arm place representations (L and R) and potential retrieval of their food reward associations.
1.2 Overview of the Model. In the model, we focus on the strengthening and weakening of associations between location and food reward. These associations are encoded by modifying synaptic connections between neurons in region CA3 and region CA1 representing location (place cells) and food reward. The place representations are consistent with evidence of hippocampal neurons showing place-selective responses in a variety of spatial environments (McNaughton, Barnes, & O’Keefe, 1983; Skaggs, McNaughton, Wilson, & Barnes, 1996). The food reward representations are consistent with studies showing unit activity selective for the receipt of reward during behavioral tasks (Wiener, Paul, & Eichenbaum, 1989; Otto & Eichenbaum, 1992; Hampson, Simeral, & Deadwyler, 1999; Young, Otto, Fox, & Eichenbaum, 1997). Units have been shown to be selective for the receipt of reward at a particular location (Wiener et al., 1989), and reward-dependent responses have been shown in both region CA1 (Otto & Eichenbaum, 1992) and entorhinal cortex (Young et al., 1997). This model assumes place cell representations already exist and does not focus on their formation or properties, which have been modeled previously (Kali & Dayan, 2000). Instead, the mathematical analysis focuses on encoding and retrieval of associations between place cell activity in each maze arm and the representation of food reward. The structure and phases in the model are summarized in Figure 1. The analysis presented here suggests that oscillatory changes in magnitude of synaptic transmission, long-term potentiation, and postsynaptic depolarization during theta rhythm cause transitions between two functional phases within each theta cycle. In the encoding phase, entorhinal input is dominant, activating cells in regions CA3 and CA1. At this time, excitatory connections arising from region CA3 have decreased transmission, but these same synapses show enhanced long-term potentiation to form associations between sensory events.
In the retrieval phase, entorhinal input is relatively weak but still brings retrieval cues into the network. At this time, excitatory synapses arising from region CA3 show strong transmission, allowing retrieval of the previously learned associations. During this retrieval phase, long-term potentiation must be absent in order to prevent encoding of the retrieval activity. Depotentiation or long-term depression (LTD) during this phase allows extinction of prior learned associations. We propose that these two phases appear in a continuous interleaved manner within each 100–300 msec cycle of the theta rhythm. Section 2 presents mathematical analysis showing how theta oscillations could play an important role in reversal of learned associations. First, we describe the assumptions for modeling synaptic transmission, synaptic modification, and phasic changes in these variables during theta rhythm. Then we describe the assumptions for modeling patterns of activity representing sensory features of the behavioral task in the T-maze. Subsequently, we present a criterion for evaluating the function of the model, in the form of a performance measure based on reversal of learned associations in the
T-maze. Finally, we show how specific phase relationships between network variables provide the best performance of the model. These phase relationships correspond to those observed with physiological recording during hippocampal theta rhythm.

2 Mathematical Analysis of Theta Rhythm
We modeled hippocampal activity when a rat is in one of the two arms of a T-maze during individual trials performed during different stages of a reversal task. A trial refers to the period from the time the rat reaches the end of an arm to its removal from the arm. This period can cover several cycles of theta. A stage of the task refers to multiple trials in which the spatial response of the rat and the food reward location are the same. We represent the pattern of activation of pyramidal cells in region CA1 with the vector $a_{CA1}(t)$. (See the appendix for a glossary of mathematical terms.) As shown in Figures 1 and 2, the pattern of activity in CA1 is influenced by two main inputs. First is the synaptic input from region CA3, which is a product of the strength of modifiable Schaffer collateral synapses $W_{CA3}(t)$ and the activity of CA3 pyramidal cells $a_{CA3}(t)$. The activity pattern in CA3 remains constant across trials within a single behavioral stage of the task but changes in different stages depending on the spatial response of the rat. The weight matrix remains constant within each trial but can change at the end of the trial. Second is the synaptic input from entorhinal cortex, which is a product of the strength of perforant path synapses $W_{EC}$ (which are not potentiated here) and the activity of projection neurons in layer III of entorhinal cortex $a_{EC}(t)$. The activity pattern in entorhinal cortex also remains constant within each stage of the task but changes in different stages depending on the location of food reward. The weight matrix $W_{EC}$ does not change in this model. Individual elements of the vectors $a_{CA3}(t)$ and $a_{EC}(t)$ assume only two values, zero or $q$, to represent presence or absence of spiking activity in each neuron. However, these influences make the activity of neurons in region CA1, $a_{CA1}(t)$, take a range of values.
The following equation represents the effects of synaptic transmission on region CA1 activity:

$$a_{CA1}(t) = W_{EC}\, a_{EC}(t) + W_{CA3}(t)\, a_{CA3}(t). \qquad (2.1)$$
The structure of the model is summarized in Figures 2 and 3.

2.1 Phasic Changes in Synaptic Input During Theta. This article focuses on oscillatory modulation of equation 2.1 to represent phasic changes during theta rhythm. Experiments have shown phasic changes in the amount of synaptic input in stratum radiatum of region CA1 (Rudell & Fox, 1984; Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995; Wyble, Linster, & Hasselmo, 2000). These changes in amount of synaptic input could
depend on the magnitude of transmitter release or on other factors, such as the firing rate of afferent neurons. Both factors have the same functional influence in the current model and are represented as follows. The synaptic weights, $W_{CA3}(t)$, are multiplied by a sine wave function, $\theta_{CA3}(t)$, which has phase $\varphi_{CA3}$ and is scaled to vary between 1 and $(1 - X)$, as shown in equation 2.2:

$$\theta_{CA3}(t) = (X/2)\sin(t + \varphi_{CA3}) + (1 - X/2). \qquad (2.2)$$
The parameter $X$ represents the magnitude of change in synaptic currents in different layers and was constrained to values between 0 and 1 ($0 < X < 1$). The simulation also includes phasic changes in the magnitude of entorhinal input, $\theta_{EC}(t)$, as shown in equation 2.3:

$$\theta_{EC}(t) = (X/2)\sin(t + \varphi_{EC}) + (1 - X/2). \qquad (2.3)$$
These changes in entorhinal input are consistent with current source density (CSD) data on stratum lacunosum-moleculare (s. l-m) in region CA1 (Brankack et al., 1993; Bragin et al., 1995) and phasic activity of entorhinal neurons (Stewart, Quirk, Barry, & Fox, 1992). The analysis presented here shows how the best performance of the model is obtained with the relative phases of these variables shown in Figures 1 and 3. With inclusion of these phasic changes, equation 2.1 becomes

$$a_{CA1}(t) = \theta_{EC}(t)\, W_{EC}\, a_{EC}(t) + \theta_{CA3}(t)\, W_{CA3}(t)\, a_{CA3}(t). \qquad (2.4)$$
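As a concrete illustration, equations 2.2 through 2.4 can be sketched in a few lines of numpy. This is our own minimal sketch, not code from the original model: the names `theta_mod` and `ca1_activity` and the two-cell toy patterns are assumptions made here for illustration.

```python
import numpy as np

def theta_mod(t, phi, X=1.0):
    """Oscillatory gain on synaptic transmission (equations 2.2 and 2.3):
    a sine wave with phase phi, scaled to vary between 1 and 1 - X."""
    return (X / 2.0) * np.sin(t + phi) + (1.0 - X / 2.0)

def ca1_activity(t, W_EC, a_EC, W_CA3, a_CA3, phi_EC, phi_CA3, X=1.0):
    """Region CA1 activity (equation 2.4): entorhinal and CA3 synaptic
    input, each multiplied by its phasic theta modulation."""
    return (theta_mod(t, phi_EC, X) * (W_EC @ a_EC)
            + theta_mod(t, phi_CA3, X) * (W_CA3 @ a_CA3))

# Toy network: two CA1 cells, binary input vectors (entries 0 or q, q = 1).
a_EC = np.array([1.0, 0.0])   # entorhinal input pattern
a_CA3 = np.array([0.0, 1.0])  # CA3 place-cell pattern
W_EC = np.eye(2)              # unmodified perforant path weights
W_CA3 = np.eye(2)             # Schaffer collateral weights

# With phi_EC = 0 and phi_CA3 = pi, the two pathways reach maximum gain
# on opposite phases of the theta cycle.
a_peak_EC = ca1_activity(np.pi / 2, W_EC, a_EC, W_CA3, a_CA3, 0.0, np.pi)
```

At $t = \pi/2$ the entorhinal gain is at its maximum of 1 while the CA3 gain is at its minimum of $1 - X$, so with $X = 1$ only the entorhinal pattern drives CA1.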
2.2 Phasic Changes in LTP During Theta. Long-term potentiation in the hippocampus has repeatedly been shown to depend on the phase of stimulation relative to theta rhythm in the dentate gyrus (Pavlides, Greenstein, Grudman, & Winson, 1988; Orr, Rao, Stevenson, Barnes, & McNaughton, 1999) and region CA1 (Huerta & Lisman, 1993; Holscher et al., 1997; Wyble et al., 2001). In region CA1, experiments have shown that LTP can be induced by stimulation on the peak, but not the trough, of the theta rhythm recorded in stratum radiatum in slice preparations (Huerta & Lisman, 1993), urethane-anesthetized rats (Holscher et al., 1997), and awake rats (Wyble et al., 2001). Note that theta recorded locally in stratum radiatum will be about 180 degrees out of phase with theta recorded at the fissure. Based on these data, the rate of synaptic modification of the Schaffer collateral synapses $W_{CA3}(t)$ in the model was modulated in an oscillatory manner given by

$$\theta_{LTP}(t) = \sin(t + \varphi_{LTP}). \qquad (2.5)$$
This long-term modification requires postsynaptic depolarization in region CA1, $a_{CA1}(t)$, and presynaptic spiking activity from region CA3, $a_{CA3}(t)$. For simplicity, we assume that the synaptic weights during each new trial
Figure 3: Summary of experimental data on changes in synaptic potentials and long-term potentiation at different phases of the theta rhythm. Dotted lines indicate the phase of theta with the best dynamics for encoding (E) and the phase with the best dynamics for retrieval (R). Entorhinal input: Synaptic input from entorhinal cortex in stratum lacunosum-moleculare is strongest at the peak of theta recorded in stratum radiatum, equivalent to the trough of theta recorded at the hippocampal fissure (Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995). Soma: Intracellular recording in vivo shows depolarization of the soma in phase with CA3 input (Kamondi et al., 1998). However, dendritic depolarization is in phase with entorhinal input (Kamondi et al., 1998). LTP: Long-term potentiation of synaptic potentials in stratum radiatum is strongest at the peak of local stratum radiatum theta rhythm in slice preparations (Huerta & Lisman, 1993) and urethane-anesthetized rats (Holscher et al., 1997). Strong entorhinal input and strong LTP in stratum radiatum are associated with the encoding phase (E). CA3 input: Synaptic potentials in stratum radiatum are strongest at the trough of the local stratum radiatum theta rhythm, as demonstrated with current source density analysis (Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995) and studies of evoked potentials (Rudell & Fox, 1984; Wyble et al., 2000). Strong synaptic transmission in stratum radiatum is present during the retrieval phase (R).
remain constant at their initial values and change only at the end of the trial. This is consistent with delayed expression of LTP after its initial induction. We also assume that synapses do not grow beyond a maximum strength. We can compute the change in synaptic weight at the end of a single behavioral trial $m+1$, which depends on the integral of activity in the network during that trial (from the time at the start of that trial, $t_m$, to the time at the end of that trial, $t_{m+1}$). For mathematical simplicity, we integrate across full cycles of theta:

$$\Delta W_{CA3}(t_{m+1}) = \int_{t_m}^{t_{m+1}} \theta_{LTP}(t)\, a_{CA1}(t)\, a_{CA3}(t)^T\, dt. \qquad (2.6)$$
The dependence on pre- and postsynaptic activity could be described as Hebbian (Gustafsson, Wigstrom, Abraham, & Huang, 1987). However, the additional oscillatory components of the equation go beyond strict Hebbian properties and can cause the synaptic weight to weaken if pre- and postsynaptic activity occurs during a negative phase of the oscillation. This article focuses on how the quality of retrieval performance in region CA1 depends on the phase relationship of the oscillations in synaptic transmission (in equation 2.4) with the oscillations in synaptic modification (in equation 2.6). Combining these equations gives us

$$\Delta W_{CA3}(t_{m+1}) = \int_{t_m}^{t_{m+1}} \theta_{LTP}(t) \left[ \theta_{EC}(t)\, W_{EC}\, a_{EC}(t) + \theta_{CA3}(t)\, W_{CA3}(t_m)\, a_{CA3}(t) \right] a_{CA3}(t)^T\, dt. \qquad (2.7)$$
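The sign behavior just described can be checked numerically. The sketch below is our own illustration (with $W_{EC}$ taken as the identity and hypothetical two-cell patterns): it integrates equation 2.7 over one theta cycle and shows that entorhinal input arriving in phase with the LTP oscillation strengthens the corresponding weight, while the same input arriving on the negative phase weakens it.

```python
import numpy as np

def dw_eq27(a_EC, a_CA3, W_CA3, phi_LTP, phi_EC, phi_CA3, X=1.0, n=2000):
    """One-trial weight change, equation 2.7, integrated numerically over
    a single theta cycle (t from 0 to 2*pi), with W_EC as the identity."""
    ts = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    dt = 2.0 * np.pi / n
    dW = np.zeros_like(W_CA3)
    for t in ts:
        ltp = np.sin(t + phi_LTP)                            # eq. 2.5
        g_ec = (X / 2) * np.sin(t + phi_EC) + (1 - X / 2)    # eq. 2.3
        g_ca3 = (X / 2) * np.sin(t + phi_CA3) + (1 - X / 2)  # eq. 2.2
        a_CA1 = g_ec * a_EC + g_ca3 * (W_CA3 @ a_CA3)        # bracket of eq. 2.7
        dW += ltp * np.outer(a_CA1, a_CA3) * dt
    return dW

a_EC, a_CA3 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
W0 = np.zeros((2, 2))
# EC input in phase with LTP strengthens the EC-driven association ...
dw_in_phase = dw_eq27(a_EC, a_CA3, W0, 0.0, 0.0, np.pi)[0, 1]
# ... while EC input on the negative phase of the LTP oscillation weakens it.
dw_anti_phase = dw_eq27(a_EC, a_CA3, W0, 0.0, np.pi, np.pi)[0, 1]
```

With the gain functions of equations 2.2 and 2.3, the in-phase change works out analytically to $(X/2)\pi$ per cycle, and the anti-phase change to $-(X/2)\pi$.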
Note that we allow theta-phasic modulation of $W_{EC}$, but we are not modeling Hebbian synaptic modification of $W_{EC}$. Therefore, after this point, we will not include that matrix of connectivity but will represent patterns of input from entorhinal cortex with single static vectors, as if the synapses were an identity matrix.

2.3 Modeling Learning of Associations in a Reversal Task. The mathematical analysis focuses on the performance of the network in the T-maze reversal task, in which learning of an association between one location and food must be extinguished and replaced by a new association. Our analysis focuses on synaptic changes occurring during initial learning of an association (e.g., between left arm and food), subsequent extinction of this association when it is invalid, and encoding of a new association (e.g., between right arm and food). Thus, we focus on the mnemonic aspects of this task. Changes in the synaptic matrix $W_{CA3}(t)$ include the learning during three sequential stages of the task, whereas for the analysis of performance,
a fourth stage is also needed. These stages are described as follows:

1. Initial learning. During this period, the rat runs to the left arm of the T-maze and finds food reward in that arm (see Figure 2, A1 and B1). Synaptic modification encodes associations between left arm place (represented by CA3 activity) and left arm food reward (represented by entorhinal activity) on the $n$ trials, with time ranging from $t = 0$ to $t = t_n$. During this behavioral stage, activity in CA3 and EC is constant. We use the superindex $(n)$ to indicate the activity during all $n$ trials during this stage in region CA3 $\left(a_{CA3}^{(n)}\right)$ and in entorhinal cortex $\left(a_{EC}^{(n)}\right)$.
2. Reversal learning with erroneous responses. After stage 1, the food reward is changed to the right arm. In the initial stage after this reversal, the rat runs to the left arm place but does not find the food reward (see Figure 2, A2 and B2). This causes extinction or weakening of the associations between left arm place and left arm food reward. We denote with the superindex $(e)$ the activity during this stage. The left arm place representation in CA3 is the same as during initial learning, and for simplicity we assume the CA3 activity vector for initial trials has unity dot product with the activity vector for erroneous reversal trials: $a_{CA3}^{(n)T} a_{CA3}^{(e)} = 1$. Because the rat does not encounter food reward on erroneous trials, the EC activity vector representing food reward is zero: $a_{EC}^{(e)} = 0$. This period contains $e$ trials and ends at $t_{n+e}$. This stage could vary in number of trials in individual rats.
3. Reversal learning with correct responses. During this period, the rat runs to the right arm and finds food (see Figure 2, A3 and B3). In this stage, the superindex will be $(c)$. Synaptic modification encodes associations between right arm place $a_{CA3}^{(c)}$ and right arm food reward $a_{EC}^{(c)}$. This period contains $c$ trials and ends at $t_{n+e+c}$. Note that we make the natural assumption that the CA3 activity representing left arm place cell responses during initial and erroneous trials has a zero dot product with the right arm place cell responses during correct reversal trials: $a_{CA3}^{(n)T} a_{CA3}^{(c)} = a_{CA3}^{(e)T} a_{CA3}^{(c)} = 0$. We also assume that the entorhinal input for food in the right arm has zero dot product with the input for food in the left arm $\left(a_{EC}^{(c)T} a_{EC}^{(n)} = 0\right)$ and that the dot product of food reward input from entorhinal cortex for the same food location is unity $\left(a_{EC}^{(c)T} a_{EC}^{(c)} = a_{EC}^{(n)T} a_{EC}^{(n)} = 1\right)$.

4. Retrieval at choice point on subsequent trials. The analysis of performance in the network focuses on retrieval activity at the choice point of the T-maze (see Figure 2, A4 and B4). This stage is described further in section 2.4.
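For concreteness, the dot-product assumptions across these stages can be satisfied by hypothetical one-hot patterns (with $q = 1$). The specific vectors below are our own minimal choice for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical one-hot activity patterns (q = 1) meeting the stated
# dot-product assumptions for the stages of the reversal task.
a_CA3_n = np.array([1.0, 0.0])   # left-arm place cells, initial (n) trials
a_CA3_e = a_CA3_n.copy()         # same place on error (e) trials
a_CA3_c = np.array([0.0, 1.0])   # right-arm place cells, correct (c) trials
a_CA3_r = a_CA3_n + a_CA3_c      # choice-point pattern, retrieval (r) trial

a_EC_n = np.array([1.0, 0.0])    # left-arm food reward input
a_EC_e = np.zeros(2)             # no reward encountered on error trials
a_EC_c = np.array([0.0, 1.0])    # right-arm food reward input
```

These satisfy $a_{CA3}^{(n)T} a_{CA3}^{(e)} = 1$ and $a_{CA3}^{(n)T} a_{CA3}^{(c)} = 0$, and the retrieval pattern overlaps both arm representations, as required by the choice-point analysis in section 2.4.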
The association between left arm place and food reward, learned on the initial $n$ trials (see Figure 2, A1 and B1), is represented by the matrix $W_{CA3}(t_n)$. Each entry of this matrix does not grow beyond a maximum, which is a factor of $K$ times the pre- and postsynaptic activity. This matrix represents the association between left arm place cells activated during these trials in region CA3 $\left(a_{CA3}^{(n)}\right)$ and food reward representation cells in region CA1 activated by entorhinal input $\left(a_{CA1} = a_{EC}^{(n)}\right)$. At the end of the $n$ trials, this matrix becomes:

$$W_{CA3}(t_n) = K\, a_{EC}^{(n)} a_{CA3}^{(n)T}. \qquad (2.8)$$
After reversal, the extinction of the association between left arm and reward occurs on erroneous trials (see Figure 2, A2 and B2). The learning during each trial is represented by summing up integrals of equation 2.7 over each trial from $n+1$ to $n+e$. As noted above, the lack of reward is represented by an absence of entorhinal input $\left(a_{EC}^{(e)} = 0\right)$. However, retrieval of prior learning can occur during these erroneous trials due to left arm food representations activated by the spread of activity from left arm place cells in region CA3 over the matrix (see equation 2.8) from the first $n$ trials (recall that $a_{CA3}^{(n)T} a_{CA3}^{(e)} = 1$). This appears in the following equation:

$$W_{CA3}(t_{n+e}) = W_{CA3}(t_n) + \sum_{k=n+1}^{n+e} \int_{t_k}^{t_{k+1}} \theta_{LTP}(t) \left[ \theta_{CA3}(t)\, W_{CA3}(t_k)\, a_{CA3}^{(e)} \right] a_{CA3}^{(e)T}\, dt. \qquad (2.9)$$
In contrast with incorrect trials, correct trials after reversal are indexed with the numbers between $n+e+1$ and $n+e+c$ (see Figure 2, A3 and B3), where $c$ represents correct postreversal trials. The desired learning during each correct trial consists of the association between place cells active in region CA3 when the rat is at the end of the right arm $\left(a_{CA3}^{(c)}\right)$ and the food representation activated by entorhinal input $\left(a_{EC}^{(c)}\right)$. As noted above, $a_{CA3}^{(n)T} a_{CA3}^{(c)} = 0$, so the spread across connections modified in initial learning is $W_{CA3}(t_n)\, a_{CA3}^{(c)} = K\, a_{EC}^{(n)} a_{CA3}^{(n)T} a_{CA3}^{(c)} = 0$. We assume these trials occur sequentially after the error trials described above. However, with the assumption of nonoverlapping place representations in CA3, even if the erroneous and correct trials were intermixed, the same results would apply. The learning on these correct trials is

$$W_{CA3}(t_{n+e+c}) = W_{CA3}(t_{n+e}) + \sum_{k=n+e+1}^{n+e+c} \int_{t_k}^{t_{k+1}} \theta_{LTP}(t) \left[ \theta_{EC}(t)\, a_{EC}^{(c)} + \theta_{CA3}(t)\, W_{CA3}(t_k)\, a_{CA3}^{(c)} \right] a_{CA3}^{(c)T}\, dt. \qquad (2.10)$$
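Equations 2.8 through 2.10 can be run end to end as a small simulation. The sketch below is our own illustration, under simplifying assumptions (one error trial, one correct trial, one theta cycle per trial, $W_{EC}$ as the identity, hypothetical one-hot patterns): it shows the old left-arm weight weakening on the error trial and the new right-arm association forming on the correct trial.

```python
import numpy as np

def trial_update(W, a_EC_in, a_CA3_in, phi_LTP, phi_EC, phi_CA3,
                 X=1.0, n_steps=2000):
    """Weight change over one theta cycle for one trial (the integrand of
    equations 2.9 and 2.10): LTP gain times CA1 activity times presynaptic
    CA3 activity, with entorhinal weights treated as the identity."""
    ts = np.linspace(0.0, 2.0 * np.pi, n_steps, endpoint=False)
    dt = 2.0 * np.pi / n_steps
    dW = np.zeros_like(W)
    for t in ts:
        ltp = np.sin(t + phi_LTP)
        g_ec = (X / 2) * np.sin(t + phi_EC) + (1 - X / 2)
        g_ca3 = (X / 2) * np.sin(t + phi_CA3) + (1 - X / 2)
        a_CA1 = g_ec * a_EC_in + g_ca3 * (W @ a_CA3_in)
        dW += ltp * np.outer(a_CA1, a_CA3_in) * dt
    return dW

K = 1.0
a_CA3_left, a_CA3_right = np.array([1.0, 0.0]), np.array([0.0, 1.0])
a_EC_left, a_EC_right = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Stage 1 (equation 2.8): initial left-arm association at maximum strength K.
W = K * np.outer(a_EC_left, a_CA3_left)
w_old_before = W[0, 0]

# Stage 2 (equation 2.9): one error trial, no entorhinal food input.
phi_LTP, phi_EC, phi_CA3 = 0.0, 0.0, np.pi   # phases favored by the analysis
W = W + trial_update(W, np.zeros(2), a_CA3_left, phi_LTP, phi_EC, phi_CA3)
w_old_after = W[0, 0]   # extinction; a single trial can overshoot below zero

# Stage 3 (equation 2.10): one correct trial in the right arm.
W = W + trial_update(W, a_EC_right, a_CA3_right, phi_LTP, phi_EC, phi_CA3)
w_new = W[1, 1]
```

With CA3 input 180 degrees out of phase with LTP, the error trial drives the old weight down; with entorhinal input in phase with LTP, the correct trial builds the new association (here to $\pi/2$ per cycle).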
2.4 Performance Measure. The behavior of the network was analyzed with a performance measure $M$, which measures how much the network retrieves the new association between right arm and food reward relative to retrieval of the old association between left arm and food reward. We focus on a retrieval trial $r$, with place cell activity in region CA3 given by $a_{CA3}^{(r)}$, but before the rat has encountered the food reward, so that no food input enters the network $\left(a_{EC}^{(r)} = 0\right)$. For the performance measure, we focus on retrieval of memory when the rat is at the choice point (the T-junction). At the choice point, activity spreads along associations between place cells within CA3, such that the place cells in the left and right arms are activated. These place cells can then activate any associated food representation in region CA1 (see Figure 2, A4 and B4). Thus, we use an activity pattern in region CA3, $a_{CA3}^{(r)}$, which overlaps with the region CA3 place activity for both the left arm $\left(a_{CA3}^{(n)T} a_{CA3}^{(r)} = 1\right)$ and the right arm $\left(a_{CA3}^{(c)T} a_{CA3}^{(r)} = 1\right)$. This assumes that activity dependent on future expectation can be induced when the rat is still in the stem or at the T-junction. This is consistent with evidence for selectivity of place cells dependent on future turning response (Wood, Dudchenko, Robitsek, & Eichenbaum, 2000; Frank, Brown, & Wilson, 2000). Note that when the rat is at the choice point on trial $r$ (see Figure 2, A4 and B4), the CA1 activity will be represented by an equation that uses the weight matrix shown in equation 2.10:

$$a_{CA1}(t) = \theta_{EC}(t)\, a_{EC}^{(r)} + \theta_{CA3}(t)\, W_{CA3}(t_{n+e+c})\, a_{CA3}^{(r)}. \qquad (2.11)$$
The performance measure then takes the retrieval activity $a_{CA1}(t)$ and evaluates its similarity (dot product) with the food representation on the correct reversal trial $a_{EC}^{(c)}$ minus its similarity (dot product) with the now-incorrect memory of food from initial trials $a_{EC}^{(n)}$. Due to the oscillations of CA3 input to CA1 during this period, we use the maximum, $\max[\,]$, of this measure as an indicator of performance, but similar results would be obtained with integration over time during retrieval. In the model, the pattern of food reward activity in region CA1, evaluated during the choice point period on this trial $r$, is taken as guiding the choice on this trial (the choice will depend on the relative strength of the activity representing left arm food reward versus the strength of activity representing right arm food reward). Thus, this equation measures the tendency that the response guided by region CA1 activity on this trial will depend on memory of associations from correct postreversal trials $a_{EC}^{(c)}$ (left side of the equation) versus memory for prior strongly learned associations on initial trials $a_{EC}^{(n)}$ (right side of the equation):

$$M = \max_{t \in (t_r, t_{r+e})} \left[ \left(a_{EC}^{(c)}\right)^T a_{CA1}(t) - \left(a_{EC}^{(n)}\right)^T a_{CA1}(t) \right]. \qquad (2.12)$$
To see how this measure of behavior changes depending on the phase of oscillatory variables, we can replace the matrices in equation 2.11 with the components from equations 2.8 through 2.10 and obtain the postsynaptic activity for inclusion in equation 2.12. For the purpose of simplification, we will consider retrieval after a single error trial and a single correct trial. This allows removal of the summation signs and use of the end point of a single error trial ($t_{n+e} = t_{n+1}$) and the end point of a single correct trial ($t_{n+e+c} = t_{n+2}$). We move the constant values outside the integrals. In simplifying this equation, we use the assumptions about activity in different stages itemized at the start of sections 2.3 and 2.4. These assumptions allow the performance measure to be reduced to an interaction of the oscillatory terms:

$$M = \max_{t \in (t_r, t_{r+e})} \left\{ \theta_{CA3}(t) \left[ \int_{t_{n+e}}^{t_{n+e+c}} \theta_{LTP}(t)\, \theta_{EC}(t)\, dt - K - \int_{t_n}^{t_{n+e}} \theta_{LTP}(t)\, \theta_{CA3}(t)\, dt \right] \right\}. \qquad (2.13)$$
Similarly, we can consider the analytical conditions that would allow the previously learned associations to be erased. To accomplish this during the error stage, we desire that the previously strengthened connections should go toward zero, so we need

$$\int_{t_k}^{t_{k+1}} \theta_{LTP}(t) \left[ \theta_{CA3}(t)\, W_{CA3}(t_k)\, a_{CA3}^{(e)} \right] a_{CA3}^{(e)T}\, dt < 0.$$

If we ignore the constant vectors and focus on a single theta cycle, we obtain

$$\int_0^{2\pi} \sin(t + \varphi_{LTP})\, (X/2) \sin(t + \varphi_{CA3})\, dt + \int_0^{2\pi} \sin(t + \varphi_{LTP})\, (1 - X/2)\, dt.$$

This gives

$$(X/4) \int_0^{2\pi} \left[ \cos(\varphi_{LTP} - \varphi_{CA3}) - \cos(2t + \varphi_{CA3} + \varphi_{LTP}) \right] dt = (X/2)\,\pi \cos(\varphi_{LTP} - \varphi_{CA3}).$$
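This closed form is easy to verify numerically. The helper below is our own check, not code from the paper: it integrates the product of the LTP oscillation and the CA3 modulation over one cycle and compares the result with $(X/2)\pi\cos(\varphi_{LTP} - \varphi_{CA3})$.

```python
import numpy as np

def ltp_ca3_cycle_integral(phi_LTP, phi_CA3, X=1.0, n=20000):
    """Numerically integrate sin(t + phi_LTP) * theta_CA3(t) over one
    theta cycle, where theta_CA3(t) = (X/2) sin(t + phi_CA3) + (1 - X/2)."""
    ts = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    g = (X / 2) * np.sin(ts + phi_CA3) + (1.0 - X / 2)
    return float(np.sum(np.sin(ts + phi_LTP) * g) * (2.0 * np.pi / n))

def closed_form(phi_LTP, phi_CA3, X=1.0):
    # The result derived in the text: (X/2) * pi * cos(phi_LTP - phi_CA3).
    return (X / 2) * np.pi * np.cos(phi_LTP - phi_CA3)
```

For example, with the two phases 180 degrees apart the integral equals $-(X/2)\pi$, so repeated pre- and postsynaptic coincidence under that phase relationship drives the previously strengthened weight downward.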
This expression will be negative for

$$\frac{\pi}{2} < \varphi_{LTP} - \varphi_{CA3} < \frac{3\pi}{2}.$$

Thus, for the old synapses to decay, the oscillations of CA3 input and LTP should be out of phase. A similar analysis shows that the synapses strengthened by the correct association in equation 2.10 will grow for

$$-\frac{\pi}{2} < \varphi_{LTP} - \varphi_{EC} < \frac{\pi}{2},$$

that is, if the entorhinal input is primarily in phase with synaptic modification. These same techniques can be used to solve equation 2.13, which over a single theta cycle becomes

$$M = (X/2)\,\pi \cos(\varphi_{LTP} - \varphi_{EC}) - K - (X/2)\,\pi \cos(\varphi_{LTP} - \varphi_{CA3}). \qquad (2.14)$$

Figure 4 shows this equation for $X = K = 1$. Note that with changes in the value of $X$ (the magnitude of synaptic transmission in equations 2.2 and 2.3), the performance measure changes amplitude but has the same overall shape of dependence on the relative phase of network variables. Increases in the variable $M$ correspond to more retrieval of correct associations between the correct right arm location and food after the reversal and decreased retrieval of incorrect associations between the left arm location and food. As illustrated in Figure 4, this function is maximal when $\theta_{LTP}(t)$ and $\theta_{CA3}(t)$ have a phase difference of 180 degrees ($\pi$) and $\theta_{LTP}(t)$ and $\theta_{EC}(t)$ have a phase difference near zero. In this framework, the peak phase of $\theta_{LTP}(t)$ could be construed as an encoding phase, and the peak phase of $\theta_{CA3}(t)$ could be seen as a retrieval phase. These are the phase relationships shown in Figures 1 and 3. This analysis shows how a particular behavioral constraint, the ability to reverse a learned behavior, requires specific phase relationships between physiological variables during theta rhythm.

2.5 Relation Between Theoretical Phases and Experimental Data. The phase relationship between oscillating variables that provides the best performance in equation 2.14 (see Figure 4) corresponds to the phase relationships observed in physiological experiments on synaptic currents (Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995) and LTP (Huerta & Lisman, 1993; Holscher et al., 1997; Wyble et al., 2001), as summarized in Figures 1 and 3.
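The phase dependence in equation 2.14 can also be scanned directly. The short script below is our own sketch of the Figure 4 computation (with $\varphi_{LTP}$ fixed at zero): it evaluates $M$ over a grid of entorhinal and CA3 phases and recovers the stated optimum.

```python
import numpy as np

def performance(phi_EC, phi_CA3, phi_LTP=0.0, X=1.0, K=1.0):
    """Equation 2.14: retrieval performance as a function of the phases
    of entorhinal input, CA3 input, and LTP."""
    return ((X / 2) * np.pi * np.cos(phi_LTP - phi_EC)
            - K
            - (X / 2) * np.pi * np.cos(phi_LTP - phi_CA3))

# Scan a 360 x 360 grid of phase offsets over one full cycle.
phases = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
grid = np.array([[performance(p_ec, p_ca3) for p_ca3 in phases]
                 for p_ec in phases])
i_ec, i_ca3 = np.unravel_index(np.argmax(grid), grid.shape)
best_ec_phase, best_ca3_phase = phases[i_ec], phases[i_ca3]
```

The maximum falls at an entorhinal phase of zero (in phase with LTP) and a CA3 phase of $\pi$ (180 degrees out of phase with LTP), matching Figure 4.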
In particular, our model suggests that the period of strongest potentiation at excitatory synapses arising from region CA3 should correspond to the period of weakest synaptic input at these same synapses. These synapses terminate in stratum radiatum of region CA1. Experimental data have shown that stimulation of stratum radiatum best induces LTP at the peak of local theta in urethane-anesthetized rats (Holscher et al., 1997).
Figure 4: Performance of the network depends on the relative phase of oscillations in synaptic transmission and long-term potentiation. This graph shows how performance in T-maze reversal changes as a function of the phase difference between LTP and synaptic input from entorhinal cortex (EC phase) and of the phase difference between LTP and synaptic input from region CA3 (CA3 phase). This is a graphical representation of equation 2.14. Best performance occurs when EC input is in phase (zero degrees) with LTP but region CA3 synaptic input is 180 degrees out of phase with LTP.
To analyze whether this corresponds to the time of weakest synaptic transmission, we have reanalyzed recent data on the size of evoked synaptic potentials during theta rhythm in urethane-anesthetized rats (Wyble et al., 2000), using the slope of stratum radiatum theta measured directly before each evoked potential. As shown in Figure 5, the smallest slope of evoked potentials was noted near the peak of stratum radiatum theta, which is the time of peak LTP induction (Holscher et al., 1997). Consistent with this time of weakest evoked synaptic transmission, previous studies show the smallest synaptic currents in stratum radiatum at this phase of theta (Brankack et al., 1993). The analysis presented here provides a functional rationale for this paradoxical phase difference between the magnitude of synaptic input and the magnitude of long-term potentiation in stratum radiatum. Thus, the requirement for best performance in the model is consistent with physiological data showing that LTP of the CA3 input is strongest
Figure 5: Data showing the magnitude of the initial slope of evoked synaptic potentials (EPSP slope) in stratum radiatum of hippocampal region CA1 relative to the slope of the local theta field potential (local EEG slope) recorded from the same electrode just before recording of the evoked potential. These data are reanalyzed from a previous study correlating slope of potentials with hippocampal fissure theta (Wyble et al., 2000). The data demonstrate that the initial slope of evoked potentials is smallest near the peak of local theta, the same phase when induction of long-term potentiation is maximal (Holscher et al., 1997; Wyble et al., 2001).
when synaptic input at these synapses is weakest. The time of best LTP induction in stratum radiatum is also the time of least pyramidal cell spiking in region CA1 (Fox, Wolfson, & Ranck, 1986; Skaggs et al., 1996). The lack of pyramidal cell spiking could result from the fact that the pyramidal cell soma is hyperpolarized during this phase, while the proximal dendrites are simultaneously depolarized (Kamondi, Acsady, Wang, & Buzsaki, 1998). This suggests that Hebbian synaptic modification at the Schaffer collaterals during the peak of the theta rhythm in stratum radiatum depends on postsynaptic depolarization caused by perforant pathway input from entorhinal cortex rather than on associative input from the Schaffer collaterals, and should not require somatic spiking activity. This is consistent with data on NMDA receptor responses to local depolarization and Hebbian long-term potentiation in the absence of spiking (Gustafsson et al., 1987). Experimental data already show that LTP is not induced at the trough of theta rhythm recorded in stratum radiatum (Huerta & Lisman, 1993; Holscher et al., 1997). In fact, consistent with the process of extinction of prior learning during retrieval, stimulation at the trough of local theta results
A Proposed Function for Hippocampal Theta Rhythm
809
in LTD (Huerta & Lisman, 1995) or depotentiation (Holscher et al., 1997). These data suggest that some process must suppress Hebbian long-term potentiation in this phase of theta, because this is the phase of strongest Schaffer collateral input (Brankack et al., 1993) and maximal CA1 spiking activity (Fox et al., 1986; Skaggs et al., 1996). The lack of LTP may result from lack of dendritic depolarization in this phase (Kamondi et al., 1998). As shown in Figure 6, the loss of theta rhythm after fornix lesions could underlie the impairment of reversal behavior in the T-maze caused by these lesions (M'Harzi et al., 1987). This figure summarizes how separate phases of encoding and retrieval allow effective extinction of the initially learned association and learning of the new association, whereas loss of theta results in retrieval of the initially learned association during encoding, which prevents extinction of this association.

2.6 Changes in Postsynaptic Membrane Potential During Theta. The above results extend previous proposals that separate encoding and retrieval phases depend on strength of synaptic input (Hasselmo, Wyble, & Wallenstein, 1996; Wallenstein & Hasselmo, 1997; Sohal & Hasselmo, 1998). This same model can also address the hypothesis that encoding and retrieval can be influenced by phasic changes in the membrane potential of the dendrite (Hasselmo et al., 1996; Paulsen & Moser, 1998) and soma (Paulsen & Moser, 1998). Data show that during theta rhythm, there are phasic changes in inhibition focused at the soma (Fox et al., 1986; Fox, 1989; Kamondi et al., 1998). This phenomenon can be modeled by assuming that the CA1 activity described above represents dendritic membrane potential in pyramidal cells (a^d_CA1), and a separate variable represents pyramidal cell soma spiking (a^s_CA1) (Golding & Spruston, 1998).
Spiking at the soma is influenced by input from the dendrites (a^d_CA1), multiplied by oscillatory changes in soma membrane potential (h_soma) induced by shunting inhibition:

a^s_CA1 = h_soma · a^d_CA1.    (2.15)
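The gating expressed by equation 2.15 can be illustrated with a short numerical sketch. The waveforms and phase values below are illustrative assumptions, not taken from the text; the point is that when the oscillatory soma gain is in antiphase with the dendritic depolarization driven by entorhinal input, depolarization (and hence NMDA-dependent plasticity) can peak while somatic spiking output stays near zero:

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 1000)
phi_ec, phi_ca3 = 0.0, np.pi                 # hypothetical phases: EC input antiphase to CA3 input
a_d = 1.0 + np.sin(t + phi_ec)               # dendritic depolarization driven by entorhinal input
h_soma = 1.0 + np.sin(t + phi_ca3)           # oscillatory soma gain set by shunting inhibition
a_s = h_soma * a_d                           # equation 2.15: soma spiking activity

i_enc = np.argmax(a_d)                       # index of peak dendritic (encoding) depolarization
print(h_soma[i_enc], a_s[i_enc])             # soma gain and output are near zero at that phase
```

Because a_s is the product of the two signals, dendritic depolarization alone crosses a plasticity threshold at this phase without driving somatic output.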
If these oscillatory changes in soma membrane potential are in phase with region CA3 input (h_CA3), this enhances the output of retrieval activity. If they are out of phase with entorhinal input (h_EC) and LTP (h_LTP), this will help prevent interference due to network output during encoding, allowing dendritic depolarization to activate NMDA channels without causing spiking activity at the soma. This corresponds to the crossing of a threshold for synaptic modification without crossing a firing threshold for generating output. Use of a modification threshold enhances the influence of even relatively small values for the modulation of synaptic transmission (the parameter X in equations 2.2, 2.3, and 2.14). A modification threshold would select a segment of each cycle during which postsynaptic activity is above threshold. Outside of this segment, the modification rate would be zero. To simplify the analysis of the integrals, we will take the limiting case as
the modification threshold approaches 1. In this case, the postsynaptic component of potentiation becomes a delta function with amplitude one and phase corresponding to the time of the positive maximum of the sine wave for synaptic transmission, δ(t + φ_CA3 − π/2). The integral for one component of equation 2.13 then becomes:

∫_0^{2π} sin(t + φ_LTP) δ(t + φ_CA3 − π/2) dt = sin(π/2 + φ_LTP − φ_CA3)
                                              = cos(φ_LTP − φ_CA3).
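The sifting step above can be checked numerically. In this sketch (the phase values are arbitrary test inputs, not from the text), the delta function is approximated by a narrow normalized Gaussian and the integral is compared against the closed form cos(φ_LTP − φ_CA3):

```python
import numpy as np

def ltp_integral(phi_ltp, phi_ca3, width=1e-3, n=400001):
    """Approximate the integral of sin(t + phi_ltp) * delta(t - t0) over [0, 2*pi],
    with t0 = pi/2 - phi_ca3 taken modulo 2*pi (the modulation is periodic),
    using a narrow normalized Gaussian in place of the delta function."""
    t = np.linspace(0.0, 2.0 * np.pi, n)
    t0 = (np.pi / 2.0 - phi_ca3) % (2.0 * np.pi)
    delta = np.exp(-0.5 * ((t - t0) / width) ** 2) / (width * np.sqrt(2.0 * np.pi))
    return np.trapz(np.sin(t + phi_ltp) * delta, t)

phi_ltp, phi_ca3 = 0.4, 2.0                  # arbitrary test phases
print(ltp_integral(phi_ltp, phi_ca3))        # ≈ cos(phi_ltp - phi_ca3)
```

The agreement holds for any phase pair whose delta peak lies away from the interval boundary, since the Gaussian then integrates to one inside the domain.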
This results in a performance measure very similar to equation 2.14, with a maximum only slightly smaller than that in equation 2.14, but with no dependence on the magnitude of X:

M = cos(φ_LTP − φ_EC) − K − cos(φ_LTP − φ_CA3).    (2.16)
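A quick numerical scan over the two phase differences confirms where this performance measure peaks: LTP in phase with the entorhinal input and in antiphase with the CA3 input, independent of the constant. This is a minimal sketch; the value of K is chosen arbitrarily, since it shifts M without moving the optimum:

```python
import numpy as np

K = 0.5                                           # arbitrary constant offset
d_ec = np.linspace(-np.pi, np.pi, 721)            # phi_LTP - phi_EC
d_ca3 = np.linspace(-np.pi, np.pi, 721)           # phi_LTP - phi_CA3
M = np.cos(d_ec)[:, None] - K - np.cos(d_ca3)[None, :]   # equation 2.16 on a grid

i, j = np.unravel_index(np.argmax(M), M.shape)
print(d_ec[i], d_ca3[j])   # maximum at zero EC phase difference, ±pi CA3 phase difference
```

Because the two cosine terms are separable, the optimum is found independently in each phase difference, matching the phase relationships stated in the discussion.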
The disjunction between dendritic and somatic depolarization has been observed experimentally (Golding & Spruston, 1998; Kamondi et al., 1998; Paulsen & Moser, 1998). Phasic changes in disinhibition could underlie phasic changes in population spike induction in stratum pyramidale during theta (Rudell, Fox, & Ranck, 1980; Leung, 1984), though population spikes in anesthetized animals appear to depend more on strength of synaptic input (Rudell & Fox, 1984).

3 Discussion
The model presented here shows how theta rhythm oscillations could enhance the reversal of prior learning by preventing retrieval of previously encoded associations during new encoding. We developed expressions for the phasic changes in synaptic input and long-term potentiation during
Figure 6: Facing page. Schematic representation of the encoding and extinction of associations in the model during normal theta (with theta) and after loss of theta due to fornix lesions (no theta). Circles represent populations of units in region CA3 representing arm location (left = L, right = R), and populations in region CA1 activated by food location. Arrows labeled with F represent sensory input about food reward arriving from entorhinal cortex. Filled circles represent populations activated by afferent input from entorhinal cortex or by retrieval activity spreading from region CA3. Initial learning: During initial encoding trials, simultaneous activity of left arm place cells and food reward input causes strengthening of connections between CA3 and CA1 (thicker line). This association can be retrieved in both normal and fornix lesioned rats. Error after reversal: With normal theta, when food reward is not present, the region CA1 neuron is not activated during encoding, so no strengthening occurs, and during the retrieval phase, depotentiation causes the connection storing the association between left arm and food to weaken (thinner line). In contrast, without theta, the activity spreading along the strengthened connection causes postsynaptic activity, which allows strengthening and prevents weakening of the association between left arm and food. Correct reversal: With normal theta, a new stronger association is formed between right arm and food. Without theta, the new and old associations are similar in strength. Retrieval choice: At the choice point on subsequent trials, retrieval of the association between right arm and food dominates with normal theta, whereas without theta the associations have similar strength.
theta rhythm and for sensory input during different stages of behavior in a T-maze task. A performance measure evaluated memory for recent associations between location and food in this task by measuring how much retrieval activity resembles the new correct food location, as opposed to the initial (now incorrect) food location. As shown in Figure 4, the best memory for recent associations in this task occurs with the following phase relationships between oscillations of different variables: (1) sensory input from entorhinal cortex should be in phase (zero degrees difference) with oscillatory changes in long-term potentiation of the synaptic input from region CA3, and (2) retrieval activity from region CA3 should be out of phase (180 degrees difference) with the oscillatory changes in long-term potentiation at these same synapses arising from region CA3. As shown in Figures 1 and 3, this range of best performance appears consistent with physiological data on the phase dependence of synaptic transmission (Buzsaki et al., 1986; Brankack et al., 1993; Bragin et al., 1995; Wyble et al., 2000) and long-term potentiation (Huerta & Lisman, 1993; Holscher et al., 1997; Wyble et al., 2001). This model predicts that recordings during tasks requiring repetitive encoding and retrieval of stimuli should show specific phase relationships to theta. These tasks include delayed matching tasks and delayed alternation (Numan, Feloney, Pham, & Tieber, 1995; Otto & Eichenbaum, 1992; Hampson, Jarrard, & Deadwyler, 1999). In these tasks, hippocampal neurons should show spiking activity at the peak of local theta in stratum radiatum for information being encoded and spiking activity near the trough of local theta for information being retrieved.
This is consistent with theta phase precession (O'Keefe & Recce, 1993; Skaggs et al., 1996), in which the retrieval activity (before the animal enters its place field) appears at different phases of theta than the stimulus-driven encoding activity (after the animal has entered the place field). The encoding of new information will be most rapid if theta rhythm shifts phase to allow encoding dynamics during the time when new information is being processed. This is consistent with evidence that the theta rhythm shows phase locking to the onset of behaviorally relevant stimuli for stimulus acquisition (Macrides, Eichenbaum, & Forbes, 1982; Semba & Komisaruk, 1984; Givens, 1996). Phase locking could involve medial septal activity shifting the hippocampal theta into an encoding phase during the initial appearance of a behaviorally relevant stimulus, thereby speeding the process of encoding. In tasks requiring access to information encoded within the hippocampus, this information will become accessible during the retrieval phase, which could bias generation of responses to the retrieval phase. This requirement is consistent with the theta phase locking of behavioral response times (Buno & Velluti, 1977). The mechanisms described here could also be important for obtaining relatively homogeneous strength of encoding within a distributed environment (Kali & Dayan, 2000). In particular, the model contains an ongoing interplay between synaptic weakening during retrieval and synaptic
strengthening during encoding. Without a prior representation, retrieval does not occur and synaptic weakening is absent, allowing pure strengthening of relevant synapses. However, for highly familiar portions of the environment, considerable retrieval will occur during the trough of theta, which will balance the new encoding at the peak. With a change in environment (e.g., insertion of a barrier or removal of food reward), the retrieval of prior erroneous associations in the model would no longer be balanced by reencoding of these representations, and the erroneous representations could be unlearned. (If retrieval occurs without external input, a return loop from entorhinal cortex would be necessary to prevent loss of the memory.) Without the separation of encoding and retrieval on different phases of theta, memory for the response performed during a trial cannot be distinguished from retrieval of a previous memory used to guide performance during that trial. In addition to the model of reversal learning presented above, the same principles could account for impaired performance caused by medial septal lesions in tasks such as spatial delayed alternation (Numan & Quaranta, 1990) and delayed alternation of a go/no-go task (Numan et al., 1995), which place demands on the response memory system. In contrast, medial septal lesions do not impair performance of a task requiring working memory for a sensory stimulus (Numan & Klis, 1992). Fornix lesions also impair delayed nonmatch to sample in a spatial T-maze in rats (Markowska, Olton, Murray, & Gaffan, 1989) and monkeys (Murray, Davidson, Gaffan, Olton, & Suomi, 1989), where animals must remember their most recent response in the face of interference from multiple previous responses. Performance of a spatial delayed match to sample task is also impaired by hippocampal lesions (Hampson, Jarrard, et al., 1999). 
This analysis can guide further experiments testing the predictions about specific phase relationships required by behavioral constraints in reversal and delayed alternation tasks. These behavioral constraints are ethologically relevant; they would arise in any foraging behavior where food sites are fixed across some time periods but vary at longer intervals.

Appendix: Glossary of Mathematical Terms

a^(n)_EC = entorhinal activity. This is a fixed vector during each stage of the task. The stage is designated by the superscript: n = initial learning, e = error after reversal, c = correct after reversal, r = retrieval test.
a^(n)_CA3 = region CA3 activity. Superscript designates stage of task.
a_CA1(t) = region CA1 activity. Varies with time; t_n = end of initial period, t_{n+e} = end of erroneous trials, t_{n+e+c} = end of correct trials.
W_CA3(t) = synaptic input matrix from region CA3.
W_EC = synaptic input matrix from entorhinal cortex.
h_CA3(t) = oscillatory modulation of synaptic transmission from CA3 during theta rhythm, with phase φ_CA3.
h_EC(t) = oscillatory modulation of synaptic transmission from entorhinal cortex during theta rhythm, with phase φ_EC.
h_LTP(t) = oscillatory modulation of the rate of synaptic modification during theta rhythm, with phase φ_LTP.
M = performance measure in the reversal task, dependent on the relative phase of oscillatory modulation.
Acknowledgments
We appreciate the assistance on experiments of Christiane Linster and Bradley Molyneaux. We also appreciate critical comments by Marc Howard, James Hyman, Terry Kremin, Ana Nathe, Anatoli Gorchetchnikov, Norbert Fortin, and Seth Ramus. In memory of Patricia Tillberg Hasselmo. This work was supported by NIH MH61492, NIH MH60013, and NSF IBN9996177.
References

Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzsaki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. Journal of Neuroscience, 15, 47–60.
Brankack, J., Stewart, M., & Fox, S. E. (1993). Current source density analysis of the hippocampal theta rhythm: Associated sustained potentials and candidate synaptic generators. Brain Research, 615(2), 310–327.
Buno, W., Jr., & Velluti, J. C. (1977). Relationships of hippocampal theta cycles with bar pressing during self-stimulation. Physiol. Behav., 19(5), 615–621.
Buzsaki, G., Czopf, J., Kondakor, I., & Kellenyi, L. (1986). Laminar distribution of hippocampal rhythmic slow activity (RSA) in the behaving rat: Current-source density analysis, effects of urethane and atropine. Brain Res., 365(1), 125–137.
Buzsaki, G., Leung, L. W., & Vanderwolf, C. H. (1983). Cellular bases of hippocampal EEG in the behaving rat. Brain Res., 287(2), 139–171.
Fox, S. E. (1989). Membrane potential and impedance changes in hippocampal pyramidal cells during theta rhythm. Exp. Brain Res., 77, 283–294.
Fox, S. E., Wolfson, S., & Ranck, J. B. J. (1986). Hippocampal theta rhythm and the firing of neurons in walking and urethane anesthetized rats. Exp. Brain Res., 62, 495–508.
Frank, L. M., Brown, E. N., & Wilson, M. (2000). Trajectory encoding in the hippocampus and entorhinal cortex. Neuron, 27(1), 169–178.
Givens, B. (1996). Stimulus-evoked resetting of the dentate theta rhythm: Relation to working memory. Neuroreport, 8, 159–163.
Golding, N. L., & Spruston, N. (1998). Dendritic sodium spikes are variable triggers of axonal action potentials in hippocampal CA1 pyramidal neurons. Neuron, 21(5), 1189–1200.
Green, J. D., & Arduini, A. A. (1954). Hippocampal electrical activity and arousal. J. Neurophysiol., 17, 533–577.
Gustafsson, B., Wigstrom, H., Abraham, W. C., & Huang, Y. Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7(3), 774–780.
Hampson, R. E., Jarrard, L. E., & Deadwyler, S. A. (1999). Effects of ibotenate hippocampal and extrahippocampal destruction on delayed-match and nonmatch-to-sample behavior in rats. J. Neurosci., 19(4), 1492–1507.
Hampson, R. E., Simeral, J. D., & Deadwyler, S. A. (1999). Distribution of spatial and nonspatial information in dorsal hippocampus. Nature, 402(6762), 610–614.
Hasselmo, M. E., Wyble, B. P., & Wallenstein, G. V. (1996). Encoding and retrieval of episodic memories: Role of cholinergic and GABAergic modulation in the hippocampus. Hippocampus, 6(6), 693–708.
Holscher, C., Anwyl, R., & Rowan, M. J. (1997). Stimulation on the positive phase of hippocampal theta rhythm induces long-term potentiation that can be depotentiated by stimulation on the negative phase in area CA1 in vivo. J. Neurosci., 17(16), 6470–6477.
Huerta, P. T., & Lisman, J. E. (1993). Heightened synaptic plasticity of hippocampal CA1 neurons during a cholinergically induced rhythmic state. Nature, 364, 723–725.
Huerta, P. T., & Lisman, J. E. (1995). Bidirectional synaptic plasticity induced by a single burst during cholinergic theta oscillation in CA1 in vitro. Neuron, 15(5), 1053–1063.
Kali, S., & Dayan, P. (2000). The involvement of recurrent connections in area CA3 in establishing the properties of place fields: A model. J. Neurosci., 20(19), 7463–7477.
Kamondi, A., Acsady, L., Wang, X. J., & Buzsaki, G. (1998). Theta oscillations in somata and dendrites of hippocampal pyramidal cells in vivo: Activity-dependent phase-precession of action potentials. Hippocampus, 8(3), 244–261.
Leung, L.-W. S. (1984). Model of gradual phase shift of theta rhythm in the rat. J. Neurophysiol., 52, 1051–1065.
Macrides, F., Eichenbaum, H. B., & Forbes, W. B. (1982). Temporal relationship between sniffing and the limbic theta rhythm during odor discrimination reversal learning. J. Neurosci., 2, 1705–1717.
Markowska, A. L., Olton, D. S., Murray, E. A., & Gaffan, D. (1989). A comparative analysis of the role of fornix and cingulate cortex in memory: Rats. Exp. Brain Res., 74(1), 187–201.
McNaughton, B. L., Barnes, C. A., & O'Keefe, J. (1983). The contributions of position, direction, and velocity to single unit activity in the hippocampus of freely-moving rats. Exp. Brain Res., 52, 41–49.
M'Harzi, M., Palacios, A., Monmaur, P., Willig, F., Houcine, O., & Delacour, J. (1987). Effects of selective lesions of fimbria-fornix on learning set in the rat. Physiol. Behav., 40, 181–188.
Murray, E. A., Davidson, M., Gaffan, D., Olton, D. S., & Suomi, S. (1989). Effects of fornix transection and cingulate cortical ablation on spatial memory in rhesus monkeys. Exp. Brain Res., 74(1), 173–186.
Numan, R. (1978). Cortical-limbic mechanisms and response control: A theoretical review. Physiol. Psych., 6, 445–470.
Numan, R., Feloney, M. P., Pham, K. H., & Tieber, L. M. (1995). Effects of medial septal lesions on an operant go/no-go delayed response alternation task in rats. Physiol. Behav., 58, 1263–1271.
Numan, R., & Klis, D. (1992). Effects of medial septal lesions on an operant delayed go/no-go discrimination in rats. Brain Res. Bull., 29, 643–650.
Numan, R., & Quaranta, J. R., Jr. (1990). Effects of medial septal lesions on operant delayed alternation in rats. Brain Res., 531, 232–241.
O'Keefe, J., & Recce, M. L. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3, 317–330.
Orr, G., Rao, G., Stevenson, G. D., Barnes, C. A., & McNaughton, B. L. (1999). Hippocampal synaptic plasticity is modulated by the theta rhythm in the fascia dentata of freely behaving rats. Soc. Neurosci. Abstr., 25, 2165 (864.14).
Otto, T., & Eichenbaum, H. (1992). Neuronal activity in the hippocampus during delayed non-match to sample performance in rats: Evidence for hippocampal processing in recognition memory. Hippocampus, 2(3), 323–334.
Paulsen, O., & Moser, E. I. (1998). A model of hippocampal memory encoding and retrieval: GABAergic control of synaptic plasticity. Trends Neurosci., 21(7), 273–278.
Pavlides, C., Greenstein, Y. J., Grudman, M., & Winson, J. (1988). Long-term potentiation in the dentate gyrus is induced preferentially on the positive phase of theta-rhythm. Brain Res., 439(1–2), 383–387.
Rudell, A. P., & Fox, S. E. (1984). Hippocampal excitability related to the phase of the theta rhythm in urethanized rats. Brain Res., 294, 350–353.
Rudell, A. P., Fox, S. E., & Ranck, J. B. J. (1980). Hippocampal excitability phase-locked to the theta rhythm in walking rats. Exp. Neurol., 68, 87–96.
Semba, K., & Komisaruk, B. R. (1984). Neural substrates of two different rhythmical vibrissal movements in the rat. Neuroscience, 12(3), 761–774.
Skaggs, W. E., McNaughton, B. L., Wilson, M. A., & Barnes, C. A. (1996). Theta phase precession in hippocampal neuronal populations and the compression of temporal sequences. Hippocampus, 6, 149–172.
Sohal, V. S., & Hasselmo, M. E. (1998). Changes in GABAB modulation during a theta cycle may be analogous to the fall of temperature during annealing. Neural Computation, 10, 889–902.
Stewart, M., & Fox, S. E. (1990). Do septal neurons pace the hippocampal theta rhythm? Trends Neurosci., 13, 163–168.
Stewart, M., Quirk, G. J., Barry, M., & Fox, S. E. (1992). Firing relations of medial entorhinal neurons to the hippocampal theta rhythm in urethane anesthetized and walking rats. Exp. Brain Res., 90(1), 21–28.
Toth, K., Freund, T. F., & Miles, R. (1997). Disinhibition of rat hippocampal pyramidal cells by GABAergic afferents from the septum. J. Physiol., 500, 463–474.
Wallenstein, G. V., & Hasselmo, M. E. (1997). GABAergic modulation of hippocampal population activity: Sequence learning, place field development and the phase precession effect. J. Neurophysiol., 78(1), 393–408.
Whishaw, I. Q., & Tomie, J. A. (1997). Perseveration on place reversals in spatial swimming pool tasks: Further evidence for place learning in hippocampal rats. Hippocampus, 7(4), 361–370.
Wiener, S. I., Paul, C. A., & Eichenbaum, H. (1989). Spatial and behavioral correlates of hippocampal neuronal activity. J. Neurosci., 9(8), 2737–2763.
Wood, E. R., Dudchenko, P. A., Robitsek, R. J., & Eichenbaum, H. (2000). Hippocampal neurons encode information about different types of memory episodes occurring in the same location. Neuron, 27(3), 623–633.
Wyble, B. P., Hyman, J. M., Goyal, V., & Hasselmo, M. E. (2001). Phase relationship of LTP induction and behavior to theta rhythm in the rat hippocampus. Soc. Neurosci. Abstr., 27.
Wyble, B. P., Linster, C., & Hasselmo, M. E. (2000). Size of CA1-evoked synaptic potentials is related to theta rhythm phase in rat hippocampus. J. Neurophysiol., 83(4), 2138–2144.
Young, B. J., Otto, T., Fox, G. D., & Eichenbaum, H. (1997). Memory representation within the parahippocampal region. J. Neurosci., 17, 5183–5195.

Received September 28, 2000; accepted June 15, 2001.
LETTER
Communicated by Andrew Barto
Self-Organization in the Basal Ganglia with Modulation of Reinforcement Signals

Hiroyuki Nakahara
[email protected]
Shun-ichi Amari
[email protected]
Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan

Okihide Hikosaka
[email protected]
Department of Physiology, School of Medicine, Juntendo University, 2-1-1 Hongo, Bunkyo, Tokyo 113-0033, Japan
Self-organization is one of the fundamental brain computations for forming efficient representations of information. Experimental support for this idea has been largely limited to the developmental and reorganizational formation of neural circuits in the sensory cortices. We now propose that self-organization may also play an important role in short-term synaptic changes in reward-driven voluntary behaviors. It has recently been shown that many neurons in the basal ganglia change their sensory responses flexibly in relation to rewards. Our computational model proposes that the rapid changes in striatal projection neurons depend on the subtle balance between the Hebb-type mechanisms of excitation and inhibition, which are modulated by reinforcement signals. Simulations based on the model are shown to produce various types of neural activity similar to those found in experiments.

1 Introduction
The basal ganglia (BG) are well known to contribute to sequential motor and cognitive behaviors (Knopman & Nissen, 1991; Graybiel, 1995). Almost the entire cerebral cortex projects to the BG, and the BG project mainly back to the frontal cortex through the thalamus and to the superior colliculus. A striking fact about the BG is a vast convergent projection from the cerebral cortex to the striatum, a major input zone of the BG (Oorschot, 1996), and another convergent projection from the striatum to the output nuclei of the BG, that is, the globus pallidus internal segment (GPi) and the substantia nigra pars reticulata (SNr). Given this fact, it is expected that the BG

Neural Computation 14, 819–844 (2002)
© 2002 Massachusetts Institute of Technology
have efficient representations of cortical inputs to interact effectively with the cerebral cortex (Graybiel, Aosaki, Flaherty, & Kimura, 1994). We should note that the majority of the neurons in the striatum, including the projection neurons and some types of interneurons, are GABAergic, most likely acting as inhibitory neurons within the striatum as well as on target neurons in projection areas (Kita, 1993; Kawaguchi, Wilson, Augood, & Emson, 1995; Wilson, 1998). The BG, the striatum in particular, receive rich reinforcement signals from dopaminergic (DA) neurons, which originate in the substantia nigra pars compacta (SNc). It has been observed that the neural responses in the striatum are strongly modulated by DA neurons or reward conditions (Aosaki et al., 1994; Schultz, Apicella, Romo, & Scarnati, 1995; Kawagoe, Takikawa, & Hikosaka, 1998). DA neurons exhibit a phasic activity when an unexpected reward occurs or when a conditioned stimulus appears that allows the subject to anticipate a coming reward (Schultz, 1998). Driven by these reinforcement signals carried by DA neurons, the striatum has been
Figure 1: Facing page. (A) Example of the memory-guided saccade task in a one-direction-rewarded condition (1DR). In experiments, there were two reward conditions: 1DR and ADR (all-direction-rewarded condition) conditions. The task procedure in each trial is the same as a memory-guided saccade task between both conditions except reward conditions: a task trial started with the onset of a central fixation point, which the monkeys had to fixate. A cue stimulus (spot of light) then came at one of the four directions. After the fixation point turned off, the monkeys had to make a saccade to the cued location. In ADR, all directions are rewarded after a saccade to the cued location in each trial. In 1DR, throughout a block of the experiment (60 trials), only one direction was rewarded among four directions. In the example shown here, the right direction is rewarded. Even for nonrewarded directions in 1DR, the monkeys had to make correct saccades; otherwise, the same trial was repeated. 1DR was performed in four blocks, in each of which a different direction was rewarded. Other than the actual reward, no indication was given to the monkeys as to which direction was to be rewarded. (B) Examples of the three types of neural responses observed in the experiment are shown over four 1DR blocks (taken from Kawagoe et al., 1998). Data obtained in each block of 1DR (left) are shown as a polar diagram indicating the magnitudes of the responses for four cue directions. The rewarded direction is indicated by a bull's-eye mark. (Top) Flexible type (the most frequently observed type) changes its preferred direction quickly to the rewarded direction in each 1DR block. (Middle) Conservative type maintains its preferred direction (rightward for this neuron) across 1DR blocks (at least in two of four 1DR blocks), while the response toward each direction is enhanced when it is rewarded in each 1DR block.
(Bottom) Reverse type (less frequently observed than the other two types) shows the smallest response to the rewarded direction in each 1DR block.
considered to undergo heterosynaptic plasticity (Calabresi, Maj, Pisani, Mercuri, & Bernardi, 1992; Wickens & Kötter, 1995) and to contribute to skill memory formation through the cortico-basal ganglia loops (Marsden, 1980; Alexander, Crutcher, & DeLong, 1990; Knowlton, Mangels, & Squire, 1996; Hikosaka et al., 1999; Nakahara, Doya, & Hikosaka, 2001). In a recent experiment (Kawagoe et al., 1998), reward-modulated changes of neural responses in the caudate, a part of the striatum, were investigated in a systematic manner, using asymmetrically rewarded memory-guided saccade tasks, in which one of the four directions was randomly chosen as the saccade target in a trial and the monkeys had to make a memory-guided saccade in the trial (see Figure 1A). In this task, there were two reward
[Figure 1 panels: (A) task sequence with Reward/No Reward outcomes for the four directions; (B) polar diagrams for the Flexible, Conservative, and Reverse response types.]
conditions: the all-directions-rewarded condition (ADR), where all four directions were rewarded in a block of experiments after a correct saccade in a trial, and the one-direction-rewarded condition (1DR), where only one fixed direction was rewarded after a correct saccade in a trial (the other three directions were not rewarded throughout a block even when a correct saccade was made). In visual and memory-related periods, the preferred directions of the caudate neural responses observed in ADR (Kawagoe et al., 1998) were typically found to be contralateral, as previously reported (Hikosaka, Sakamoto, & Usui, 1989), so that the caudate neurons exhibited spatial-directional selective responses. The caudate neural responses, however, were strongly modulated by the rewarded directions in the 1DR blocks (see Figure 1B). Three typical patterns were found in the response changes: flexible, conservative, and reverse type patterns (see Figure 1B; their definitions are provided in the legend). When a 1DR block is altered, the caudate responses change within ten to a few tens of trials and develop much more slowly than the DA activities (Kawagoe et al., 1998; Kawagoe, Takikawa, & Hikosaka, 1999) (see section 2.1). Such changes are supposed to be caused by synaptic plasticity under the influence of the DA activity (Calabresi et al., 1992; Wickens & Kötter, 1995; Reynolds & Wickens, 2000), while the DA-induced changes in the internal states of the striatal neurons may play a role as well (Surmeier, Song, & Yan, 1996). These results suggest that
Figure 2: Facing page. (A) Schematic diagram of the relationship of the cerebral cortex and the caudate. Spatial information is conveyed from the cerebral cortex to the caudate, while reinforcement signals are provided via dopaminergic projections (DA). Black and white arrowheads indicate excitatory and inhibitory connections, respectively. (B) Scheme of self-organization in the caudate. Directional information is topographically represented in the cerebral cortex with two components: one reflecting each directional selectivity (N_i) and the other reflecting some overlap between different directions (M). Common inputs (M) are shared by cortical representations for different directions (x_i), and in the figure, cortical representations for two directions are shown. Reinforcement signals (a) carried by dopamine (DA) neurons influence cortical inputs. An integrated inhibitory input to caudate neurons is denoted by x_0. (C) Schematic example of a conservative-type neuron at equilibrium, responding to its preferred (but nonrewarded) direction (x_1) and a rewarded direction (x_2). This neuron does not fire to x_3 (x_4 is dropped here for simplicity). After the learning converges, the synaptic weight converges to w̄ = (x_1 + x_2)/2. By assuming ‖w‖ = 1, each of the inner products between the input (x_i) and the weight (w̄), with and without reinforcement signal modulation, is indicated on the right. The inhibitory effect is summarized by λ, which is also indicated on the right. Since w̄ · x_3 < λ < w̄ · x_1, w̄ · x_2 (1 + a_2), the neuron responds to x_1, and to x_2 with the reinforcement signal, but not to x_3. If a_2 is not provided, the neuron does not respond to x_2. The dashed cone region indicates the receptive region of this neuron. The neuron responds to all of the inputs in this region, which can be modulated by reinforcement signals.
Self-Organization in the Basal Ganglia
823
the caudate neurons rapidly alter their response properties, guided by the DA activity, when the reward condition changes (see the next section and section 4). Importantly, the pattern of changes in a response can vary in each neuron and is classified roughly into one of three categories. We propose that the caudate neurons self-organize their responses under the control of both the intrinsic cortical inputs representing each cue direction and the DA reinforcement signals reflecting the reward conditions (Schultz, 1998; see Figure 2A). We therefore study a simplified self-organization model with
824
H. Nakahara, S. Amari, and O. Hikosaka
reinforcement signals (see Figure 2B). It is possible to analyze its behaviors rigorously by giving conditions that guarantee the requested behaviors (Kawagoe et al., 1998). Our model includes plastic changes in both the excitatory and inhibitory synapses, and their subtle balance generates a variety of phenomena, as seen in experiments. Under each condition, we can determine which pattern of neural responses appears, using the parameters of both the reinforcement signal and the inhibitory effect with respect to the cortical representations. Our model can explain these typical patterns of modulation by the same simple mechanism in a unified way, which would help us predict the relationship between the neural response modulations and the cortico-striatal and nigrostriatal connections of the neurons.

2 A Theoretical Model

2.1 A Self-Organization Neuron Model. The emergence of various types of self-organization has been studied previously in a unified manner (Amari, 1977, 1983; Amari & Takeuchi, 1978). Our model of the striatum neurons is a new version in that the internal state of a neuron u(t) at time t is enhanced by the reinforcement signal a(t),

u(t) = w(t) · x(t){1 + a(t)} − w0(t) x0(t),   (2.1)
where the vector x stands for the excitatory cortical inputs to a neuron in the striatum, corresponding to the cue signal, which is one of the four directions, while x0 summarizes a population of inhibitory inputs as a single variable in this model and is assumed to be a constant for simplicity in the later analysis. Here, w denotes the weights of the cortico-striatal connections of this neuron for x, and w0 denotes the weight for the inhibitory input x0. The term a(t) denotes the effect of the reinforcement signal, carried by dopaminergic (DA) neurons, on the striatal projection neuron firing rates. Experimentally, the DA effects on striatal firing rates have been found to be facilitatory (i.e., a > 0 in our model) or inhibitory (a < 0). Facilitatory and inhibitory effects may be mediated by D1 and D2 receptors, respectively (Gerfen, 1992; Cepeda, Buchwald, & Levine, 1993). Alternatively, both effects can be mediated by the bistable nature of D1 receptors (Nicola, Surmeier, & Malenka, 2000; Gruber, 2000). The phasic DA activities occur in general to an unexpected reward or to a conditioning stimulus preceding a reward r, once the conditional relationship is well established (Schultz, Apicella, & Ljungberg, 1993). While this phasic DA activity is hypothesized to provide a reward prediction error, the striatal neurons are considered to change their neural activities using this reinforcement signal from the DA neurons (Schultz, 1998). Correspondingly, experiments in 1DR and ADR (Kawagoe et al., 1999) have shown that there is phasic DA activity locked to the rewarded direction cue in each 1DR (ADR) block. When a new 1DR block is started, the phasic DA activity quickly shifts to the rewarded cue within a few trials and stays unchanged throughout the block, possibly having a very small, but negligible, decay toward the end of the block (Kawagoe et al., 1999). Accordingly, we denote the phasic DA activity by a and assume
that a changes without delay between experimental blocks and stays the same throughout a block. The caudate neurons also change their activities when a 1DR block is altered. Interestingly, the changes in the caudate neural activities are much slower, taking a few tens of trials, than those in the DA activity. This suggests that the plastic caudate activities may be induced by synaptic changes in the cortico-caudate projection, under the DA modulation, rather than by direct DA-induced changes in the caudate response properties, although the latter effect may also play a role, particularly in initiating such plastic caudate activities (see sections 3.1 and 4). Let us now look into the time course of the DA activity in a single trial. The phasic DA activity (a) starts to rise roughly around 100 milliseconds after the cue presentation, when it is rewarded. Next, it starts to drop, first rapidly, until around 400 milliseconds and then gradually to the resting level around 700 milliseconds, during which a very weak DA activity, still larger than at the resting level, exists (Kawagoe et al., 1999; also see Schultz et al., 1993; Schultz, Romo, Ljungberg, Mirenowicz, Hollerman, & Dickinson, 1995). To distinguish it from the phasic peak activity a, we denote this weak DA activity by a0, which can be regarded as nearly, but not exactly, zero (i.e., at the resting level) (therefore, a0 ≈ 0). In contrast, the post-cue caudate neural activities, depending on each neuron, are not confined to the time of the phasic DA period but extend into the time of the weak DA activity (and even later) (Kawagoe et al., 1998, 1999). In other words, the plastic changes in the caudate neural activities possibly carry over not only in the time of a but also at least in the time of a0, suggesting that the DA-modulated synaptic changes induce the plastic changes in the caudate neural activities. Note that when a(t) = 0, the neuron model in equation 2.1 is equivalent to the primitive self-organization model (Amari & Takeuchi, 1978). When the reinforcement signal exists (i.e., |a| > 0), the internal state of the neuron u(t) increases or decreases and controls the self-organizing process, so that the reinforcement signal works in a modulatory manner. The output y(t) of the neuron is given by the transfer function

y(t) = f{u(t)}.   (2.2)
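As a concrete illustration, equations 2.1 and 2.2 can be written as a few lines of Python (a minimal sketch; the function names, weight vectors, and parameter values are illustrative assumptions, not taken from the experiments):

```python
import numpy as np

def internal_state(w, x, a, w0, x0):
    """Equation 2.1: u = w . x * (1 + a) - w0 * x0."""
    return np.dot(w, x) * (1.0 + a) - w0 * x0

def output(u):
    """Equation 2.2 with the step transfer function: y = 1 if u > 0, else 0."""
    return 1.0 if u > 0 else 0.0

# A facilitatory reinforcement signal (a > 0) scales the excitatory drive
# and can push an otherwise subthreshold neuron past threshold.
w, x = np.array([0.5, 0.5]), np.array([1.0, 0.0])
print(output(internal_state(w, x, a=0.0, w0=1.0, x0=0.6)))  # 0.0 (silent)
print(output(internal_state(w, x, a=0.5, w0=1.0, x0=0.6)))  # 1.0 (fires)
```

This makes the modulatory role of a concrete: it does not enter the learning rule directly but changes u, and hence the output y.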
To obtain an analytical solution, we first choose f(·) as the step function, given as 1(u) = 1 (if u > 0) and 0 otherwise. This is the simplest choice, allowing us to derive an explicit analytical solution for the dynamical behavior of learning in the model. This simplification implies that the state of the neuron is binary, that is, the neuron fires (y = 1) or does not (y = 0). Later, we let f(·) be the sigmoid function f(a) = 1/(1 + e^(−ca)), where a is a real value and c is a scaling parameter. The dynamics of a Hebbian-type learning rule is chosen to show changes in synaptic efficacies,

τ ẇ(t) = −w(t) + c y(t) x(t)
τ ẇ0(t) = −w0(t) + c0 y(t) x0,   (2.3)

where ẇ and ẇ0 represent the time derivatives dw/dt and dw0/dt, and c and c0 are learning rates. In equation 2.3, y(t) depends on a(t), because y(t) = f{u(t)}
and u(t) is a function of x(t) and a(t). In other words, DA signals, particularly the phasic ones, initiate and guide the learning process, where the phasic activity (a) quickly changes to the weaker one (a0 ≈ 0) in the time course of a trial. In the first equation, the second term yx is Hebbian, indicating that the weight increases in proportion to the input x when the output y of the neuron is positive. The above model is different from the ordinary Hebb-type neuron in that the inhibitory weight w0 is also modifiable, as shown in the second equation (Amari & Takeuchi, 1978; Amari, 1983). The intrinsic mechanism of the model is based on the balance between the excitatory and inhibitory effects, mediated by modifiable weights under the initial modulation of the reinforcement signal. To understand the behavior of models of this kind, we need to examine their dynamical behaviors of learning, typically the equilibrium states and their stability under a stationary environment from which input signals x(t) are supplied. Many stable equilibrium states exist in general, and this multistability is important to allow an ensemble of neurons, exposed to the same environment, to differentiate with different neural responses and capture various features of the environment.

2.2 Analysis of Learning Dynamics. When inputs xi are presented with probabilities pi ≡ p(xi) (i = 1, 2, 3, 4) and reinforcement signals a(xi), which quickly change into a0(xi), the averaged learning equation is given by

τ ẇ = −w(t) + c Σi pi yi xi
τ ẇ0 = −w0(t) + c0 Σi pi yi x0.   (2.4)
The synaptic weights converge to the equilibrium state (w̄, w̄0), satisfying ẇ = ẇ0 = 0:

w̄ = c Σi pi yi xi
w̄0 = c0 Σi pi yi x0.   (2.5)
Equation 2.5 is not the explicit solution for w̄ and w̄0, since yi on the right-hand side depends on w̄ and w̄0, so it is the equation to be solved for them. After the weights have converged by learning, the internal state of the neuron for input x with accompanying reinforcement signal a (or a0) is

ū = w̄ · x(1 + a) − w̄0 x0.   (2.6)
If ū > 0 for input x, this neuron fires. In order to analyze the characteristics of this neuron, we define the receptive field R of this neuron as

R = {x | ū(x) > 0},   (2.7)

and, using R,

p_R = Σ_{xi ∈ R} pi,   (2.8)

w_R = (Σ_{xi ∈ R} pi xi) / p_R.   (2.9)
Intuitively speaking, p_R stands for the probability mass of the signals in region R, and w_R stands for the center of gravity of region R. Recall that yi = 1(u(xi)), so that Σ yi xi reduces to the sum over all x in the receptive field R. Hence, equation 2.6 can be rewritten as

ū_j = ū(x_j) = c p_R (1 + a_j) ( w_R · x_j − (c0 / (c(1 + a_j))) x0² ),   (2.10)
where a_j = a(x_j). When x_j is rewarded, a_j = a in the beginning and becomes a0 later. Therefore, by defining

λ ≡ (c0 / c) x0²,   (2.11)

the condition for a neuron to fire in response to x_j is given by K(x_j) > 0, where

K(x_j) ≡ w_R · x_j − λ / (1 + a_j),   (2.12)

and the condition to be silent in response to x_j is given by K(x_j) ≤ 0. The two conditions are necessary but not sufficient. Note that in the later period of a trial, that is, when a_j = a0, the neuron still fires and K(x_j) > 0 for a_j = a0 ≈ 0. Thus, equation 2.12 provides a mathematical criterion for studying the receptive region generated by learning. We can see that λ is the important parameter that controls the size of R. As λ increases, the receptive field becomes smaller. To ensure sufficiency, we need to check that

K(x) > 0 for all x ∈ R,  and  K(x) ≤ 0 for all x ∉ R.
This is because w_R depends on R, and R is defined as the set of inputs x by which the neuron should fire (Amari & Takeuchi, 1978; Amari, 1983).

2.3 Analysis of Caudate Neurons. The emergence of the three response types of neurons is explained here. There are two types of 1DR blocks in the actual experiment: an exclusive 1DR and a relative 1DR (Kawagoe et al., 1998). We discuss only the former here, because our analysis can be easily applied to the latter with slight modifications. The cortical inputs {xi} (i = 1, …, 4), corresponding to the four cue directions, are given with probability p(xi) = 1/4 in the experiment (Kawagoe et al., 1998). In the exclusive 1DR, only one of the four directions is a rewarded direction (say, xi with the reward ri = r, r > 0); the other three directions are not rewarded (rj = 0, j ≠ i). When one block of experiments starts, we have ai = a (|a| > 0; a becomes a0 shortly in each trial) and aj = 0 (j ≠ i), respectively, as the effect of the phasic reinforcement signal by DA neurons on the striatal projection neuron (see section 2.1). We now describe how the four cue directions are represented in the cortical signal x to a caudate neuron in our model. The signal consists of a bundle of inputs in which some parts are common to all of the cue directions, just representing the appearance of a cue, and the other parts include specific, exclusive
information for each direction. Let M be the number of common inputs, and let Ni be the number of inputs specific to xi. We rearrange the components so that the M common inputs appear first. We then have the following representation (the five blocks have M, N1, N2, N3, and N4 components, respectively):

x1 = (1, …, 1, 1, …, 1, 0, …, 0, 0, …, 0, 0, …, 0)
x2 = (1, …, 1, 0, …, 0, 1, …, 1, 0, …, 0, 0, …, 0)
x3 = (1, …, 1, 0, …, 0, 0, …, 0, 1, …, 1, 0, …, 0)
x4 = (1, …, 1, 0, …, 0, 0, …, 0, 0, …, 0, 1, …, 1).   (2.13)

The inner products among {xi} are given by

xi · xj = M + Ni (i = j),  xi · xj = M (i ≠ j).   (2.14)
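The representation of equation 2.13 and the inner products of equation 2.14 are easy to verify in code (a small sketch; `cortical_inputs` and the particular M, N_i values are illustrative choices):

```python
import numpy as np

def cortical_inputs(M, Ns):
    """Equation 2.13: each x_i has M shared components set to 1 plus
    N_i components specific to direction i."""
    total = M + sum(Ns)
    xs, offset = [], M
    for Ni in Ns:
        x = np.zeros(total)
        x[:M] = 1.0                      # common block (M ones)
        x[offset:offset + Ni] = 1.0      # direction-specific block (N_i ones)
        offset += Ni
        xs.append(x)
    return xs

M, Ns = 3, [2, 4, 1, 5]
xs = cortical_inputs(M, Ns)
# Equation 2.14: x_i . x_j = M + N_i when i = j, and M otherwise.
for i in range(4):
    for j in range(4):
        expected = M + Ns[i] if i == j else M
        assert np.dot(xs[i], xs[j]) == expected
```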
This definition of the cortical representations for each direction is only for presentational simplicity. The following theorems can be proved, as is evident in their proofs, with a general definition of inner products, xi · xj = gij, if we wish. Through this general form of the inner product, the magnitude of the cortical representation for each direction is determined, and furthermore, the proximity between the cortical representations for different directions is determined. These properties are essential to determine each response type together with the other two parameters, λ and a, as shown in the three theorems below, where equation 2.14 is used.

Theorem 1. A neuron behaves as the flexible type in all four 1DR blocks if its parameters satisfy

M + N_max/2 ≤ λ < min{M(1 + a), (M + N_min)(1 + a0)},   (2.15)

where N_max = max_{i∈I} Ni, N_min = min_{i∈I} Ni, and I = {1, 2, 3, 4}.
Proof. Without loss of generality, we assume that the rewarded direction is x1 and, hence, a1 = a (a > 0) in the beginning and aj = 0 (j ≠ 1). We assume that the initial weight w is large enough that the neuron is excited by x1. We search for the condition that R = {x1} is an equilibrium state of the equation, implying that the neuron is excited only by x1. If this is the case, p_R = 1/4 and w_R¹ = x1. In equation 2.12, we further require K(x1) > 0 and K(xi) ≤ 0 (i ≠ 1) even when a is reduced to a0 in the process. These are equivalently rewritten as M ≤ λ < (M + N1)(1 + a0). Now suppose that the reward direction is changed to x2 in the next block. We need to obtain the condition to ensure that the neuron changes its behavior
to respond only to x2. By noting that the initial state of the weight vector of the neuron in the new block is the same as the final equilibrium state of the weights in the previous block, we require w_R¹ · x2 − λ/(1 + a) > 0 for the neuron to respond to x2 in the beginning of the new block, where the reward direction changed and a2 = a. This gives λ < M(1 + a). Even when this condition is satisfied, it may happen that this neuron responds to both x1 and x2 in the beginning of the second block, though x1 is no longer rewarded, that is, a1 = 0. However, this changes as learning takes place in the new situation. We then need a condition such that while the neuron may initially respond to both x1 and x2, the neuron stops responding to x1 before reaching the equilibrium state of R = {x2}. This condition is given by

M + N2/2 ≤ λ.

By summarizing the above conditions and taking all 1DR blocks into account, the theorem is proved.

The theorems for the other two types are given as follows.

Theorem 2. A neuron behaves as the conservative type in all four 1DR blocks if its parameters satisfy

M + N′_max/2 ≤ λ < min{M(1 + a), (M + N′_min/2)(1 + a0), M + N1/2},   (2.16)

where we assume that x1 is the preferred direction (i.e., N1 = N_max), and N′_max = max_{i∈I′} Ni, N′_min = min_{i∈I′} Ni, and I′ = {2, 3, 4}.

Theorem 3. A neuron behaves as the reverse type in all four 1DR blocks if its parameters satisfy

(M + N_max/4)(1 + a0) ≤ λ < M,   (2.17)

where N_max = max_{i∈I} Ni and a < a0 < 0.

See the appendix for the proofs of theorems 2 and 3. Figure 3A demonstrates each response type proved in the three theorems above. Note that a binary output neuron is used in the theorems, since the step function, as the transfer function of neurons, allows neurons only to fire or to be silent. Hence, each response type is defined accordingly: flexible-type neurons respond only for a rewarded direction in each 1DR block; conservative-type
Figure 3: The three types of simulated neural responses shown over four 1DR blocks (columns: flexible, conservative, and reverse types). The same parameter setting is used in A and B except for the transfer functions. (A) The step function is used as the transfer function to generate a neural output. (B) The sigmoid function is used as the transfer function.
neurons respond for both the rewarded and their intrinsic preferred directions in each 1DR block; reverse-type neurons respond for nonrewarded directions in each 1DR block. The three types proved in the theorems are chosen to stand for the most typical types of neural response changes across 1DR blocks observed in experiments. At the same time, within the rich variety of experimental neural responses, there are some types that are not included in the three types and other types that can be considered subtypes of one of the three types. For example, a very few neurons behave as if they were the reverse-conservative type, which has a larger response to each direction when it is not rewarded, while its intrinsic preferred direction is somewhat maintained in each 1DR block. Other neurons behave as if they were the super-conservative type, which has a response almost only for its intrinsic preferred direction in any 1DR block. It is possible to prove the conditions for these types, but we focus on the typical three types for clarity in this study.

2.4 Fine Characteristics of Neural Responses. The step function allows us to obtain exact analytical solutions; however, simulation results based on the step function differ from experimental ones in that neurons can be in only a binary mode: firing or silent (see Figure 3A). To simulate experimental results further, an analog sigmoid function can be employed, by which a normalized mean firing rate can be represented (see Figure 3B). In this alteration of the transfer function, qualitative behaviors are expected to be similar because the sigmoid function becomes similar to the step function as its steepness increases. Once the sigmoid function is used, the difference in the magnitudes of the responses (y = f(u)) can reflect the difference in the internal states (i.e., u).
Hence, by adjusting the proximity of the cortical inputs (i.e., the inner product in equation 2.14, or gij in general), the fine characteristics of 1DR responses can be represented. In other words, the directional selectivity in the caudate neural responses can be reflected (see Figure 3B; for example, the flexible type is set as having the leftward preferred direction). For flexible-type neurons in experiments, the response to each direction in ADR tends to be smaller than the corresponding rewarded response in 1DR. This tendency is observed in the model. We first note that the total amount of reward in one block was the same in ADR and 1DR in the experiments (Kawagoe et al., 1998). In other words, the amount of reward for each rewarded direction is four times larger in any 1DR block than in ADR. Under this condition, our model shows the observed tendency that neural responses become less distinctive in ADR than in 1DR. Given the step function as the transfer function, the condition for neurons of the flexible type to have responses in ADR as well as in 1DR is summarized by

M + N/2 < λ < min{M(1 + a), (M + N/4)(1 + a0/4)},

where we set N = Ni for simplicity. The difference of the internal states in 1DR and ADR (ū_1DR and ū_ADR) is given as

ū_1DR − ū_ADR = (3c/16)(Na + 4λ − 4M).
From the above two equations, we can conclude that ū_1DR − ū_ADR > 0; that is, we find the above-mentioned tendency in our model. However, the observed tendency might simply be due to the difference in the amount of reward per trial. But preliminary experimental results suggest that when a two-directional version of the 1DR task is employed, the response amplitude for one direction is influenced by the other chosen direction, indicating an interactive effect between the different directional inputs on the response amplitudes (Takikawa, Kawagoe, & Hikosaka, 2001). Hence, the tendency may not be a simple result of the difference in the amount of reward per trial.

3 Remarks

3.1 Neuron Model and Learning Rule. The neuron model in this study is given by equations 2.1 and 2.2, while the learning rule is given by equation 2.3. In this formulation, the reinforcement signal is not directly evident in the learning rule but has an indirect effect on the learning through y(t) = f(u) = f[w(t) · x(t){1 + a(t)} − w0(t) x0(t)]. Accordingly, the DA-induced modulation a influences the neural state u at first. It then works as a modulator, indirectly affecting the learning process of the synaptic weights w and w0. DA-induced changes in a neural state (see equation 2.1) eventually lead to selective changes of the synaptic efficacy through the learning process in the model (see equation 2.3). This may illuminate the issue of whether short-latency DA responses serve reinforcement learning or attentional switching (Redgrave, Prescott, & Gurney, 1999). In our model, when a new phasic DA response occurs to an unexpected reward, the phasic DA activity directly affects the striatal neural response properties ("attentional switching") and consequently initiates a new learning process ("reinforcement learning").

3.2 Emergence of Three Neural Response Types.
The results in the previous section demonstrate that all three types of neuronal behaviors emerge from the same mechanism, depending on the values of the underlying parameters (see Figure 3A). Table 1 summarizes the conditions for each type of neuron. All of the conditions are expressed by two factors, the term λ and the reinforcement signal a, with respect to the cortical representations (i.e., M, Ni). The term λ = (c0/c) x0² (see equation 2.11) is composed of the learning rates for the excitatory and inhibitory weights (i.e., c and c0) and the magnitude of the inhibitory input x0; roughly speaking, it summarizes the efficacy of learning in the inhibitory effect relative to that in the excitatory effect (see Figures 2B and 2C, and the next section). The term a indicates the effect of the reinforcement signal, carried by DA neurons, on the striatal firing rate. As shown in Table 1, the analysis indicates
Table 1: Conditions for Different Types of the Caudate Neurons.

Flexible:      M + N_max/2 ≤ λ < min{M(1 + a), (M + N_min)(1 + a0)};  a ≥ a0 > 0.
Conservative:  M + N′_max/2 ≤ λ < min{M(1 + a), (M + N′_min/2)(1 + a0), M + N1/2}
               (N1 > Ni, i ∈ I′);  a ≥ a0 > 0.
Reverse:       (M + N_max/4)(1 + a0) ≤ λ < M;  a ≤ a0 < 0.

Notes: N_max ≡ max_{i∈I} Ni, N_min ≡ min_{i∈I} Ni, I ≡ {1, 2, 3, 4}; N′_max ≡ max_{i∈I′} Ni, N′_min ≡ min_{i∈I′} Ni, I′ ≡ {2, 3, 4}; λ ≡ (c0/c) x0²; xi · xj = M + Ni (i = j), M (i ≠ j).
that the reinforcement signal a should work as facilitatory (a > 0) for the flexible and conservative types and as inhibitory (a < 0) for the reverse type. In our simplified definition of the inner product of the cortical representations (see equations 2.13 and 2.14), M represents the overlap between the cortical inputs, while Ni represents the part specific to each directional input. More generally, M + Ni corresponds to the square of the magnitude of each direction (i.e., |xi|²), and M corresponds to xi · xj = |xi||xj| cos θ_ij (i ≠ j), where θ_ij is the angle between xi and xj. In Table 1, therefore, terms such as M + Ni, M + Ni/2, and so on, express how the proximity of the cortical representations, or their overlap, influences the emergence of each type. For example, a critical difference between the flexible and conservative types is that a preferred cue direction input (x1 for N1 in Table 1) satisfies λ < M + N1/2 for the conservative type but M + N1/2 < λ for the flexible type, while there are some other conditions that differ between the two types. Now, we can write M + N1/2 = ½(‖x1‖² + x1 · xi), where i ∈ {2, 3, 4}. Hence, the term M + N1/2 is related to both the magnitude of the preferred cue direction input x1 and the angle between x1 and the other cue direction inputs. When the cortical representations {xi} are chosen as in equation 2.13, the conditions in Table 1 explicitly relate each response type to the magnitudes of the cortical inputs and their overlap, which is essentially the number of cortico-striatal connections reflecting each directional input (see equation 2.13).
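The conditions of Table 1 can be packaged as a small classifier (a sketch restating the published inequalities; the function name and the example parameter values are our own illustrative choices):

```python
def response_type(M, Ns, lam, a, a0, preferred=0):
    """Return the Table 1 type (if any) satisfied by the parameters.
    Ns[preferred] plays the role of N_1, the preferred direction."""
    nmax, nmin = max(Ns), min(Ns)
    others = [n for k, n in enumerate(Ns) if k != preferred]
    npmax, npmin = max(others), min(others)
    n1 = Ns[preferred]
    if a >= a0 > 0:
        # Flexible: M + N_max/2 <= lam < min{M(1+a), (M + N_min)(1+a0)}
        if M + nmax / 2 <= lam < min(M * (1 + a), (M + nmin) * (1 + a0)):
            return "flexible"
        # Conservative: additionally requires N_1 > N_i for the other directions
        if (all(n1 > n for n in others)
                and M + npmax / 2 <= lam < min(M * (1 + a),
                                               (M + npmin / 2) * (1 + a0),
                                               M + n1 / 2)):
            return "conservative"
    if a <= a0 < 0:
        # Reverse: (M + N_max/4)(1+a0) <= lam < M
        if (M + nmax / 4) * (1 + a0) <= lam < M:
            return "reverse"
    return None

print(response_type(1, [1, 1, 1, 1], 1.5, 0.6, 0.6))    # flexible
print(response_type(1, [2, 1, 1, 1], 1.55, 0.6, 0.6))   # conservative
print(response_type(1, [1, 1, 1, 1], 0.8, -0.5, -0.5))  # reverse
```

Parameter sets that satisfy none of the inequalities return `None`, corresponding to neurons outside the three typical categories.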
Since the cortical representations in equation 2.13 and their inner products in equation 2.14, used in Table 1, are only for presentation purposes, in more general terms, the conditions in Table 1 suggest that each response type is determined by the magnitudes and the proximity of the directional inputs conveyed through the cortico-striatal projections (i.e., via the inner product gij) in relation to λ and a. Our model may be generic enough to capture microscopic properties such as small cluster properties (Hikosaka et al., 1989; Flaherty &
Graybiel, 1994; Jaeger, Kita, & Wilson, 1994; Kincaid, Zheng, & Wilson, 1998) with a more detailed analysis of the cortical input proximities, along with 1DR experiments.
3.3 Effect of Inhibitory Weight Modifiability. The term λ, roughly speaking, summarizes the efficacy of the inhibitory learning dynamics. If λ is too small, all neurons come to fire for any directional input at equilibrium, whereas if λ is too large, all neurons become silent for any directional input at equilibrium. For an ensemble of neurons to acquire discriminative power by self-organization, λ should be set appropriately so that the neurons can come to respond to different stimuli with differential response properties, governed by their cortical input proximities and reinforcement signals. Thus, while λ = (c0/c) x0² is kept constant for each neuron in our simple model, neurons with different values of λ lead to different response types. As an example, let us consider how a neuron of the flexible type can maintain its response type in the transition between 1DR blocks. In the transition, a neuron of the flexible type should start to respond to a new rewarded direction in the beginning of the second block (see the proof of theorem 1). In this situation, the inhibitory effect w̄0 x0 in a neural state u is given as w̄0 x0 = c0 p_R x0² = c p_R λ. Thus, if λ is so large that it makes u = w̄ · x(1 + a) − w̄0 x0 < 0, then the neuron fails to respond to the rewarded direction in the transition. In this sense, λ should be set relatively small (Jaeger et al., 1994). On the other hand, if λ is so small that it allows a neuron to respond to both the previous reward direction and the current reward direction even at equilibrium, then the neuron fails to maintain the response property of the flexible type after the transition. Figure 4 shows an example of the correspondence of the three neural response types with different parameter values.
In each graph, the regions for the flexible, conservative, and reverse types are drawn in light gray, gray, and dark gray, respectively, and are determined by λ and a for fixed N_max and N (or N/N_max) (in other words, we set N1 = N_max, say, and then Nj = N (j ≠ 1) for simplicity; see the figure legend). In this figure, when N_max is large, there is a small common input M, and when N/N_max becomes smaller, the preferred direction input gets larger relative to the other direction inputs. There is a nonlinear effect determining the region of each response type. Depending on N_max and N, different values of a and λ provide each response type. We note that the example shown in Figure 4 is just one example, chosen to indicate the coexistence of the three types under some parameter regime. For example, there can be a parameter regime where only the flexible and reverse types, but not the conservative type, exist. The analysis in this study can predict possible changes in the self-organization of neural responses when inhibitory effects are altered. For example, decreasing λ leads neurons to lose their directionally selective responses of the flexible and conservative types (as exemplified in Figure 4). Decreasing λ can be achieved in several ways, for example, by decreasing the inhibitory input (x0). We await experimental examination of this issue.
[Figure 4: four panels showing the (a, λ) parameter regions of the three response types, for (N_max, N/N_max) = (0.3, 0.7), (0.3, 0.3), (0.7, 0.7), and (0.7, 0.3).]
Figure 4: Examples of correspondences of the three neural response types with different parameter regions. Flexible, conservative, and reverse type regions are drawn in light gray, gray, and dark gray, respectively. Graphs are generated by assuming N_max > N′_max = N_min = N′_min = N (in other words, given Ni = N_max, Nj = N (j ≠ i)) and a = a0, and by normalizing the conditions by M + N_max = 1. Hence, we reduced the three conditions in Table 1 to those of the following parameters: λ, a, N_max, N. In each graph, the regions of each type are drawn with respect to λ and a for fixed N_max and N (or N/N_max).
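The reduction used for Figure 4 (a = a0, Nj = N for the nonpreferred directions, and the normalization M + N_max = 1) can be reproduced by classifying the points of an (a, λ) grid. This sketch restates the Table 1 conditions under those assumptions; the grid ranges and resolutions are arbitrary choices:

```python
import numpy as np

def classify(lam, a, nmax, n):
    """Table 1 under the Figure 4 reduction: a = a0, N_1 = N_max,
    N_j = N (j != 1, with N <= N_max), and M = 1 - N_max."""
    m = 1.0 - nmax
    if a > 0:
        if m + nmax / 2 <= lam < min(m * (1 + a), (m + n) * (1 + a)):
            return "flexible"
        if (n < nmax
                and m + n / 2 <= lam < min(m * (1 + a), (m + n / 2) * (1 + a),
                                           m + nmax / 2)):
            return "conservative"
    if a < 0:
        if (m + nmax / 4) * (1 + a) <= lam < m:
            return "reverse"
    return "none"

# One panel's setting: N_max = 0.3, N/N_max = 0.7.
nmax, n = 0.3, 0.21
types = {classify(lam, a, nmax, n)
         for a in np.linspace(-1.0, 1.0, 41)
         for lam in np.linspace(0.0, 2.0, 201)}
print(sorted(types))  # all three types coexist in this regime
```

Sweeping the other panel settings in the same way reproduces the qualitative layout of the shaded regions.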
4 Discussion

This study theoretically investigated, as a model of striatal neurons in the basal ganglia, a simple self-organization neuron model under the modulation of reinforcement signals. The model is motivated by experimental observations: directionally selective responses of the striatal neurons, abundant reinforcement signals carried by dopaminergic neurons, a rich source of inhibitory neurons in the striatum, and a vast convergence of the cortico-basal ganglia projection. By theoretical analysis, the model is shown to explain, in a unified manner, how seemingly different neural response patterns observed in experiments (Kawagoe et al., 1998) can emerge from the same model with different parameter values. Owing to the choice of a simple model, our analysis can explicitly relate each response type to two factors, the reinforcement signal (a) and the inhibitory effect (λ), in conjunction with the magnitudes and the proximity of the cortical input representations (M, Ni).
Various types of self-organization rules have been proposed to investigate mainly cerebral cortical formation (von der Malsburg, 1973; Amari & Takeuchi, 1978; Willshaw & von der Malsburg, 1979; Bienenstock, Cooper, & Munro, 1982; Obermayer, Ritter, & Schulten, 1990; Földiák, 1991). Experimental support has been largely obtained from the developmental formation of neural circuits in the sensory cortices (Wiesel & Hubel, 1963; LeVay, Stryker, & Shatz, 1978) and from the reorganization of the sensory cortical maps (Buonomano & Merzenich, 1988). The BCM theory in particular has been a very successful model of developmental formation (Bienenstock et al., 1982; Kirkwood, Rioult, & Bear, 1996). All of these rules use the Hebbian synapse as a mathematical core, although controversy remains in the details. Their differences lie in the way each stabilizes learning, such as competition among synapses (von der Malsburg, 1973), a floating threshold (Bienenstock et al., 1982), or others. Differently from the others, the modifiability of inhibitory synapses (Amari & Takeuchi, 1978) plays a fundamental role in regulating the self-organization in our model (see Figure 2C). Without the reinforcement signal a in equation 2.1, previous studies (Takeuchi & Amari, 1979; Amari, 1980, 1983) have indicated that this inhibitory synapse modifiability, along with the excitatory synapse modifiability, allows efficient input representations, such as a receptive field self-organized with respect to various features of environmental inputs, the formation of topographic maps, patch structures, and so on. Hence, the current model with the reinforcement signal a can be expected to construct a cortical input representation effectively modulated by reinforcement signals, although a rigorous study remains to be done. This property is useful in the face of the strong anatomical convergence in the cortico-basal projections (Oorschot, 1996).
Any divergence of the corticostriatal projections within this convergence (Parthasarathy, Schall, & Graybiel, 1992; Graybiel et al., 1994) may provide further combinatorial benefits for the efficiency of the self-organization. Generally, it is important to construct efficient representations of the state information (inputs) under the modulation of reinforcement signals (Dayan, 1991). Notable experiments have indicated that such a remapping under the influence of a diffuse neuromodulator occurs in the sensory cortex (Bakin & Weinberger, 1996; Kilgard & Merzenich, 1998; Sachdev, Lu, Wiley, & Ebner, 1998). Our model may serve as a primitive model of this issue. Abundant inhibitory sources exist in the striatum (Kawaguchi et al., 1995). Plasticity of inhibitory synapses has been found in the hippocampus (Nusser, Hajos, Somogyi, & Mody, 1998), the cerebellum (Kano, Rexhausen, Dreessen, & Konnerth, 1992), the cerebral cortex (Komatsu & Iwakiri, 1993; Komatsu, 1996), and other areas (Kano, 1995); recent findings have suggested an important role of inhibitory connections in self-organization even in the cortex, for example, the early visual cortex (Hensch et al., 1998; Fagiolini & Hensch, 2000). To our knowledge, there is no direct evidence of modifiable inhibitory synapses in the striatum; this remains an important topic for future study. We have not specified a neural origin of the inhibitory effects. According to experimental findings, this origin could be either inhibitory interneurons (Jaeger et al., 1994; Bennett & Bolam, 1994; Koós & Tepper, 1999) or the collaterals of projection neurons (Groves, 1983; Wickens, 1993). If
Self-Organization in the Basal Ganglia
837
the former is the case, an inhibitory effect may work in a feedforward manner under the influence of cortical projections. If the latter is the case, an inhibitory effect may work as feedback and mutual inhibition, possibly in a manner similar to a winner-take-all mechanism (Groves, 1983; Wickens, 1993). It is possible to extend the current analysis with recurrent connections; in this perspective, this study can be regarded as an extension of the winner-take-all mechanism. The current analysis of our model, however, treats the effect of modifiable excitatory and inhibitory weights in a feedforward manner, because we do not commit ourselves to specifying the inhibitory effect as the collaterals. In this sense, our model shares the feature of a reward-modulated feedforward network with previous work by Barto and his colleagues (Barto, Sutton, & Brouwer, 1981; Barto, 1985). A third possibility is that inhibitory interneurons and the collaterals of projection neurons (Groves, 1983; Wickens, 1993) jointly work as an inhibitory source (Wickens, 1997). A recent demonstration of a weak collateral interaction between projection neurons also suggests this possibility (Tunstall, Kean, Wickens, & Oorschot, 2001). In the future, it will be important to constrain our model parameters with the experimentally estimated quantitative nature of these inhibitory sources (Jaeger et al., 1994; Wickens, 1997; Tunstall et al., 2001) as well as of the cortical input magnitudes and proximity (Oorschot, 1996; Kincaid et al., 1998). Generally, the DA modulation of caudate neurons can be considered in two aspects: the synaptic efficacy (Calabresi et al., 1992; Wickens & Kötter, 1995) and the response property (Nicola et al., 2000). Caudate response changes are much slower than DA response changes over trials in one 1DR block. In addition, caudate response changes possibly occur beyond the DA phasic response in the time course of a single trial. 
Hence, we considered that the DA-modulated synaptic efficacy is involved in the emergence of the three response types and investigated how DA-modulated synaptic plasticity can lead to these response types, while we considered the DA effect on the caudate firing rates to be complementary (see section 3.1). Yet this DA effect on the firing rates may play a larger role in accounting for the three response types (Gruber, 2000). Enhanced caudate activities long after the DA phasic response can be due, at least partially, to a prolonged DA effect that possibly lasts longer than the period of the phasic DA response (Gonon, 1997; Durstewitz, Seamans, & Sejnowski, 2000). The combined effect of D1 and D2 receptors on the caudate firing rates may further help shape the nature of the three response types (Gerfen, 1992; Cepeda et al., 1993; Nicola et al., 2000). How the two DA modulations, on the synaptic efficacy and on the response property, are integrated remains to be investigated. The dynamical aspect of the interaction between these modulations, possibly with the regulation of different DA receptors, is of particular interest because their timescales presumably are different. Finally, in our model, we treated only the current reinforcement signal, not delayed ones. To enjoy the full power of reinforcement learning, it is important to extend our model to include delayed-reinforcement signals (Barto, 1995; Houk, Adams, & Barto, 1995; Montague, Dayan, & Sejnowski, 1996; Berns & Sejnowski, 1998; Schultz, Dayan, & Montague, 1997; Trappenberg, Nakahara, & Hikosaka, 1998; Nakahara, Trappenberg, Hikosaka, Kawagoe, & Takikawa, 1998; Monchi
& Taylor, 1999; Hikosaka et al., 1999). For example, it is possible to maintain the same response property in our model even when the magnitude of $a$ is reduced, provided the receptive region $R$ stays the same, with $K(x) > 0$ for $x \in R$ and $K(x) < 0$ for $x \notin R$ (see section 2.2). This kind of property is useful in transferring the effect of a delayed-reinforcement signal to a preceding stimulus. We are currently investigating this issue.

Appendix

A.1 Proof of Theorem 2. Recall that the order of the 1DR blocks is randomized in the experiment. Provided that the direction of $x_1$ is the most preferred direction of the neuron, we need to consider how the behavior of a neuron changes in the three cases of transition of reward directions over 1DR blocks: (1) from the 1DR block where the reward direction is $x_1$ to another, say $x_2$; (2) from a 1DR block where the reward direction is $x_j$ ($j = 2, 3, 4$), say $x_2$, to another 1DR block where the reward direction is not $x_1$, say $x_3$; and (3) from a 1DR block where the reward direction is not $x_1$, say $x_3$, to the 1DR block where the reward direction is $x_1$. Note also that, as in the proof of theorem 1, we assume that the neuron is excited by $x_1$ initially. We begin with the condition guaranteeing that the neuron is responsive only to $x_1$ at the equilibrium. In this case, the receptive region is $R_1 = \{x_1\}$, and hence $p_{R_1} = \frac{1}{4}$ and $w^R_1 = x_1$. In equation 2.12, we require $K(x_1) > 0$ and $K(x_i) \le 0$ ($i \ne 1$), which yield $M \le \lambda < (M + N_1)(1 + a')$. In the equilibrium states of the other three 1DR blocks (here, we treat the case where $x_2$ is the reward direction), the receptive region should be $R_2 = \{x_1, x_2\}$. In this case, $p_{R_2} = \frac{1}{2}$ and $w^R_2 = \frac{1}{2}(x_1 + x_2)$. We require $K(x_i) > 0$ ($i = 1, 2$) and $K(x_i) \le 0$ ($i = 3, 4$), which leads to the following condition:
$$M \le \lambda < \min\left\{ M + \tfrac{1}{2}N_1,\; M + \tfrac{1}{2}N_2 \right\}(1 + a').$$
As for the condition guaranteeing the transition from $R = \{x_1\}$ to $R = \{x_1, x_2\}$ as the reward direction changes, the neuron should respond to $x_2$ in addition to $x_1$ once the reinforcement signal accompanies $x_2$. Hence, we need $w^R_1 \cdot x_2 - \frac{\lambda}{1 + a} > 0$, which is equivalent to $\lambda < M(1 + a)$. For case 2, the neuron has to respond to $x_3$ at the beginning of the next block, that is, $w^R_2 \cdot x_3 - \frac{\lambda}{1 + a} > 0$, which is equivalent to $\lambda < M(1 + a)$. When this condition is satisfied, the neuron responds to all of $x_1, x_2, x_3$ when a new 1DR block starts. However, $R_3 = \{x_1, x_2, x_3\}$ is not stable under a certain condition, as stated in the following. In this case, competition should occur among these three inputs so that the neural response to the nonrewarded input $x_2$ becomes silent. For $R_3 = \{x_1, x_2, x_3\}$, $p_{R_3} = \frac{3}{4}$ and $w^R_3 = \frac{1}{3}(x_1 + x_2 + x_3)$. Therefore, the response
to $x_2$ becomes silent when $w^R_3 \cdot x_2 - \lambda \le 0$, or equivalently,

$$M + \tfrac{1}{3}N_2 \le \lambda.$$
In case 3, the neuron responds to both $x_1$ and $x_3$ in the first block, and in the next block, the neuron changes to respond only to $x_1$. To ensure this requirement, we impose the condition under which the equilibria of $R_4 = \{x_1, x_3\}$ in the next block are not stable, which is written as

$$M + \tfrac{1}{2}N_3 \le \lambda.$$
By summing up all these conditions and taking all combinations of 1DR blocks into account, we obtain
$$M + \frac{N'_{\max}}{2} \;\le\; \lambda \;<\; \min\left\{\, M(1 + a),\; \left(M + \frac{N'_{\min}}{2}\right)(1 + a'),\; \left(M + \frac{N_1}{2}\right)(1 + a') \,\right\},$$

where $N'_{\max} = \max_{i \in I'} N_i$, $N'_{\min} = \min_{i \in I'} N_i$, and $I' = \{2, 3, 4\}$.
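As a sanity check on the proof, the individual conditions collected above (equilibrium, transition, and silencing) can be intersected numerically. The values of $M$, $N_i$, $a$, and $a'$ below are hypothetical, chosen only to illustrate that the admissible window for $\lambda$ can be nonempty; the model itself does not fix them.

```python
# Hypothetical parameter values (not from the paper).
M = 1.0
N = {1: 0.8, 2: 0.5, 3: 0.6, 4: 0.7}
a, a_prime = 0.6, 0.4
I_prime = (2, 3, 4)

# Lower bounds: equilibrium of R1 (M <= lambda) plus the silencing
# conditions M + N_j/3 <= lambda (case 2) and M + N_j/2 <= lambda (case 3).
lower = max([M]
            + [M + N[j] / 3 for j in I_prime]
            + [M + N[j] / 2 for j in I_prime])

# Upper bounds: the transition condition lambda < M(1 + a), the
# equilibrium of R1, and the equilibrium of each R_j = {x1, x_j}.
upper = min([M * (1 + a), (M + N[1]) * (1 + a_prime)]
            + [min(M + N[1] / 2, M + N[j] / 2) * (1 + a_prime) for j in I_prime])

assert lower < upper   # the admissible window for lambda is nonempty
print(f"lambda must lie in [{lower:.3f}, {upper:.3f})")
```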
A.2 Proof of Theorem 3. It is sufficient to consider the condition for equilibrium states in one 1DR block and for the transition from one block to another. Suppose that the reward direction is $x_1$. A reverse-type neuron should have $R = \{x_2, x_3, x_4\}$ as the equilibrium. This implies $p_R = \frac{3}{4}$ and $w^R_1 = \frac{1}{3}(x_2 + x_3 + x_4)$. In equation 2.12, we further require $K(x_1) \le 0$ and $K(x_i) > 0$ ($i \ne 1$), which are equivalently rewritten together as

$$M(1 + a') \le \lambda < M + \tfrac{1}{3}N_i, \quad i = 2, 3, 4.$$
At the start of the next block (where the reward direction is, say, $x_2$), we first require that the neuron start to respond to $x_1$, to which the reinforcement signal $a$ is no longer given. This requires $w^R_1 \cdot x_1 - \lambda > 0$, or equivalently, $\lambda < M$. Provided that this condition is satisfied, the neuron responds to all of the directions in this block. Hence, we impose another condition under which $R = \{x_1, x_2, x_3, x_4\}$ is not a stable equilibrium, so that the neuron stops firing to the input $x_2$. Given $p_R = 1$ and $w^R_1 = \frac{1}{4}(x_1 + x_2 + x_3 + x_4)$, this condition can be rewritten as

$$w^R \cdot x_2 - \frac{1}{1 + a'}\,\lambda \le 0,$$

or equivalently,

$$\left(M + \tfrac{1}{4}N_2\right)(1 + a') \le \lambda.$$
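The conditions assembled for the reverse-type neuron can also be checked numerically. The sketch below uses hypothetical values (not from the paper); note that $\lambda < M$ together with $M(1 + a') \le \lambda$ can hold only if $a' < 0$, so a negative $a'$ is chosen here for illustration.

```python
# Hypothetical parameter values (not from the paper) checking that the
# conditions in the proof of Theorem 3 are mutually consistent.
M = 1.0
N = {2: 0.6, 3: 0.6, 4: 0.6}
a_prime = -0.3   # must be negative for the window below to be nonempty

lower = max(M * (1 + a_prime),                  # equilibrium of R = {x2, x3, x4}
            (M + N[2] / 4) * (1 + a_prime))     # silencing of x2 in the next block
upper = min([M]                                 # transition: respond to x1 again
            + [M + N[i] / 3 for i in (2, 3, 4)])   # equilibrium upper bounds

assert lower < upper   # the admissible window for lambda is nonempty
print(f"lambda must lie in [{lower:.3f}, {upper:.3f})")
```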
Since the equilibrium condition of this block is the same as that of the first block, summing up the above conditions leads to the theorem.
Acknowledgments We thank R. Kawagoe and Y. Takikawa for providing us with experimental details and H. Itoh for his technical assistance. H. N. is supported by grants-in-aid 13210154 from the Ministry of Education, Science, Sports and Culture.
References Alexander, G. E., Crutcher, M. D., & DeLong, M. R. (1990). Basal ganglia-thalamocortical circuits: Parallel substrates for motor, oculomotor, "prefrontal" and "limbic" functions. In H. B. M. Uylings, C. G. Van Eden, J. P. C. De Bruin, M. A. Corner, & M. G. P. Feenstra (Eds.), Progress in brain research (Vol. 85, pp. 119–146). Amsterdam: Elsevier. Amari, S. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185. Amari, S. (1980). Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42, 339–364. Amari, S. (1983). Field theory of self-organizing neural nets. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(9 & 10), 741–748. Amari, S., & Takeuchi, A. (1978). Mathematical theory on formation of category detecting nerve cells. Biological Cybernetics, 29, 127–136. Aosaki, T., Tsubokawa, H., Ishida, A., Watanabe, K., Graybiel, A. M., & Kimura, M. (1994). Responses of tonically active neurons in the primate's striatum undergo systematic changes during behavioral sensorimotor conditioning. J. Neurosci., 6, 3969–3984. Bakin, J. S., & Weinberger, N. M. (1996). Induction of a physiological memory in the cerebral cortex by stimulation of the nucleus basalis. Proceedings of the National Academy of Sciences, 93, 11219–11224. Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuronlike computing elements. Human Neurobiology, 4, 229–256. Barto, A. G. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 215–232). Cambridge, MA: MIT Press. Barto, A. G., Sutton, R. S., & Brouwer, P. S. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40, 201–211. Bennett, B. D., & Bolam, J. P. (1994). Synaptic input and output of parvalbumin-immunoreactive neurons in the neostriatum of the rat. Neuroscience, 62, 707–719. Berns, G. 
S., & Sejnowski, T. J. (1998). A computational model of how the basal ganglia produce sequences. Journal of Cognitive Neuroscience, 10(1), 108–121. Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48. Buonomano, D. V., & Merzenich, M. M. (1998). Cortical plasticity: From synapses to maps. Annual Review of Neuroscience, 21, 149–186.
Calabresi, P., Maj, R., Pisani, A., Mercuri, N. B., & Bernardi, G. (1992). Long-term synaptic depression in the striatum: Physiological and pharmacological characterization. Journal of Neuroscience, 12, 4224–4233. Cepeda, C., Buchwald, N. A., & Levine, M. S. (1993). Neuromodulatory actions of dopamine in the neostriatum are dependent upon the excitatory amino acid receptor subtypes activated. Proceedings of the National Academy of Sciences of the United States of America, 90, 9576–9580. Dayan, P. (1991). Navigating through temporal difference. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 464–470). San Mateo, CA: Morgan Kaufmann. Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. Journal of Neurophysiology, 83, 1733–1750. Fagiolini, M., & Hensch, T. K. (2000). Inhibitory threshold for critical-period activation in primary visual cortex. Nature, 404(6774), 183–186. Flaherty, A. W., & Graybiel, A. M. (1994). Input-output organization of the sensorimotor striatum in the squirrel monkey. J. Neurosci., 2, 599–610. Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200. Gerfen, C. R. (1992). The neostriatal mosaic: Multiple levels of compartmental organization. Trends in Neurosciences, 15(4), 133–139. Gonon, F. (1997). Prolonged and extrasynaptic excitatory action of dopamine mediated by D1 receptors in the rat striatum in vivo. Journal of Neuroscience, 17, 5972–5978. Graybiel, A. M. (1995). Building action repertoires: Memory and learning functions of the basal ganglia. Current Opinion in Neurobiology, 5, 733–741. Graybiel, A. M., Aosaki, T., Flaherty, A., & Kimura, M. (1994). The basal ganglia and adaptive motor control. Science, 265, 1826–1831. Groves, P. M. (1983). 
A theory of the functional organization of the neostriatum and the neostriatal control of voluntary movement. Brain Research, 286(2), 109–132. Gruber, A. (2000). A computational study of D1 induced modulation of medium spiny neuron response properties. Unpublished master's thesis, Northwestern University. Hensch, T. K., Fagiolini, M., Mataga, N., Stryker, M. P., Baekkeskov, S., & Kash, S. F. (1998). Local GABA circuit control of experience-dependent plasticity in developing visual cortex. Science, 282(5393), 1504–1508. Hikosaka, O., Nakahara, H., Rand, M. K., Sakai, K., Lu, X., Nakamura, K., Miyachi, S., & Doya, K. (1999). Parallel neural networks for learning sequential procedures. Trends in Neurosciences, 22(10), 464–471. Hikosaka, O., Sakamoto, M., & Usui, S. (1989). Functional properties of monkey caudate neurons. II. Visual and auditory responses. Journal of Neurophysiology, 61, 799–813. Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press.
Jaeger, D., Kita, H., & Wilson, C. J. (1994). Surround inhibition among projection neurons is weak or nonexistent in the rat neostriatum. Journal of Neurophysiology, 72(5), 2555–2558. Kano, M. (1995). Plasticity of inhibitory synapses in the brain: A possible memory mechanism that has been overlooked. Neuroscience Research, 21, 177–182. Kano, M., Rexhausen, U., Dreessen, J., & Konnerth, A. (1992). Synaptic excitation produces a long-lasting rebound potentiation of inhibitory synaptic signals in cerebellar Purkinje cells. Nature, 356, 601–604. Kawagoe, R., Takikawa, Y., & Hikosaka, O. (1998). Expectation of reward modulates cognitive signals in the basal ganglia. Nature Neuroscience, 1(5), 411–416. Kawagoe, R., Takikawa, Y., & Hikosaka, O. (1999). Change in reward-predicting activity of monkey dopamine neurons: Short-term plasticity. Society for Neuroscience Abstracts, 25, 1162. Kawaguchi, Y., Wilson, C. J., Augood, S. J., & Emson, P. C. (1995). Striatal interneurones: Chemical, physiological and morphological characterization. Trends in Neurosciences, 18(12), 527–535. Kilgard, M. P., & Merzenich, M. M. (1998). Cortical map reorganization enabled by nucleus basalis activity. Science, 279, 1714–1718. Kincaid, A. E., Zheng, T., & Wilson, C. J. (1998). Connectivity and convergence of single corticostriatal axons. Journal of Neuroscience, 18(12), 4722–4731. Kirkwood, A., Rioult, M. C., & Bear, M. F. (1996). Experience-dependent modification of synaptic plasticity in visual cortex. Nature, 381(6582), 526–528. Kita, K. (1993). GABAergic circuits of the striatum. In G. W. Arbuthnott & P. C. Emson (Eds.), Chemical signalling in the basal ganglia (pp. 51–72). Amsterdam: Elsevier. Knopman, D., & Nissen, M. J. (1991). Procedural learning is impaired in Huntington's disease: Evidence from the serial reaction time task. Neuropsychologia, 29(3), 245–254. Knowlton, B. J., Mangels, J. A., & Squire, L. R. (1996). A neostriatal habit learning system in humans. Science, 273, 1399–1402. 
Komatsu, Y. (1996). GABA(B) receptors, monoamine receptors, and postsynaptic inositol trisphosphate-induced Ca2+ release are involved in the induction of long-term potentiation at visual cortical inhibitory synapses. Journal of Neuroscience, 16(20), 6342–6352. Komatsu, Y., & Iwakiri, M. (1993). Long-term modification of inhibitory synaptic transmission in developing visual cortex. NeuroReport, 4(7), 907–910. Koós, T., & Tepper, J. M. (1999). Inhibitory control of neostriatal projection neurons by GABAergic interneurons. Nature Neuroscience, 2(5), 467–472. LeVay, S., Stryker, M. P., & Shatz, C. J. (1978). Ocular dominance columns and their development in layer IV of the cat's visual cortex: A quantitative study. Journal of Comparative Neurology, 179(1), 223–244. Marsden, C. D. (1980). The enigma of the basal ganglia and movement. Trends in Neurosciences, pp. 284–287. Monchi, O., & Taylor, J. G. (1999). A hard wired model of coupled frontal working memories for various tasks. Information Sciences, 113(3), 221–243.
Montague, R., Dayan, P., & Sejnowski, T. J. (1996). Framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947. Nakahara, H., Doya, K., & Hikosaka, O. (2001). Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuo-motor sequences—a computational approach. Journal of Cognitive Neuroscience, 13(5), 626–647. Nakahara, H., Trappenberg, T., Hikosaka, O., Kawagoe, R., & Takikawa, Y. (1998). Computational analysis on reward-modulated activities of caudate neurons. Society for Neuroscience Abstracts, 28, 1651. Nicola, S. M., Surmeier, J., & Malenka, R. C. (2000). Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens. Annual Review of Neuroscience, 23, 185–215. Nusser, Z., Hajos, N., Somogyi, P., & Mody, I. (1998). Increased number of synaptic GABA(A) receptors underlies potentiation at hippocampal inhibitory synapses. Nature, 395(6698), 172–177. Obermayer, K., Ritter, H., & Schulten, K. (1990). A principle for the formation of the spatial structure of cortical feature maps. Proceedings of the National Academy of Sciences, 87(21), 8345–8349. Oorschot, D. E. (1996). Total number of neurons in the neostriatal, pallidal, subthalamic, and substantia nigral nuclei of the rat basal ganglia: A stereological study using the Cavalieri and optical disector methods. Journal of Comparative Neurology, 366, 580–599. Parthasarathy, H. B., Schall, J. D., & Graybiel, A. M. (1992). Distributed but convergent ordering of corticostriatal projections: Analysis of the frontal eye field and the supplementary eye field in the macaque monkey. Journal of Neuroscience, 12, 4468–4488. Redgrave, P., Prescott, T. J., & Gurney, K. (1999). Is the short-latency dopamine response too short to signal reward error? Trends in Neurosciences, 22(4), 146–151. Reynolds, J. N. J., & Wickens, J. R. (2000). 
Substantia nigra dopamine regulates synaptic plasticity and membrane potential fluctuations in the rat neostriatum, in vivo. Neuroscience, 99, 199–203. Sachdev, R. N. S., Lu, S.-M., Wiley, R. G., & Ebner, F. F. (1998). Role of the basal forebrain cholinergic projection in somatosensory cortical plasticity. Journal of Neurophysiology, 79, 3216–3228. Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1–27. Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13(3), 900–913. Schultz, W., Apicella, P., Romo, R., & Scarnati, E. (1995). Context-dependent activity in primate striatum reflecting past and future behavioral events. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 11–27). Cambridge, MA: MIT Press. Schultz, W., Dayan, P., & Montague, R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schultz, W., Romo, R., Ljungberg, T., Mirenowicz, J., Hollerman, J. R., & Dickinson, A. (1995). Reward-related signals carried by dopamine neurons. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 231–248). Cambridge, MA: MIT Press. Surmeier, D. J., Song, W.-J., & Yan, Z. (1996). Coordinated expression of dopamine receptors in neostriatal medium spiny neurons. Journal of Neuroscience, 16, 6579–6591. Takeuchi, A., & Amari, S. (1979). Formation of topographic maps and columnar microstructures. Biological Cybernetics, 35, 63–72. Takikawa, Y., Kawagoe, R., & Hikosaka, O. (2001). Reward-dependent spatial selection of anticipatory activity in monkey caudate neurons. Manuscript submitted for publication. Trappenberg, T., Nakahara, H., & Hikosaka, O. (1998). Modeling reward dependent activity pattern of caudate neurons. In International Conference on Artificial Neural Networks (ICANN98) (pp. 973–978). Skövde, Sweden. Tunstall, M. J., Kean, A., Wickens, J. R., & Oorschot, D. (2001). Inhibitory interaction between spiny projection neurons of the striatum: A physiological and anatomical study. In Abstracts for International Basal Ganglia Society VIIth International Triennial Meeting (p. 38). Waitangi, New Zealand. von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100. Wickens, J. (1993). A theory of the striatum. Oxford: Pergamon Press. Wickens, J. (1997). Basal ganglia: Structure and computations. Network: Computation in Neural Systems, 8, R77–R109. Wickens, J., & Kötter, R. (1995). Cellular models of reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 187–214). Cambridge, MA: MIT Press. Wiesel, T. N., & Hubel, D. H. (1963). Single-cell responses in striate cortex of kittens deprived of vision in one eye. Journal of Neurophysiology, 26, 1003–1017. Willshaw, D. 
J., & von der Malsburg, C. (1979). A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Philos. Trans. R. Soc. Lond. B. Biol. Sci., 287(1021), 203–243. Wilson, C. J. (1998). Basal ganglia. In G. M. Shepherd (Ed.), The synaptic organization of the brain (4th ed., pp. 279–316). Oxford: Oxford University Press. Received August 30, 2000; accepted June 25, 2001.
LETTER
Communicated by Richard Zemel
Population Computation of Vectorial Transformations
Pierre Baraduc
[email protected]
Emmanuel Guigon
[email protected]
INSERM U483, Université Pierre et Marie Curie, 75005 Paris, France
Many neurons of the central nervous system are broadly tuned to some sensory or motor variables. This property allows one to assign to each neuron a preferred attribute (PA). The width of tuning curves and the distribution of PAs in a population of neurons tuned to a given variable define the collective behavior of the population. In this article, we study the relationship between the nature of the tuning curves, the distribution of PAs, and the computational properties of linear neuronal populations. We show that noise-resistant distributed linear algebraic processing and learning can be implemented by a population of cosine-tuned neurons assuming a nonuniform but regular distribution of PAs. We extend these results analytically to the case of noncosine tuning and uniform distribution and show with a numerical simulation that the results remain valid for a nonuniform regular distribution of PAs for broad noncosine tuning curves. These observations provide a theoretical basis for modeling general nonlinear sensorimotor transformations as sets of local linearized representations.
1 Introduction
Many problems of the nervous system can be cast in terms of linear algebraic calculus. For instance, changing the frame of reference of a vector is an elementary linear operation in the process of coordinate transformations for posture and movement (Soechting & Flanders, 1992; Redding & Wallace, 1997). More generally, coordinate transformations are nonlinear operations that can be linearized locally (Jacobian) and become a simpler linear problem (see the discussion in Bullock, Grossberg, & Guenther, 1993). Vectorial calculus is also explicitly or implicitly used in models of sensorimotor transformations for reaching and navigation (Grossberg & Kuperstein, 1989; Burnod, Grandguillaume, Otto, Ferraina, Johnson, & Caminiti, 1992; Touretzky, Redish, & Wan, 1993; Redish & Touretzky, 1994; Georgopoulos, 1996).
Neural Computation 14, 845–871 (2002)
© 2002 Massachusetts Institute of Technology
Although linear processing is only a rough approximation of the generally nonlinear computations in the nervous system, it is worth studying for at least two reasons (Baldi & Hornik, 1995): (1) it displays an unexpected wealth of behaviors, and (2) a thorough understanding of the linear regime is necessary to tackle nonlinear cases, for which general properties are difficult to derive analytically. The problem of the neural representation of vectorial calculus can be expressed in terms of two spaces: a low-dimensional space corresponding to the physical space of the task and a high-dimensional space defined by activities in a population of neurons (termed the neuronal space) (Hinton, 1992; Zemel & Hinton, 1995). In this framework, a desired operation in the physical space (a vectorial transformation) is translated into a corresponding operation in the neuronal space, the result of which can be taken back into the original space for interpretation. The goal of this article is to describe a set of mathematical properties of neural information processing that guarantee appropriate calculation of vectorial transformations by populations of neurons (i.e., that computations in the physical and neuronal spaces are equivalent). An appropriate solution relies on three mechanisms: a decoding-encoding method that translates information between the spaces, a mechanism that favors the stability of operations in the neuronal space, and an unsupervised learning algorithm that builds neuronal representations of physical objects. We will show that these mechanisms are closely related to common properties of neural computation and learning: the distribution of tuning selectivities in the population of neurons and the width of the tuning curves, the pattern of lateral connections between the neurons, and the distribution of input-output patterns used to build synaptic interactions between neuronal populations. 
In this article, we present a theory that unifies these three mechanisms and properties (generally considered separately; Mussa-Ivaldi, 1988; Sanger, 1994; but see Zhang, 1996; Pouget, Zhang, Deneve, & Latham, 1998) into a unique mathematical framework based on the neuronal population vector (PV; Georgopoulos, Kettner, & Schwartz, 1988) in order to explain how neuronal populations can perform vectorial calculus. In contrast to our extensive knowledge of the representation of information by populations of tuned neurons, little attention has been devoted to the learning processes in these populations. Here we show how Hebbian and unsupervised error-correcting rules can be used in association with lateral connections to allow the learning of linear maps on the basis of input-output correlations provided by the environment. In this context, we reveal a trade-off between the width of the tuning curves and the uniformity of the distribution of preferred directions. Finally, a statistical approach validates our hypotheses in realistic networks of a few thousand noisy neurons. A particular application of this theoretical framework is the computation of distributed representations of transpose or inverse Jacobian matrices, which play a central role in kinematic and dynamic transformations (Hinton, 1984;
Mussa-Ivaldi, Morasso, & Zaccaria, 1988; Crowe, Porrill, & Prescott, 1998). Recent results highlight the relevance of this theory to the understanding of the elaboration of directional motor commands for reaching movements (Baraduc, Guigon, & Burnod, 1999).

2 Notations and Definitions
In the following text, we consider a population of $N$ neurons. We write $\mathbf{E} = \mathbb{R}^N$ for the neuronal space and $\mathcal{E} = \mathbb{R}^D$ (typically $D = 2, 3$) for the physical space. Lowercase letters (e.g., $x$) are vectors of the neuronal space. Uppercase letters (e.g., $X$) are vectors of the physical space. Matrices are indicated by uppercase bold letters: roman for $\mathbf{E}$ (e.g., $\mathbf{M}$), calligraphic for $\mathcal{E}$ (e.g., $\mathcal{M}$), and italic for $D \times N$ matrices (e.g., $E$). A dot ($\cdot$) stands for the dot product in $\mathbf{E}$ or $\mathcal{E}$. Each neuron $j$ is tuned to a $D$-dimensional vectorial parameter; that is, it has a preferred attribute in $\mathcal{E}$, denoted $E_j$, and its firing rate is given by

$$x_j = f_j(X \cdot E_j, b_j), \quad (2.1)$$
where $f_j$ is the tuning function of the neuron, $X$ a unit vector of the physical space (Georgopoulos, Schwartz, & Kettner, 1986), and $b_j$ a vector of parameters. The assumption is made that the distributions of these parameters and the distribution of PAs are independent (Georgopoulos et al., 1988). In the particular case of cosine tuning, the firing rate of neuron $j$ is

$$x_j = X \cdot E_j + b_j, \quad (2.2)$$
where $b_j$ is the mean firing rate of the neuron (Georgopoulos et al., 1986). We write $E$ for the $D \times N$ matrix of vectors $E_j$. The PAs are considered either as a set of fixed vectors or as realizations of a random variable with a given distribution $P_E$ (in this latter case, the index $i$ is removed). The mean is denoted by $\langle \cdot \rangle$ and the variance by $\mathbb{V}$.

3 Cosine Tuning and Vectorial Processing in Neural Networks
As a simple case of distributed computation, in this section we derive conditions that are sufficient to represent and learn vectorial transformations between populations of cosine-tuned neurons. The case of other tuning functions will be treated later (section 4) in the light of this approach.

3.1 Encoding-Decoding Method: Distributed Representation of Vectors. Here we address the representation of vectors by distributed activity
patterns in populations of cosine-tuned neurons. We show that a condition on the distribution of preferred attributes is sufficient to faithfully recover information from the activity of the population. This condition is mathematically exact for populations of infinite size but still leads to accurate representations for populations of biologically reasonable size (e.g., > 10³).
848
Pierre Baraduc and Emmanuel Guigon
The firing frequency of the population in response to the presentation of a vector X of the physical space is x = Eᵀ X + b, where x and b are vectors in E (equation 2.2 written in matrix notation). Based on some hypotheses on E and b, the vector X can be decoded by computing a population vector (Georgopoulos et al., 1988; Mussa-Ivaldi, 1988; Sanger, 1994). The population vector can be defined by

X* = (1/N) Σ_i (x_i − b_i) E_i = (1/N) E (x − b).
A perfect reconstruction (X* ∝ X) is obtained if the PAs are such that (Mussa-Ivaldi, 1988; Sanger, 1994)

E Eᵀ ∝ I_D,    (3.1)
where I_D is the D × D identity matrix. In a population of neurons, the offset b could be deduced from the activity of the network over a sufficiently long period of time and subtracted via an inhibition mechanism (e.g., global inhibition if all neurons have the same mean firing rate). However, we will consider here the general case:

X* = Q X + (1/N) E b,
where Q = (1/N) E Eᵀ. We make the assumption that the components of the PAs have zero mean, are uncorrelated, and have equal variance σ_E² (regularity condition). From our hypothesis, the mean firing rates b are independent of the distribution of PAs. Then Q converges in probability toward σ_E² I_D (see section A.1). Using similar arguments, we can demonstrate that (1/N) E b converges in probability toward 0. In the following, we call a family of tuning properties {E_i, b_i} that satisfies the regularity condition a regular basis. We use the term basis to indicate that a regular family can be used as a basis, although it is not a basis in a mathematical sense. If X ∈ 𝔼, x = Eᵀ X + b is called the distributed representation of X, or simply a population code.

3.1.1 Finite N. The preceding equalities hold only in the limit N → +∞. To ascertain whether the proposed computational scheme has any relevance to biology, we need to quantify the distortions introduced when populations of finite size are used. Without loss of generality, we can suppose the input to be X = (1, 0, . . . , 0). The variance of the decoded output, normalized by 1/σ_E², is in this case
V( E x / (N σ_E²) ) = (1/(N² σ_E⁴)) V( E Eᵀ X ) = ( δ²/σ_E⁴, 1, . . . , 1 ) / N,

where δ² = V(E_{i1}²).
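These convergence and variance claims are easy to probe numerically. Below is a minimal sketch (NumPy; names such as `sigma2` and `X_star` are ours, and the unit-norm PAs are an illustrative assumption, not the paper's setup): it draws a regular basis, checks that Q is close to σ_E² I_D, and decodes a population code with the normalized population vector.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 1000

# A regular basis: PA components with zero mean, equal variance, uncorrelated.
# Random unit vectors (uniform on the sphere) satisfy this.
E = rng.standard_normal((D, N))
E /= np.linalg.norm(E, axis=0)
sigma2 = np.mean(E * E)                  # empirical component variance (about 1/D)

Q = E @ E.T / N                          # converges to sigma2 * I_D as N grows

X = np.array([1.0, 0.0, 0.0])            # physical vector to encode
b = rng.uniform(0.0, 1.0, N)             # arbitrary mean firing rates
x = E.T @ X + b                          # population code (eq. 2.2)
X_star = E @ (x - b) / (N * sigma2)      # normalized population vector

cos_err = np.clip(X_star @ X / np.linalg.norm(X_star), -1.0, 1.0)
angle_err = np.degrees(np.arccos(cos_err))
print(np.abs(Q - sigma2 * np.eye(D)).max(), angle_err)   # both small
```

Doubling N roughly halves the variance of the decoded components, consistent with the 1/N factor in the formula above.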
Population Computation of Vectorial Transformations
849
Is this variance small enough in practice? For a uniform distribution of PAs on a three-dimensional sphere, δ²/σ_E⁴ = 13/5, which results in an angular variance of less than 0.55 degree for N = 1000. For a distribution of PAs (of the same norm) clustered along the axes, δ²/σ_E⁴ = 2; hence an angular variance of less than 0.48 degree if N = 1000. This suggests that this encoding scheme is reasonably accurate with small populations of neurons. The regularity condition thus guarantees that encoded information can be recovered from actual firing rates with an arbitrary precision in a sufficiently large population of neurons. The regularity condition includes a zero-mean assumption for PA components, which is not used in Sanger (1994). Any departure from this requirement translates the output vectors by a constant amount, which needs to be small in practice. The zero-mean assumption is not a major constraint, since most experimentally measured distributions of selectivity are roughly symmetrical (see section 6). In this sense, our definition of regularity is more general than the previous ones (Mussa-Ivaldi, 1988; Sanger, 1994; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998), as it allows a proper probabilistic treatment when the mean firing rate is nonzero.

3.2 Distributed Representation of Linear Transformations. The preceding section has shown how a correspondence can be established between vectors in external space and population activity. In this section, we extend this correspondence to linear mappings and define the notion of input and output preferred attributes. Consider a linear map from 𝔼 to 𝔽, which are real physical vectorial spaces. Let M be its matrix on the canonical bases, and let E, F be regular bases of the corresponding neuronal spaces E and F (N_E and N_F neurons, respectively). We define
𝓜 = (1/(N_E N_F)) Fᵀ M E    (3.2)
as the matrix of the distributed linear map. In the limit N_E, N_F → +∞, and assuming that σ_E = σ_F = 1, we have Q_E = Q_F = I_D. Then 𝓜 operates on the distributed representations as M does in the original space. Let x be the distributed representation of a vector X ∈ 𝔼 (i.e., x = Eᵀ X). Taking Y = M X, we have 𝓜 x = Fᵀ M E Eᵀ X = Fᵀ (M X) = Fᵀ Y. Thus, y = 𝓜 x is the distributed representation of Y. If we assume that the vectorial input (resp. output) is represented by the collective activity of a population of neurons x_j (resp. y_i), and that a weight matrix 𝓜 links the input and output layers, then the network realizes the transformation M on the distributed vectors. It is immediate that F 𝓜 Eᵀ = M. Thus, the distributed map can be read using the classical population vector analysis.
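The distributed-map construction can be checked in the same spirit. A hedged sketch follows (the normalization by N σ² is our adaptation for unit-norm PAs; the paper assumes σ_E = σ_F = 1, and the rotation M is an arbitrary example of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 2000

def basis(n):
    """Random unit-norm preferred attributes: a regular basis."""
    B = rng.standard_normal((D, n))
    return B / np.linalg.norm(B, axis=0)

E, F = basis(N), basis(N)
sE2, sF2 = np.mean(E * E), np.mean(F * F)   # component variances (about 1/D)

M = np.array([[0.0, -1.0, 0.0],             # arbitrary physical map (rotation about z)
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

# Distributed map (eq. 3.2), normalized so it acts correctly on population codes.
M_dist = F.T @ M @ E / (N * sE2)

X = np.array([0.6, 0.8, 0.0])
x = E.T @ X                                 # encode the input
y = M_dist @ x                              # the network acts on the population code
Y_star = F @ y / (N * sF2)                  # decode the output population vector
print(Y_star)                               # close to M @ X = [-0.8, 0.6, 0]
```

Reading the map back as F 𝓜 Eᵀ (suitably normalized) recovers M in the same way.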
3.2.1 Finite N_E and N_F. As in the preceding section, it must be checked whether this distributed computation is still precise enough in the case of finite populations. To answer this question while keeping the derivations simple, we assume N_E = N_F = N, σ_E = σ_F = σ, and take the identity mapping for the transformation M. The variance of the (normalized) decoded output Y/σ² writes in this case:

V( F 𝓘 x / σ² ) = V( (1/(N² σ⁴)) F Fᵀ E Eᵀ X )
    = (1/σ⁸) [ V(Q_F) ⟨Q_E⟩² + ⟨Q_F⟩² V(Q_E) + V(Q_F) V(Q_E) ] X²
    = (2/(N σ⁴)) [ δ² I_D + σ⁴ (I_{D,D} − I_D) ] X² + ⋯,
where I_{m,n} is an m × n matrix of ones and the ellipsis stands for terms dominated by 1/N². Here the notation Q² means the matrix of ij-components Q_ij². For D = 3 and N = 1000, in the case of a uniform distribution, the preceding equation translates into an angular variance of 0.84 degree; in the clustered case, the variance is 0.74 degree. Our scheme of distributed computation is thus viable with small populations of neurons. Consequently, in the following sections, derivations will be made for infinite populations with E Eᵀ = I_D, which allows us to write equalities instead of proportionalities. We will thereby ignore the Nσ² term, except in the study of the effect of noise (see section 3.3.2). We will also assume that b = 0, which makes proofs more straightforward. The general case is considered in section A.3.

3.2.2 Selectivities of Output Units. In a network that computes y = 𝓜x, how can one characterize the behavior of an output unit that fires with rate y_i? This output unit i can be described by its intrinsic PA F_i in the output space 𝔽. However, this vector is independent of the mapping that occurs between the input and output spaces, and thus does not fully define the role of the unit. In fact, two vectors can be associated with the output unit i. The first is the vector of 𝔼 for which the unit is most active (input PA). Since unit i fires with input X as F_iᵀ M X, it is cosine tuned to the input, and its input PA is the column vector Mᵀ F_i. The second vector (output PA) is M† F_i, where M† is the Moore-Penrose inverse of M. In the case where the output layer is considered as a motor layer whose effects can be measured in the input space through M†, the output PA can be interpreted in an intuitive way. Indeed, the effect in sensory space of the isolated stimulation of unit i is precisely the vector M† F_i of 𝔼. Thus, the output PA corresponds to projective properties of the cell, while the input PA is related to receptive properties.
Note that in general, the input and output PAs of a unit do not coincide (Zhang et al., 1998).
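A small numerical example makes the distinction concrete. The map M below is an arbitrary, hypothetical choice; `input_pa` and `output_pa` follow the definitions Mᵀ F_i and M† F_i above:

```python
import numpy as np

# A non-orthogonal mapping, so that input and output PAs of a unit differ.
M = np.array([[2.0, 0.0],
              [0.0, 0.5]])

F_i = np.array([1.0, 1.0]) / np.sqrt(2)    # intrinsic PA of output unit i

input_pa  = M.T @ F_i                      # stimulus direction maximizing its firing
output_pa = np.linalg.pinv(M) @ F_i        # effect of stimulating the unit, in input space

cos_angle = input_pa @ output_pa / (
    np.linalg.norm(input_pa) * np.linalg.norm(output_pa))
print(cos_angle)   # < 1: the two PAs are not collinear (Zhang et al., 1998)
```

For an orthogonal M the two vectors would coincide, since then M† = Mᵀ up to a scale.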
3.2.3 Weight and Activity Profiles. The distributed representation 𝓜 has interesting structural properties. The transpose of the ith row of 𝓜 is (F_iᵀ M E)ᵀ = Eᵀ (Mᵀ F_i) ∈ Im Eᵀ. In the same way, the jth column of 𝓜 is Fᵀ (M E_j) ∈ Im Fᵀ. Thus, the profile of the weight rows (resp. columns) is identical to the profile of the input (resp. output) activities. Later we will consider the case where the entries of the matrix 𝓜 are activities rather than static weights. We show below that "cosine" lateral connections between rows (Eᵀ E) and columns (Fᵀ F) stabilize population codes in E and F, respectively. Thus, lateral connections can help to build an exact matrix of activities from an underspecified initial state.

3.3 Neuronal Noise and Stabilization of Representations. Noise has a strong impact on population coding (Salinas & Abbott, 1994; Abbott & Dayan, 1999). Therefore, it is important to understand how noise affects the reliability of our computational scheme. We will consider here two forms of noise: additive gaussian and Poisson noise.
3.3.1 Additive Gaussian Noise. Assume that a gaussian noise g is added to the population code x. How does this noise affect the encoding-decoding scheme, that is, how large is the variance of the decoded quantity? If g is independently distributed, we can show that the variance of the extra term due to the noise (Eg/N) is proportional to 1/N (see section A.2). Conversely, if the additive noise is correlated among neurons, as seems to be the case in experimental preparations (Gawne & Richmond, 1993; Zohary, Shadlen, & Newsome, 1994), it is easy to demonstrate that

V(Eg/N) = (1 − c) σ_g² σ_E² / N,    (3.3)
where σ_g² is the variance and c the correlation coefficient of the noise. Thus, for this correlated noise as for the uncorrelated one, the variance of the encoded quantity decreases with a 1/N factor. Besides, the decoding error decreases as a function of c, as does the minimum unbiased decoding error (Abbott & Dayan, 1999). In fact, the correlations act to decrease the total entropy of the system. The 1/N reduction of variance demonstrated for additive gaussian noise no longer holds with multiplicative noise; in such a case, an active (nonlinear) mechanism of noise control may be needed.

3.3.2 Poisson Noise. In the case of an uncorrelated Poisson noise, V(g_i) = x_i. It is straightforward to show that the variance of the noise term is smaller than x_max/N, where x_max is the highest firing rate in the population (see section A.2). Thus, as for the gaussian noise, the variance decreases linearly with the number of neurons. Correlations in the noise alter this behavior, and the variance becomes dominated by a term independent of N. This term
can be computed for a few special cases of PA distribution; for example, we have

V( (Eg/N)_i ) ≤ 0.035 c x_max    (3.4)

for a uniform 3D distribution and

V( (Eg/N)_i ) ≤ 0.22 c x_max    (3.5)
for PAs clustered along the 3D axes (see section A.2). A reduction of the variance in the correlated case is thus obtained through the distributed coding, even if scaling N does not result in any additional benefit. It can also be noted that uniform distributions of PAs seem more advantageous as far as noise issues are concerned. To sum up, for the two types of noise treated here, the variability in the decoded quantity is smaller than the variability affecting individual neurons. For gaussian or uncorrelated Poisson noise, using large populations of cells limits the noise in the decoded vectors even more, as the noise amplitude depends on 1/√N. This is not the case with correlated Poisson noise, and more powerful nonlinear methods could be employed (see, e.g., Zhang, 1996; Pouget et al., 1998).

3.3.3 Stabilizing Distributed Representations. The reduction of the noise in the decoded vector shown in the preceding sections can inspire ways to limit the noise inside a population. We show here that filtering the population activity through the matrix W_E = Eᵀ E / N has this desirable effect. Before proving this fact, we first note that W_E is the distributed representation of I_D in E (see equation 3.2). The matrix W_E is a projection of E (Strang, 1988). If we denote by E_p the image of E by W_E, then E_p is a D-dimensional subspace of E. Elements of E_p are population codes, since they can be written Eᵀ(E x₀), x₀ ∈ E. In fact, the operation of W_E is a decoding-reencoding process. As the variance of the decoded vector coordinates is smaller than the neuronal variance (preceding sections), we can expect from W_E good properties regarding noise control. To demonstrate them, we write W_E(x + g) = W_E x + Eᵀ E g / N. The term W_E x is in general different from x (except if x ∈ E_p), but it preserves part of the information on x, since the population vectors of x and W_E x are the same.
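The decoding-reencoding interpretation suggests a direct numerical check. In the sketch below (variable names are ours; unit-norm PAs, so W_E is normalized by N σ_E² rather than N), filtering through W_E removes most of an independent additive gaussian noise while leaving the decoded population vector essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 1000

E = rng.standard_normal((D, N))
E /= np.linalg.norm(E, axis=0)            # unit-norm preferred attributes
sE2 = np.mean(E * E)                      # component variance (about 1/D)

W = E.T @ E / (N * sE2)                   # lateral weights: decoding-reencoding

X = np.array([0.0, 1.0, 0.0])
x = E.T @ X                               # noise-free population code
noisy = x + 0.5 * rng.standard_normal(N)  # independent additive gaussian noise

filtered = W @ noisy

# W projects onto the D-dimensional subspace of population codes,
# so most of the N-dimensional noise is suppressed ...
noise_in  = np.linalg.norm(noisy - x)
noise_out = np.linalg.norm(filtered - x)
print(noise_out / noise_in)               # substantially below 1

# ... while the decoded population vector is essentially unchanged.
assert np.allclose(E @ filtered / (N * sE2),
                   E @ noisy / (N * sE2), atol=0.15)
```

The residual noise lives in the D-dimensional image of W, which is why its variance scales like 1/N rather than staying of order 1.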
The variance of W_E g is given by the diagonal of (Eᵀ E Q Eᵀ E)/N², where Q is the correlation matrix of the noise. Building on the results of the previous sections, it is easy to demonstrate that equations similar to equations 3.3, 3.4, and 3.5 apply. For additive gaussian noise, we find that

V(W_E g) = ( (1 − c) σ_g² σ_E² / N ) I_N.
Thus, the effect of W_E is to limit gaussian noise in the population. For Poisson noise, the formulas of section 3.3.2 generalize in the same way, leading to a decrease in the variance of the neuronal activity that is proportional to 1/N for uncorrelated noise and independent of N in the correlated case. Moreover, even if ⟨g⟩ ≠ 0, for independent noise we get ⟨W_E g⟩ = 0 in the limit N → +∞. This property, due to the fact that W_E has balanced weights, can be used to sort out the relevant information from a superposition of uncorrelated codes. The matrix W_E can be viewed as a weight matrix, either of feedforward connections between two populations of N_E neurons or of lateral interactions inside a population, and extracts the population code of any input pattern in a single step. However, if W_E deviates slightly from the definition, it is no longer a projection, and iterations of W_E are likely to diverge or fade. A simple way to prevent divergence is to use a saturating nonlinearity (e.g., a sigmoid). A more realistic solution is to adjust the shape of a nonsaturating nonlinearity to guarantee a stable behavior (Yang & Dillon, 1994; Zhang, 1996). In particular, an appropriate scaling of the gain of the neurons (maximum of the derivative of the nonlinearity) to the largest eigenvalue of W_E leads to the existence of a Lyapunov function for continuous network dynamics. If the distribution of PAs is uniform, W_E is a circulant matrix (Davis, 1979). Iterations of a circulant matrix can extract the first Fourier component of the input, provided the first Fourier coefficient of the matrix is greater than 1 and all other coefficients are strictly less than 1 (Pouget et al., 1998). Here, W_E corresponds to the special case where the first Fourier coefficient is 1 and all others are zero. We could as well consider W_F = Fᵀ F / N as a matrix of recurrent connections on the output layer to suppress noise on this layer.

3.4 Learning Distributed Representations of Linear Transformations.
Up to now, we have demonstrated that a correspondence between external and neural spaces could be established and maintained. This correspondence permits a faithful neural representation of external vectors and mappings. It remains to be shown whether a distributed representation of a linear mapping can be built from examples using a local synaptic learning rule. We prove below that it is indeed possible, provided the training examples satisfy part of the regularity condition.

3.4.1 Hebbian Learning of Linear Mappings. Let M be a linear transformation and (X^ν, Y^ν = M X^ν) be training pairs in 𝔼 × 𝔽, ν = 1, . . . , N_ex. Hebbian learning writes
M*_ij ∝ Σ_{ν=1}^{N_ex} y_i^ν x_j^ν,
where (x^ν, y^ν) are the distributed representations of the training samples. Then,

M* ∝ Fᵀ ( Σ_ν Y^ν (X^ν)ᵀ ) E = Fᵀ M ( Σ_ν X^ν (X^ν)ᵀ ) E ∝ Fᵀ M E

if the training examples satisfy

Σ_ν X^ν (X^ν)ᵀ ∝ I_{dim 𝔼}.    (3.6)
In this case, the matrix M* is proportional to the required matrix. Thus, any distributed linear transformation can be learned modulo a scaling factor by Hebbian associations between input and output activities if the components of the training inputs are uncorrelated and have equal variances (zero mean is not required). In practice, to control for the weight divergence implied by standard Hebbian procedures, the following stochastic rule is used:

ΔM*_ij ∝ (y_i^ν x_j^ν − M*_ij).    (3.7)
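A minimal simulation of the stochastic rule, equation 3.7, illustrates the result. The target map M and all parameter values below are arbitrary choices of ours; training inputs are drawn i.i.d. standard normal, so that equation 3.6 holds in expectation:

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, lr, steps = 3, 300, 0.02, 4000

def basis(n):
    """Random unit-norm preferred attributes: a regular basis."""
    B = rng.standard_normal((D, n))
    return B / np.linalg.norm(B, axis=0)

E, F = basis(N), basis(N)
sE2, sF2 = np.mean(E * E), np.mean(F * F)

M = rng.standard_normal((D, D))        # target physical map (hypothetical example)

W = np.zeros((N, N))                   # distributed map, to be learned
for _ in range(steps):
    X = rng.standard_normal(D)         # regular training input (eq. 3.6 in expectation)
    x, y = E.T @ X, F.T @ (M @ X)      # distributed training pair
    W += lr * (np.outer(y, x) - W)     # stochastic Hebbian rule (eq. 3.7)

# Read the learned map back in physical space: F W E^T, suitably normalized.
M_hat = F @ W @ E.T / (N * N * sE2 * sF2)
print(np.abs(M_hat - M).max())         # small reconstruction error
```

The decay term −M*_ij in the rule keeps the weights bounded and drives W toward the correlation ⟨y xᵀ⟩, which is exactly the distributed representation Fᵀ M E.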
3.4.2 Nonregular Distribution of Examples and Tuning Properties. Regularity may be a restricting condition in some situations. Distributions of PAs are not necessarily regular, or it may not be possible to guarantee that training examples are regularly distributed. This latter case can occur when learning a (generally ill-defined) linear mapping from samples of its inverse (Kuperstein, 1988; Burnod et al., 1992; Bullock et al., 1993). We denote by M₁ the inverse mapping. Training consists of choosing an output pattern y^ν, calculating the corresponding input pattern x^ν = Eᵀ M₁ F y^ν, and then using (x^ν, y^ν) as examples. If the y^ν are regular, Hebbian learning leads to the representation of M₁ᵀ but not M₁⁻¹ (or a generalized inverse if M₁ is singular or noninjective). An appropriate solution to this problem is obtained if the learning takes place only for the 𝓜_ij receiving maximal x input, that is, 𝓜_{i j_max(ν)}, where j_max(ν) = arg max_j x_j^ν. If the vectors x^ν have the same mean norm, we can assimilate x^ν whose largest coordinate is the jth to e_j (distributed representation of E_j). Then the jth column of 𝓜 writes

𝓜_{·j} = Σ_{x^ν = e_j} y^ν = Fᵀ ( Σ_{Y^ν ∈ M₁⁻¹(E_j)} Y^ν ).    (3.8)

It is clear that the latter sum is an element of M₁⁻¹(E_j). Section A.4 shows that when the F_i are regular, the sum converges toward M₁† E_j. The matrix 𝓜 is then exactly the distributed representation of the Moore-Penrose inverse of M₁. Informally, this winner-take-all learning rule works by equalizing
learning over input vectors, whatever their original distribution. In practice, a soft competitive approach can be used (e.g., to speed up the learning), but the proportion of winners must be kept low in the presence of strong anisotropies. It must be noted that this applies only if the vectors x^ν have the same norm on average. If this condition is not fulfilled, a correction by 1/‖x^ν‖² must be applied. This rule, developed in a Hebbian context, naturally extends to the parameter-dependent case.

3.4.3 Learning Parameter-Dependent Linear Mappings. We now treat the more general case where a linear mapping depends on a parameter. Typically, such a mapping arises as a local linear approximation (Jacobian) of a nonlinear transformation (see Bullock et al., 1993). Consider a nonlinear mapping y = φ(χ) (e.g., φ is the inverse kinematic transformation for an arm; χ are the cartesian coordinates of the arm end point and y the joint angles). Linearization around χ₀ gives ẏ = M(χ₀) χ̇, M being the Jacobian of φ. If the value y₀ = φ(χ₀) is given, the nonlinear mapping can be computed by incrementally updating y with ẏ = M χ̇ along any path starting at χ₀. Thus, the problem reduces to computing a parameter-dependent linear mapping, which can be written, using previous notations, as Y = M(P) X, where P is a parameter. We denote by 𝒫 the physical space of parameters and by P the space of the neuronal representation of parameters (e.g., 𝒫 is the two-dimensional space of joint angles and P can be a set of postural signals). A solution to this problem is to consider the coefficients 𝓜_ij corresponding to the distributed representation of M not as weights, but as activities of neurons modulated by the parameter P ∈ 𝒫, and to assume a multiplicative interaction between 𝓜_ij and x_j. In the simplest case where P modulates the coefficients linearly, this can be written
y = 𝓜 x  and  𝓜 = 𝓥 p  ( i.e., 𝓜_ij = Σ_k V_ijk p_k ),    (3.9)
where 𝓥 is a set of weights defined over E × F × P and p ∈ P. Then the mapping is learned by retaining, for each neuron of layer 𝓜, the relationship between the input p and the desired output M*_ij = Σ_ν y_i^ν x_j^ν. Thus, the weights 𝓥 can be obtained by

ΔV_ijk ∝ (y_i^ν x_j^ν − 𝓜_ij) p_k^ν,    (3.10)
which is a stochastic error-correcting learning rule. Contrary to the standard delta rule, equation 3.10 does not require an external teacher, as the reference signal is computed internally. Moreover, the connectivity V_ijk can be far from complete, as lateral connections between the 𝓜_ij units can help to form the desired activity profile (see section 3.3.3; Baraduc et al., 1999). Note that if
the parameter P is coded by a population of cosine-tuned neurons (i.e., p is a distributed representation of P), then equation 3.10 simplifies to a Hebbian rule:
ΔV_ijk ∝ y_i^ν x_j^ν p_k^ν.
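The gain-modulated scheme of equations 3.9 and 3.10 can be sketched for a simple parameter-dependent map: a 2D rotation, whose matrix entries are linear in p = (cos θ, sin θ). Everything below (network sizes, learning rate, the rotation example) is our illustrative choice, not the paper's simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, lr, steps = 2, 300, 0.05, 3000

def basis(n):
    """Random unit-norm preferred attributes: a regular basis."""
    B = rng.standard_normal((D, n))
    return B / np.linalg.norm(B, axis=0)

E, F = basis(N), basis(N)
sE2, sF2 = np.mean(E * E), np.mean(F * F)

def M_of(theta):
    """Parameter-dependent map: rotation by theta (entries linear in cos, sin)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

V = np.zeros((N, N, 2))                   # weights over output x input x parameter units
for _ in range(steps):
    theta = rng.uniform(0.0, 2.0 * np.pi)
    p = np.array([np.cos(theta), np.sin(theta)])      # parameter signal
    X = rng.standard_normal(D)
    x, y = E.T @ X, F.T @ (M_of(theta) @ X)           # distributed training pair
    M_act = V @ p                                     # activities M_ij = sum_k V_ijk p_k (eq. 3.9)
    V += lr * (np.outer(y, x) - M_act)[:, :, None] * p  # error-correcting rule (eq. 3.10)

# Test the learned gain-modulated network at a held-out parameter value.
theta = 1.0
x = E.T @ np.array([1.0, 0.0])
y = (V @ np.array([np.cos(theta), np.sin(theta)])) @ x
Y_hat = F @ y / (N * N * sE2 * sF2)
print(Y_hat)        # close to M_of(1.0) @ [1, 0]
```

No external teacher appears in the update: the reference signal M_act is computed internally from the current weights, as the text notes.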
In a more general case, the activities 𝓜_ij can depend on p via a perceptron or a multilayer perceptron. The learning rule, equation 3.10, can then be transformed to include a transfer function and possibly be the first step of an error backpropagation.

4 Generalization to Other Tuning Functions
It can be asked whether the mechanisms and properties of distributed computation proposed here depend on the specific cosine tuning that has been assumed (see equation 2.2). We now show that these results can be extended to a broad class of tuning functions (see equation 2.1), if we assume that the E_i have a uniform distribution. Following Georgopoulos et al. (1988), we use a continuous formalism (see also Mussa-Ivaldi, 1988). Given the previous assumptions, the uniformity guarantees that the population vector points in the same direction as the encoded vector (Georgopoulos et al., 1988):

∫∫ f(X · E, b) E dP_E dP_b = X.    (4.1)
The independence of b and E allows writing (Georgopoulos et al., 1988)

∫∫ f(X · E, b) E dP_E dP_b = ∫ ( ∫ f(X · E, b) E dP_{E|b} ) dP_b.

Thus, any demonstration made with constant b can be easily generalized to varying b. Accordingly, we remove b in the following calculations.

4.1 Encoding-Decoding Method.
4.1.1 Distributed Representation of Vectors. The distributed representation of a vector X in 𝔼 is no longer a vector but a function x = x(E) = f(X · E). According to our hypothesis, the vector X can be recovered from its distributed representation x (see equation 4.1). The dot product of the distributed representations of two vectors X and Z in 𝔼 is defined by

h(X, Z) = ∫ f(X · E) f(Z · E) dP_E.    (4.2)
We first observe that h can be manipulated as a tuning function. A vector can be reconstructed from tuning curve functions (see equation 4.1), as well
as from h:

∫ h(X, E) E dP_E = ∫∫ f(X · E′) f(E · E′) E dP_E dP_E′
               = ∫ f(X · E′) [ ∫ f(E · E′) E dP_E ] dP_E′ = ∫ f(X · E′) E′ dP_E′    (4.3)
               = X,

where the bracketed inner integral equals E′ by equation 4.1.
This property is immediate in the cosine case, since h = f = dot product. In the general case, it can be shown that h(X, Z) is a function of X · Z and that if f is nondecreasing, so is h (see section A.5).

4.1.2 Distributed Representation of Linear Transformations. There is a theoretical form (no longer a matrix, but a function) for the distributed representation of a linear mapping M. It is defined by
𝓜(E, F) = g(Fᵀ M E)  and  y(F) = ∫ 𝓜(E, F) x(E) dP_E,    (4.4)
where y(F) is the distributed output corresponding to the distributed input x(E) = f(X · E) of a physical vector X. This exact counterpart of the cosine case (see equation 2.2) is easily demonstrated by showing that ∫ y(F) F dP_F = Y, with Y = M X.

4.2 Stabilizing Distributed Representations. In the same way, there is
a straightforward generalization of the matrix W_E (see section 3.3.3), defined by

W_E(E, E′) = f(E · E′).
However, unlike the cosine case, these theoretical forms are not particularly useful, since they are not in general similar to the versions obtained by learning. Thus, in the following section, we derive and use Hebbian versions 𝓜* and W*_E of 𝓜 and W_E.

4.3 Learning Distributed Representations of Linear Transformations.
The learning rules for the fixed or the parameter-dependent mapping still apply. We use a continuous formalism for both tuning functions and training examples. A straightforward derivation proves that the distributed transformation can be learned as before through input-output correlations. It can be shown that the distributed map corresponding to a linear transformation M between vectorial spaces 𝔼 and 𝔽 is represented by the function

𝓜*(E, F) = ∫ f(X^ν · E) g(M X^ν · F) dP_ν,    (4.5)
where P_ν is the distribution of training examples (see appendix A.6). It can be seen that 𝓜*(E, F) is a function of E · F, using the method developed for equation 4.2. Next we define W*_E as the distributed representation of the identity mapping on 𝔼 obtained by learning (see equation 4.5):

W*_E(E, E′) = ∫ f(X^ν · E) f(X^ν · E′) dP_ν.

From equation 4.2, we see that W*_E = h(E, E′). The function W*_E can be used as a feedforward or lateral interaction function. Any input distribution x(E) is transformed as

W*_E x(E) = ∫ h(E, E′) x(E′) dP_E′.    (4.6)

If x is the distributed representation of a vector X of the physical space, it is immediately clear that

W*_E x(E) = ∫ h(E, E′) f(X · E′) dP_E′,

which is a nondecreasing function of X · E (see the method in section A.5). W*_E modifies the profile of activity but changes neither the preferred attribute nor the center of mass of a population code. If x is any distribution, the result of equation 4.6 depends on the shape of the dot product function (and thus on the tuning function, since the two are tightly related; see section 4.4). W*_E is a Fredholm integral operator with kernel h. If the kernel is degenerate, that is, if it can be written as a finite sum of basis functions (e.g., a Fourier series), then Im W*_E is a finite-dimensional space generated by these functions. Thus, W*_E suppresses all other harmonics. The case of a cosine distribution of lateral interactions (see section 3) corresponds to a two-dimensional space generated by the cos and sin functions (and W*_E is a projection). A gaussian distribution of weights, which contains a few significant harmonics, is known empirically to suppress noise efficiently (Douglas, Koch, Mahowald, Martin, & Suarez, 1995; Salinas & Abbott, 1996). However, W*_E is not in general a projection, which is problematic if W*_E represents a transform through recurrent connections.
Solutions in the discrete spatial case have been discussed (see section 3.3.3) and extend to this case (Zhang, 1996). In particular, the scaling of the largest eigenvalue is possible since W*_E has a largest eigenvalue, which is equal to ‖W*_E‖. After learning, the output neurons are not tuned to input vectors as they are during the learning phase; that is, the g(M X · F) are not their tuning functions. Indeed, the activity of an output neuron is

y(F) = ∫ h(X, X^ν) g(M X^ν · F) dP_ν,    (4.7)
which is generally not equal to g(M X · F). However, using the same reasoning as for equation 4.2 (see section A.5), we can show that y(F) = g̃(M X · F). It follows that the y are still broadly tuned to M X; moreover, the PAs in input space keep the same expression Fᵀ M as in the cosine case.

4.4 Numerical Results for 2D Circular Normal Tuning Functions. Contrary to the cosine case, learning with tuning function g leads to a different output tuning g̃. Is this change important? How similar are these two tuning curves? We illustrate here the differences among the intrinsic tuning functions (f and g), the dot product (h), and the output tuning function (g̃), using circular normal (CN) tuning functions (Mardia, 1972) in ℝ². These functions have a profile similar to a gaussian while being periodic. Their general expression is f(cos θ) = A e^{K cos θ} + B. We used the following version for both input and output tuning,
f(u) = g(u) = (e^{Ku} − e^{−K}) / (e^K − e^{−K}),
where K controls the width at half-height. Thus, f and g take values between 0 and 1 if the coded vectors and the PAs are unit vectors. With these assumptions, h = f ∗ f, where ∗ is the convolution, and thus their respective Fourier coefficients verify ĥ_n = f̂_n². Interestingly, the distribution of the Fourier coefficients of CN functions is such that h, once normalized between 0 and 1, is very close to a broader CN function h_CN. In our numerical simulations, the relative error was

‖h_normalized − h_CN‖ / ‖h_normalized‖ < 2%,

where ‖h‖ denotes the L²-norm of h. However, the convolution leads to a widening of h compared to f (see Figure 1A), since it favors the largest Fourier coefficients, which happen to be the first ones for CN functions. This broadening effect is maximal for f of width ≈ 110 degrees (see Figure 1B). Since g̃ = h ∗ g (see equation 4.7), g̃ is still broader than h (see Figure 1B). These results show that feedforward or recurrent neural processing preserves the general shape of intrinsic tuning functions but increases their width. After about two to five feedforward steps, the tuning of output neurons (g̃) is close to a cosine.

5 Deviations for Nonuniform Distributions of PAs
Figure 1: (A) Shape of the intrinsic tuning curve of input and output neurons (f, dotted line), the distributed dot product (h, gray line), and input tuning of output neurons (g̃, solid line). The width (at half-height) of f was 60° (K = 5.2). The curves for h and g̃ were constructed from the first 20 Fourier coefficients of f. (B) Width (at half-height) of h (gray line) and g̃ (solid line) as a function of the width of f (K = 0.01–45).

The preceding results on noncosine tuning curves were obtained for a uniform distribution of PAs, whereas a weaker constraint (regularity condition) was sufficient in the cosine case. Here we explore numerically to what extent the population computation can be accurate for a regular nonuniform distribution of PAs. In relation to electrophysiological data (Oyster & Barlow, 1967; Lacquaniti, Guigon, Bianchi, Ferraina, & Caminiti, 1995; Wylie, Bischof, & Frost, 1998), such a distribution was assumed clustered along preferred axes (here in 2D). To express the clustering along the axis θ = 0, the probability density of a vector E = (cos θ, sin θ) was assumed to follow dP_E/dθ ∝ exp(−θ²/V) for θ ∈ ]−π/4, π/4]. The same density was used modulo π/2 for the directions θ = π/2, π, and 3π/2. The resulting densities for four different values of V, from V = 3 (moderately clustered distribution) to V = 10⁻¹² (PAs aligned on the axes), are plotted in the inset of Figure 2.
Figure 2: Precision of the distributed computation as measured by the discrepancy between the decoded input and output in the case of the distributed identity mapping. The error in the transformation is plotted as a function of the tuning width of f and the clustering of the 2D basis vectors E around the axes. The inset shows the four distributions of E that have been tested (see the text). For each condition, the error was calculated as the mean absolute difference between encoded and decoded vectors over 1000 trials (i.e., 1000 randomly chosen encoded vectors).
To illustrate how the scheme of distributed computation proposed here behaves in these conditions, we measured the errors induced by the distributed computation W_E of the identity function. The population was sampled exactly regularly, so that heavy computations involving large numbers of neurons could be avoided. Assuming the tuning functions to be circular normal, we computed the angular difference between the decoded input and output vectors for different distributions of E and different tuning widths. The results shown in Figure 2 were obtained by computing the identity on 1000 random vectors with a regular population of 1000 neurons. As expected, the most uniform distribution behaves best, generating very small errors. The deviation of the population vector increases with the clustering of the basis vectors. However, the more the tuning curves broaden, the less pronounced this effect is. In particular, if the tuning width is greater than 100 degrees, the directional error in the population vector is always less than 5 degrees. We conclude that the distributed computation of linear
Pierre Baraduc and Emmanuel Guigon
mappings is still possible with minimal error in the case of clustered PAs when tuning curves are sufficiently broad.

6 Discussion
This article has addressed the calculation of vectorial transformations by populations of cosine-tuned neurons in a linear framework. We have shown that appropriate distributed representations of these transformations were made possible by simple and common properties of neural computation and learning: decoding with the population vector, regular distributions of tuning selectivities and input-output training examples, Hebbian learning, and cosine-tuned lateral interactions between neurons. We have analytically extended this result to the noncosine broadly tuned case for uniform distributions and numerically to regular nonuniform distributions. The use of the population vector may appear problematic because it is in general not an optimal decoding method (Salinas & Abbott, 1994). Statistical optimality is clearly an important theoretical issue (Snippe, 1996; Pouget et al., 1998), but it is unclear whether it is also a relevant concept for computation in the nervous system. As emphasized by several authors (Paradiso, 1988; Salinas & Abbott, 1994), the use of a large number of cells to estimate a parameter is likely to overcome variability in single-cell behavior. In fact, accuracy (small bias and low variance compared to the coding range) may be more important than optimality. Furthermore, the main difficulty with the PV method is its poor behavior when used for biased distributions of preferred directions (Glasius, Komoda, & Gielen, 1997) or populations of sharply tuned neurons (Seung & Sompolinsky, 1993). We have restricted our theory to regular or uniform distributions and broadly tuned neurons. For regular distributions, the PV method is an optimal linear estimator (Salinas & Abbott, 1994). Broadly tuned neurons allow the PV method to approach the maximum likelihood method for Poisson noise (Seung & Sompolinsky, 1993). The question arises whether electrophysiological data actually satisfy the regularity condition.
This is clearly the case for uniform distributions (Georgopoulos et al., 1986; Schwartz, Kettner, & Georgopoulos, 1988; Caminiti, Johnson, Galli, Ferraina, & Burnod, 1991). However, not all distributions are uniform (Hubel & Wiesel, 1962; Oyster & Barlow, 1967; van Gisbergen, van Opstal, & Tax, 1987; Cohen, Prud’homme, & Kalaska, 1994; Prud’homme & Kalaska, 1994; Lacquaniti et al., 1995; Rosa & Schmid, 1995; Wylie et al., 1998), and it remains to be checked whether these distributions are regular. A particular distribution is a clustering of PAs along preferred axes (Oyster & Barlow, 1967; Cohen et al., 1994; Prud’homme & Kalaska, 1994; Lacquaniti et al., 1995; Wylie et al., 1998; see also Soechting & Flanders, 1992). Populations of neurons in posterior parietal cortex of monkeys have such a distribution of PAs and satisfy the regularity condition (p < 0.01, unpublished observations from the data of Battaglia-Mayer et al., 2000). The same was seen
in anterior parietal cortex (E. Guigon, unpublished observations from the data of Lacquaniti et al., 1995). This latter observation indicates that vectorial computation can occur not only in uniformly distributed neuronal populations, but also at the different levels of a sensorimotor transformation where neurons are closely related to receptors or actuators (Soechting & Flanders, 1992). The regularity condition allows basic operations of linear algebra to be implemented in a distributed fashion. A similar principle was first proposed by Touretzky et al. (1993). They introduced an architecture called a sinusoidal array, which encodes a vector as distributed activity across a neuronal population (see equation 2.2), and they used this architecture to solve reaching and navigation tasks (Touretzky et al., 1993; Redish, Touretzky, & Wan, 1994; Redish & Touretzky, 1994). However, in their formulation, vector rotation (which is a linear transformation) was implemented in a specific way, using either shifting circuits (Touretzky et al., 1993) or repeated vector addition (Redish et al., 1994). In our framework, vector rotation can be represented by a distributed linear transformation like any morphism (see section 3.2). We derived closely related results for a broad class of tuning functions (see equation 2.1), although under more restrictive hypotheses (uniform distribution of PAs). A theoretically unbiased population vector can be constructed from a nonuniformly distributed population of neurons by adjusting the distribution of tuning strength (Germain & Burnod, 1996) or tuning widths (Glasius et al., 1997). However, these methods cannot be used to release the uniformity constraint since the hypothesis of independence of the PA and parameter distributions is violated. A particular example of nonuniform distribution of PAs is their clustering along axes (Oyster & Barlow, 1967; Cohen et al., 1994; Prud’homme & Kalaska, 1994; Lacquaniti et al., 1995; Wylie et al., 1998).
In this case, although the operation of the network is only exact for pure cosine tuning curves, we have shown numerically that a good approximate computation is still possible if the tuning is sufficiently broad. Salinas and Abbott (1995) derived a formal rule to learn the identity mapping in dimension 1 (i.e., $x \to x$ through uniformly distributed examples). Their demonstration relies on the fact that the tuning curves and synaptic connections depend on only the magnitude of the difference between preferred attributes. Our results generalize this idea to arbitrary linear mappings in any dimension. The generalized constraint is that the tuning curves and connections depend on the scalar product of preferred attributes, which includes the one-dimensional case. Salinas and Abbott (1995) also provided a solution to $(x, y) \to x + y$ in dimension 1. However, their method may not be generalizable to higher dimensions. In fact, this transformation is not a (bi)linear transformation and is not easily accounted for by our theory (except in the cosine case; see also Touretzky et al., 1993). Interestingly, when one asks how information is read out from distributed maps of directional
signals, vector averaging and winner-take-all are more likely decision processes than vector summation (Salzman & Newsome, 1994; Zohary, Scase, & Braddick, 1996; Groh, Born, & Newsome, 1997; Lisberger & Ferrera, 1997; Recanzone, Wurtz, & Schwarz, 1997). An important application of our theory is learning a coordinate transformation from its Jacobian. This problem can be solved formally as an ensemble of position-dependent linear mappings (Baraduc et al., 1999). However, unlike previous models (Burnod et al., 1992; Bullock et al., 1993), it is not required that position information be coded in a topographic manner. Arbitrary codes for position can be used provided that the mapping between the position and the distributed representation of the Jacobian (see equation 3.9) is correctly learned. The most interesting point is that neurons of the network display realistic firing properties, which resemble those of parietal and motor cortical neurons. These results render the theory presented here attractive for modeling sensorimotor transformations.

Appendix

A.1 Convergence of Q in Probability for Regular PAs. Here we show that $Q = \frac{1}{N}\mathbf{E}\mathbf{E}^T$ converges in probability toward the identity matrix (up to a multiplicative constant) if the distribution of the PAs $\mathbf{E}_i$ is regular. The $k$th ($1 \le k \le D$) diagonal term of $Q$ is
$$Q_{kk} = \frac{1}{N}\sum_{i=1}^{N} E_{ik}^2,$$
which tends in probability toward $\sigma_E^2$ when $N$ tends to infinity. Indeed,
$$V(Q_{kk}) = \frac{1}{N^2}\sum_{i=1}^{N} V(E_{ik}^2) = \frac{V(E_{1k}^2)}{N}.$$
The off-diagonal element $Q_{kl}$ ($k \ne l$) of $Q$ is
$$Q_{kl} = \frac{1}{N}\sum_{i=1}^{N} E_{ik}E_{il};$$
hence,
$$\lim_{N\to\infty}\langle Q_{kl}\rangle = 0 \quad\text{and}\quad \lim_{N\to\infty} V(Q_{kl}) = \lim_{N\to\infty} V(E_{ik}E_{il})/N = 0.$$
Thus, $Q$ converges in probability toward $\sigma_E^2 I_D$.

A.2 Correlated Noise. Writing $Q$ for the correlation matrix of the noise, the variance $V(\mathbf{E}\mathbf{g}/N)$ of the read-out vector can be expressed as the first diagonal of the matrix
$$V = \frac{1}{N^2}\,\mathbf{E} Q \mathbf{E}^T.$$
For an independently distributed gaussian noise, $Q$ is proportional to the identity matrix and $V(\mathbf{E}\mathbf{g}/N) \propto 1/N$. In the case of correlated gaussian noise, $Q = \sigma_g^2\,[I_N + c\,(I_{N,N} - I_N)]$, and we get
$$V = \frac{1-c}{N^2}\,\sigma_g^2\,\mathbf{E}\mathbf{E}^T = \frac{(1-c)\,\sigma_g^2\,\sigma_E^2}{N}\, I_D.$$
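A small Monte Carlo sketch (illustrative values of $N$, $c$, and $\sigma_g$, not from the article; NumPy assumed) can check the correlated-gaussian result: with equally spaced 2D PAs, the empirical per-component variance of the read-out vector should match $(1-c)\,\sigma_g^2\,\sigma_E^2/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c, sigma_g = 400, 0.6, 1.0
ang = 2 * np.pi * np.arange(N) / N            # regular (equally spaced) PAs
E = np.vstack([np.cos(ang), np.sin(ang)])     # 2 x N, so sigma_E**2 = 1/2

# Correlated gaussian noise with Cov(g_i, g_j) = sigma_g^2 (c + (1-c) delta_ij),
# built from one shared source and N private standard normal sources.
trials = 10000
shared = rng.normal(size=(trials, 1))
private = rng.normal(size=(trials, N))
g = sigma_g * (np.sqrt(c) * shared + np.sqrt(1 - c) * private)

readout = g @ E.T / N                # read-out vector E g / N, one per trial
emp = readout.var(axis=0).mean()     # empirical per-component variance
theory = (1 - c) * sigma_g**2 * 0.5 / N
```

Note that the shared noise component cancels exactly in the read-out because the regular PAs sum to zero, which is why only the $(1-c)$ private part survives.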
For Poisson noise, the noise correlation matrix is $Q_{ij} = (1-c)\,\delta_{ij}\, x_i + c\,\sqrt{x_i x_j}$. If there is no correlation ($c = 0$), matrix $V$ is $\mathbf{E}\,\mathrm{diag}(x_i)\,\mathbf{E}^T/N^2$, and its $i$th diagonal term writes
$$V_{ii} = \frac{1}{N^2}\sum_k x_k E_{ki}^2 \le \frac{1}{N}\, x_{\max}\, Q_{ii} \le \frac{x_{\max}}{N}.$$
For nonzero $c$, the term
$$V_c = \frac{c}{N^2}\,\mathbf{E}\left[\sqrt{x_i x_j}\right]\mathbf{E}^T = \frac{c}{N^2}\left(\mathbf{E}\sqrt{\mathbf{x}}\right)\left(\mathbf{E}\sqrt{\mathbf{x}}\right)^T$$
must be added. This term is independent of $N$ and can be evaluated numerically for a few types of distribution of $\mathbf{E}$. For instance, if the PAs are uniformly distributed in 3D space and the minimum firing rate equals zero, and assuming that the norms of the $\mathbf{E}_i$ and their directions are independently distributed,
$$V_c = c\,\|\mathbf{E}\|\left[\frac{1}{4\pi}\int_{-1}^{1}\int_{0}^{2\pi}\sqrt{1+s}\,\Big(\sqrt{1-s^2}\cos\theta,\ \sqrt{1-s^2}\sin\theta,\ s\Big)\, d\theta\, ds\right]^2 \le c\, x_{\max}\left[\frac{1}{2}\int_{-1}^{1} s\sqrt{1+s}\, ds\right]^2 \le \frac{8}{225}\, c\, x_{\max};
$$
hence, the upper bound of equation 3.4 (here $\|\mathbf{E}\|$ denotes the mean norm of the $\mathbf{E}_i$ vectors). The derivation of equation 3.5 is left to the reader. The demonstration of the properties of $W_{\mathbf{E}}$ is analogous.

A.3 General Cosine Tuning. In most of the sections on coding and decoding, the baseline term $\mathbf{b}$ was 0. We now show how the results change for a nonzero baseline. We can use an approach similar to that in section 3.1 to show that
$$\frac{1}{N}\sum x_i y_i \longrightarrow \mathbf{X}\cdot\mathbf{Y} + \hat{b} \quad\text{in probability as}\quad N \to \infty,$$
where $\mathbf{x}$, $\mathbf{y}$ are distributed representations of physical vectors $\mathbf{X}$, $\mathbf{Y}$, and $\hat{b} = \lim_{N\to\infty} \mathbf{b}^T\mathbf{b}/N$ depends only on $\mathbf{b}$. Thus, the scalar product of two vectors is easily deduced from the scalar product of their distributed representations. The expression for matrix $W_{\mathbf{E}}$ (see section 3.3.3), which we write $W$ for simplicity, transforms to
$$W' = W + \frac{\mathbf{b}}{N \bar{b}}\,\mathbf{1}_N^T,$$
where $\bar{b}$ is the mean over $i$ of the $b_i$. It can be checked that $W'$ is a projection on the affine subspace $E_p + \mathbf{b}$ and possesses the same properties as $W$. Learning a linear transformation amounts to calculating the matrix
$$M^* = \sum_\nu \left(F^T \mathbf{Y}^\nu + \mathbf{b}_F\right)\left(E^T \mathbf{X}^\nu + \mathbf{b}_E\right)^T,$$
where $\mathbf{b}_E$ and $\mathbf{b}_F$ denote the mean activity of input and output neurons, respectively. If the training inputs satisfy the regularity condition, we have
$$M^* = \underbrace{F^T M \left(\sum_\nu \mathbf{X}^\nu (\mathbf{X}^\nu)^T\right) E}_{k\mathbf{M}} + F^T M \underbrace{\left(\sum_\nu \mathbf{X}^\nu\right)}_{0} \mathbf{b}_E^T + \mathbf{b}_F \underbrace{\left(\sum_\nu (\mathbf{X}^\nu)^T\right)}_{0} E + N_{ex}\, \mathbf{b}_F \mathbf{b}_E^T,$$
where $k$ is a proportionality constant defined by equation 3.6. The regularity condition leads to $k = r^2 N_{ex}/D_E$, where $r$ is the mean norm of the input examples and $D_E = \dim E$. Hence,
$$M^* \propto \mathbf{M} + \frac{r^2}{D_E}\,\mathbf{b}_F \mathbf{b}_E^T.$$
Thus, appropriate mapping occurs, although there is no guarantee that the output baseline activity will equal the baseline activity of the training patterns.

A.4 Nonuniform $\mathbf{X}^\nu$: Convergence Toward the Moore-Penrose Inverse. When the $\mathbf{F}_i$ are uniformly distributed in $F$, learning from the examples of a noninvertible mapping $M_1$ between output and input converges toward the distributed representation of its Moore-Penrose inverse.
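The mechanism behind this claim — preimages $\mathbf{Y}^\nu = M_1^{\dagger}\mathbf{E}_j + \mathbf{K}^\nu$ whose kernel components are symmetric about zero average out to the Moore-Penrose solution — can be illustrated with a toy computation. The map below is an arbitrary hypothetical example, not taken from the article (NumPy assumed).

```python
import numpy as np

rng = np.random.default_rng(0)
M1 = np.array([[1.0, 2.0, 0.0],      # hypothetical noninvertible map R^3 -> R^2
               [0.0, 1.0, 1.0]])
M1_pinv = np.linalg.pinv(M1)         # Moore-Penrose inverse, zero on ker(M1)

_, _, Vt = np.linalg.svd(M1)
kernel = Vt[-1]                      # unit vector spanning ker(M1)

Ej = np.array([1.0, -2.0])
# Preimages of Ej: Y = M1^+ Ej + K, with K in ker(M1) symmetric about zero.
K = rng.normal(size=(100_000, 1)) * kernel
Y = M1_pinv @ Ej + K
mean_Y = Y.mean(axis=0)              # the K average out, leaving M1^+ Ej
```

Every row of `Y` maps back to `Ej` under `M1`, yet their average singles out the minimum-norm (Moore-Penrose) preimage, which is the content of the proof that follows.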
Start from equation 3.8, and take $\mathbf{Y}^\nu \in M_1^{-1}(\mathbf{E}_j)$. As the Moore-Penrose inverse of $M_1$ is zero on the kernel of $M_1$, we can write $\mathbf{Y}^\nu = M_1^{\dagger}\mathbf{E}_j + \mathbf{K}^\nu$, where $\mathbf{K}^\nu \in \ker M_1$. If we assume that the $\mathbf{y}^\nu$ are uniformly distributed in $F_p$ (index $p$ is defined in section 3.3.3), then the distribution of $\mathbf{Y}^\nu$ is uniform. It follows that the distribution of the $\mathbf{K}^\nu$ is symmetric with respect to zero. Hence,
$$\sum_{M_1 \mathbf{Y}^\nu = \mathbf{E}_j} \mathbf{Y}^\nu \propto M_1^{\dagger}\,\mathbf{E}_j.$$
The proportionality factor is identical for all $j$ if equation 3.7 is used, which completes the proof.

A.5 Distributed Dot Product. We now show that $h(\mathbf{X}, \mathbf{Z})$ is a function of $\mathbf{X}\cdot\mathbf{Z}$, assuming that the encoded vectors are unit vectors. We denote by $S$ the unit sphere of $E$ and define
$$S_a(\mathbf{X}) = \{\mathbf{U} \in S \mid \mathbf{U}\cdot\mathbf{X} = a\}.$$
Then
$$h(\mathbf{X},\mathbf{Z}) = \int_a \int_{S_a(\mathbf{X})} f(a)\, f\big(\mathbf{Z}\cdot(a\mathbf{X} + \mathbf{E}_\perp)\big)\, dP_{\mathbf{E}}\, da = \int_a f(a) \underbrace{\int_{S_a(\mathbf{X})} f\big(a\,\mathbf{Z}\cdot\mathbf{X} + \mathbf{Z}\cdot\mathbf{E}_\perp\big)\, dP_{\mathbf{E}}}_{h_a(\mathbf{X},\mathbf{Z})}\, da,$$
where $\mathbf{E}_\perp$ is the projection of $\mathbf{E}$ on the subspace orthogonal to $\mathbf{X}$. Let us define $S_{au} = \{\mathbf{E} \in S \mid \mathbf{Z}\cdot\mathbf{E}_\perp = u \text{ and } \mathbf{X}\cdot\mathbf{E} = a\}$, and write
$$h_a(\mathbf{X},\mathbf{Z}) = \int_u f\big(a\,\mathbf{Z}\cdot\mathbf{X} + u\big) \int_{S_{au}} dP_{\mathbf{E}} = \int_u f\big(a\,\mathbf{Z}\cdot\mathbf{X} + u\big)\, dP_u,$$
which depends only on $\mathbf{X}\cdot\mathbf{Z}$. This is the required result. Moreover, if $f$ is nondecreasing (which is generally the case for a tuning function), it is immediate that $h$ is nondecreasing.

A.6 Hebbian Learning of Distributed Maps: General Case. The following derivation shows that Hebbian learning of linear mappings can still be achieved.
Using equation 4.4, we obtain
$$\int_F y(\mathbf{F})\,\mathbf{F}\, dP_{\mathbf{F}} = \int_\nu \int_F \int_E f(\mathbf{X}^\nu\cdot\mathbf{E})\, g(M\mathbf{X}^\nu\cdot\mathbf{F})\, x(\mathbf{E})\,\mathbf{F}\, dP_{\mathbf{E}}\, dP_{\mathbf{F}}\, dP_\nu$$
$$= \int_\nu \int_F \underbrace{\left[\int_E f(\mathbf{X}^\nu\cdot\mathbf{E})\, f(\mathbf{X}\cdot\mathbf{E})\, dP_{\mathbf{E}}\right]}_{h(\mathbf{X},\mathbf{X}^\nu)} g(M\mathbf{X}^\nu\cdot\mathbf{F})\,\mathbf{F}\, dP_{\mathbf{F}}\, dP_\nu$$
$$= \int_\nu h(\mathbf{X},\mathbf{X}^\nu) \underbrace{\left[\int_F g(M\mathbf{X}^\nu\cdot\mathbf{F})\,\mathbf{F}\, dP_{\mathbf{F}}\right]}_{M\mathbf{X}^\nu} dP_\nu = \int_\nu h(\mathbf{X},\mathbf{X}^\nu)\, M\mathbf{X}^\nu\, dP_\nu.$$
If we assume that the distribution of training examples has the same properties as the distribution of PAs, then $\mathbf{Y} = M\mathbf{X}$ (using equation 4.3). This proves that the vector represented in the output activities is correct.

Acknowledgments
We thank Yves Burnod for fruitful discussions, Alexandre Pouget and an anonymous reviewer for helpful comments, and Marc Maier and Pierre Fortier for revising our English.

References

Abbott, L., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comp., 11, 91–101.
Baldi, P., & Hornik, K. (1995). Learning in linear networks: A survey. IEEE Trans. Neural Netw., 6(4), 837–858.
Baraduc, P., Guigon, E., & Burnod, Y. (1999). Where does the population vector of motor cortical cells point during arm reaching movements? In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 83–89). Cambridge, MA: MIT Press. Available on-line at: http://www.snv.jussieu.fr/guigon/nips99.pdf.
Battaglia-Mayer, A., Ferraina, S., Mitsuda, T., Marconi, B., Genovesio, A., Onorati, P., Lacquaniti, F., & Caminiti, R. (2000). Early coding of reaching in the parietooccipital cortex. J. Neurophysiol., 83(4), 2374–2391.
Bullock, D., Grossberg, S., & Guenther, F. (1993). A self-organizing neural model of motor equivalence reaching and tool use by a multijoint arm. J. Cogn. Neurosci., 5, 408–435.
Burnod, Y., Grandguillaume, P., Otto, I., Ferraina, S., Johnson, P., & Caminiti, R. (1992). Visuo-motor transformations underlying arm movements toward visual targets: A neural network model of cerebral cortical operations. J. Neurosci., 12, 1435–1453.
Caminiti, R., Johnson, P., Galli, C., Ferraina, S., & Burnod, Y. (1991). Making arm movements within different parts of space: The premotor and motor cortical representation of a coordinate system for reaching to visual targets. J. Neurosci., 11, 1182–1197.
Cohen, D., Prud’homme, M., & Kalaska, J. (1994). Tactile activity in primate primary somatosensory cortex during active arm movements: Correlation with receptive field properties. J. Neurophysiol., 71, 161–172.
Crowe, A., Porrill, J., & Prescott, T. (1998). Kinematic coordination of reach and balance. J. Mot. Behav., 30(3), 217–233.
Davis, P. (1979). Circulant matrices. New York: Wiley.
Douglas, R., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Gawne, T., & Richmond, B. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci., 13, 2758–2771.
Georgopoulos, A. (1996). On the translation of directional motor cortical commands to activation of muscles via spinal interneuronal systems. Cogn. Brain Res., 3(2), 151–155.
Georgopoulos, A., Kettner, R., & Schwartz, A. (1988). Primate motor cortex and free arm movements to visual targets in 3-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci., 8, 2928–2937.
Georgopoulos, A., Schwartz, A., & Kettner, R. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Germain, P., & Burnod, Y. (1996). Computational properties and autoorganization of a population of cortical neurons. In Proc. International Conference on Neural Networks, ICNN’96 (pp. 712–717). Piscataway, NJ: IEEE.
Glasius, R., Komoda, A., & Gielen, C. (1997).
The population vector, an unbiased estimator for non-uniformly distributed neural maps. Neural Netw., 10, 1571–1582.
Groh, J., Born, R., & Newsome, W. (1997). How is a sensory map read out? Effects of microstimulation in visual area MT on saccades and smooth pursuit eye movements. J. Neurosci., 17(11), 4312–4330.
Grossberg, S., & Kuperstein, M. (1989). Neural dynamics of adaptive sensory-motor control (Exp. ed.). Elmsford, NY: Pergamon Press.
Hinton, G. (1984). Parallel computations for controlling an arm. J. Mot. Behav., 16(2), 171–194.
Hinton, G. (1992). How neural networks learn from experience. Sci. Am., 267(3), 145–151.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. (Lond.), 160, 106–154.
Kuperstein, M. (1988). Neural model of adaptive hand-eye coordination for single postures. Science, 239, 1308–1311.
Lacquaniti, F., Guigon, E., Bianchi, L., Ferraina, S., & Caminiti, R. (1995). Representing spatial information for limb movement: Role of area 5 in the monkey. Cereb. Cortex, 5(5), 391–409.
Lisberger, S., & Ferrera, V. (1997). Vector averaging for smooth pursuit eye movements initiated by two moving targets in monkeys. J. Neurosci., 17(19), 7490–7502.
Mardia, K. (1972). Statistics of directional data. London: Academic Press.
Mussa-Ivaldi, F. (1988). Do neurons in the motor cortex encode movement direction? An alternative hypothesis. Neurosci. Lett., 91, 106–111.
Mussa-Ivaldi, F., Morasso, P., & Zaccaria, R. (1988). Kinematic networks. A distributed model for representing and regularizing motor redundancy. Biol. Cybern., 60, 1–16.
Oyster, C., & Barlow, H. (1967). Direction-selective units in rabbit retina: Distribution of preferred directions. Science, 155, 841–842.
Paradiso, M. (1988). A theory for use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern., 58, 35–49.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. (1998). Statistically efficient estimation using population coding. Neural Comput., 10(2), 373–401.
Prud’homme, M., & Kalaska, J. (1994). Proprioceptive activity in primate primary somatosensory cortex during active arm reaching movements. J. Neurophysiol., 72(5), 2280–2301.
Recanzone, G., Wurtz, R., & Schwarz, U. (1997). Responses of MT and MST neurons to one and two moving objects in the receptive field. J. Neurophysiol., 78(6), 2904–2915.
Redding, G., & Wallace, B. (1997). Adaptive spatial alignment. Hillsdale, NJ: Erlbaum.
Redish, A., & Touretzky, D. (1994). The reaching task: Evidence for vector arithmetic in the motor system? Biol. Cybern., 71(4), 307–317.
Redish, A., Touretzky, D., & Wan, H. (1994). The sinusoidal array: A theory of representation for spatial vectors. In F. Eeckman (Ed.), Computation and neural systems (pp. 269–274). Boston: Kluwer.
Rosa, M., & Schmid, L. (1995).
Magnification factors, receptive field image and point-image size in the superior colliculus of flying foxes: Comparison with primary visual cortex. Exp. Brain Res., 102, 551–556.
Salinas, E., & Abbott, L. (1994). Vector reconstruction from firing rates. J. Comput. Neurosci., 1, 89–107.
Salinas, E., & Abbott, L. (1995). Transfer of coded information from sensory to motor networks. J. Neurosci., 15(10), 6461–6474.
Salinas, E., & Abbott, L. (1996). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci. U.S.A., 93(21), 11956–11961.
Salzman, C., & Newsome, W. (1994). Neural mechanisms for forming a perceptual decision. Science, 264, 231–237.
Sanger, T. (1994). Theoretical considerations for the analysis of population coding in motor cortex. Neural Comput., 6, 29–37.
Schwartz, A., Kettner, R., & Georgopoulos, A. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. I. Relations
between single cell discharge and direction of movement. J. Neurosci., 8, 2913–2927.
Seung, H., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. U.S.A., 90, 10749–10753.
Snippe, H. (1996). Parameter extraction from population codes: A critical assessment. Neural Comput., 8(3), 511–529.
Soechting, J., & Flanders, M. (1992). Moving in three-dimensional space: Frames of reference, vector, and coordinate systems. Annu. Rev. Neurosci., 15, 167–191.
Strang, G. (1988). Linear algebra and its applications (3rd ed.). San Diego: Harcourt Brace Jovanovich.
Touretzky, D., Redish, A., & Wan, H. (1993). Neural representation of space using sinusoidal arrays. Neural Comput., 5, 869–884.
van Gisbergen, J., van Opstal, A., & Tax, A. (1987). Collicular ensemble coding of saccades based on vector summation. Neuroscience, 21, 541–555.
Wylie, D., Bischof, W., & Frost, B. (1998). Common reference frame for neural coding of translational and rotational optic flow. Nature, 392, 278–282.
Yang, H., & Dillon, T. (1994). Exponential stability and oscillation of Hopfield graded response neural network. IEEE Trans. Neural Netw., 5(5), 719–729.
Zemel, R., & Hinton, G. (1995). Learning population codes by minimizing description length. Neural Comput., 7(3), 549–564.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neurosci., 16(6), 2112–2126.
Zhang, K., Ginzburg, I., McNaughton, B., & Sejnowski, T. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79(2), 1017–1044.
Zohary, E., Scase, M., & Braddick, O. (1996). Integration across directions in dynamic random dot displays: Vector summation or winner take all? Vision Res., 36(15), 2321–2331.
Zohary, E., Shadlen, M., & Newsome, W. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance.
Nature, 370, 140–143.

Received February 17, 2000; accepted July 5, 2001.
LETTER
Communicated by Michael Eisele
Redistribution of Synaptic Efficacy Supports Stable Pattern Learning in Neural Networks

Gail A. Carpenter
[email protected] Boriana L. Milenova
[email protected] Department of Cognitive and Neural Systems, Boston University, Boston, Massachusetts 02215, U.S.A.
Markram and Tsodyks, by showing that the elevated synaptic efficacy observed with single-pulse long-term potentiation (LTP) measurements disappears with higher-frequency test pulses, have critically challenged the conventional assumption that LTP reflects a general gain increase. This observed change in frequency dependence during synaptic potentiation is called redistribution of synaptic efficacy (RSE). RSE is here seen as the local realization of a global design principle in a neural network for pattern coding. The underlying computational model posits an adaptive threshold rather than a multiplicative weight as the elementary unit of long-term memory. A distributed instar learning law allows thresholds to increase only monotonically, but adaptation has a bidirectional effect on the model postsynaptic potential. At each synapse, threshold increases implement pattern selectivity via a frequency-dependent signal component, while a complementary frequency-independent component nonspecifically strengthens the path. This synaptic balance produces changes in frequency dependence that are robustly similar to those observed by Markram and Tsodyks. The network design therefore suggests a functional purpose for RSE, which, by helping to bound total memory change, supports a distributed coding scheme that is stable with fast as well as slow learning. Multiplicative weights have served as a cornerstone for models of physiological data and neural systems for decades. Although the model discussed here does not implement detailed physiology of synaptic transmission, its new learning laws operate in a network architecture that suggests how recently discovered synaptic computations such as RSE may help produce new network capabilities such as learning that is fast, stable, and distributed.
Neural Computation 14, 873–888 (2002)
© 2002 Massachusetts Institute of Technology
Gail A. Carpenter and Boriana L. Milenova
1 Introduction
The traditional experimental interpretation of long-term potentiation (LTP) as a model of synaptic plasticity is based on a fundamental hypothesis: “Changes in the amplitude of synaptic responses evoked by single-shock extracellular electrical stimulation of presynaptic fibres are usually considered to reflect a change in the gain of synaptic signals, and are the most frequently used measure for evaluating synaptic plasticity” (Markram & Tsodyks, 1996, p. 807). LTP experiments tested only with low-frequency presynaptic inputs implicitly assume that these observations may be extrapolated to higher frequencies. Paired action-potential experiments by Markram and Tsodyks (1996) call into question the LTP gain-change hypothesis by demonstrating that adaptive changes in synaptic efficacy can depend dramatically on the frequency of the presynaptic test pulses used to probe these changes. In that preparation, following an interval of pre- and postsynaptic pairing, neocortical pyramidal neurons are seen to exhibit LTP, with the amplitude of the post-pairing response to a single test pulse elevated to 166% of the pre-pairing response. If LTP were a manifestation of a synaptic gain increase, the response to each higher-frequency test pulse would also be 166% of the pre-pairing response to the same presynaptic frequency. Although the Markram–Tsodyks data do show an amplified response to the initial spike in a test train (EPSPinit), the degree of enhancement of the stationary response (EPSPstat) declines steeply as test pulse frequency increases (see Figure 1). In fact, post-pairing amplification of EPSPstat disappears altogether for 23 Hz test trains and then, remarkably, reverses sign, with test trains of 30–40 Hz producing post-pairing stationary response amplitudes that are less than 90% the size of pre-pairing amplitudes. Pairing is thus shown to induce a redistribution rather than a uniform enhancement of synaptic efficacy.
As Markram, Pikus, Gupta, and Tsodyks (1998) point out, redistribution of synaptic efficacy has profound implications for modeling as well as experimentation: “Incorporating frequency-dependent synaptic transmission into artificial neural networks reveals that the function of synapses within neural networks is exceedingly more complex than previously imagined” (p. 497). Neural modelers have long been aware that synaptic transmission may exhibit frequency dependence (Abbott, Varela, Sen, & Nelson, 1997; Carpenter & Grossberg, 1990; Grossberg, 1968), but most network models have not so far needed this feature to achieve their functional goals. Rather, the assumption that synaptic gains, or multiplicative weights, are fixed on the timescale of synaptic transmission has served as a useful cornerstone for models of adaptive neural processes and related artificial neural network systems. Even models that hypothesize synaptic frequency dependence would still typically have predicted the constant upper dashed line in Figure 1 (see section 2.1), rather than the change in frequency dependence
Neural Networks and Synaptic Efficacy
Figure 1: Relative amplitude of the stationary postsynaptic potential EPSPstat (in % of control) as a function of presynaptic spike frequency (I, in Hz) (adapted from Markram & Tsodyks, 1996, Figure 3c, p. 809). In the Markram–Tsodyks pairing paradigm, sufficient current to evoke 4–8 spikes was injected, pre- and post-, for 20 msec; this procedure was repeated every 20 sec for 10 min. Data points show the EPSPstat after pairing as a percentage of the control EPSPstat before pairing, for I = 2, 5, 10, 23, 30, 40 Hz, plus the low-frequency “single-spike” point, shown as a weighted average of the measured data: 2 × 0.25 and 17 × 0.067 Hz. If pairing had produced no adaptation, EPSPstat would be a function of I that was unaffected by pairing, as represented by the lower dashed line (100% of control). If pairing had caused an increase in a gain, or multiplicative weight, then EPSPstat would equal the gain times a function of I, which would produce the upper dashed line (166% of control). Markram and Tsodyks fit their data with an exponential curve, approximately $\big(1 + 0.104\,[e^{-(I-14.5)/7.23} - 1]\big) \times 100\%$, which crosses the neutral point at I = 14.5 Hz.
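The caption's fitted curve and the two dashed-line predictions are easy to reproduce numerically. In the sketch below, the constants come from the caption; the sign of the exponent is an assumption chosen so that the curve decreases from roughly 166% of control at low frequency through 100% at the 14.5 Hz neutral point, matching the behavior described in the text.

```python
import numpy as np

def epsp_stat_ratio(I_hz):
    """Exponential fit to the Markram-Tsodyks data from the Figure 1
    caption, as a percentage of control. The exponent's sign is an
    assumption chosen so the curve falls from ~166% at low frequency
    through the 100% neutral point at 14.5 Hz."""
    return (1 + 0.104 * (np.exp(-(I_hz - 14.5) / 7.23) - 1)) * 100

GAIN_HYPOTHESIS = 166.0   # flat line a pure gain increase would predict

for I in [0.5, 14.5, 35.0]:
    print(f"{I:5.1f} Hz: fit {epsp_stat_ratio(I):6.1f}% "
          f"vs gain hypothesis {GAIN_HYPOTHESIS}%")
```

Above roughly 14.5 Hz the fit drops below 100% of control, whereas a pure gain change would predict 166% at every test frequency; this divergence is the discrepancy the letter sets out to explain.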
observed in the Markram-Tsodyks redistribution of synaptic efficacy (RSE) experiments. A “bottom-up” modeling approach might now graft a new process, such as redistribution of synaptic efficacy, onto an existing system. While such a step would add complexity to the model’s dynamic repertoire, it may be difficult to use this approach to gain insight into the functional advantages of
the added element. Indeed, adding the Markram–Tsodyks effect to an existing network model of pattern learning would be expected to alter drastically the dynamics of input coding—but what could be the benefit of such an addition? A priori, such a modification even appears to be counterproductive, since learning in the new system would seem to reduce pattern discrimination by compressing input differences and favoring only low-frequency inputs. A neural network model called distributed ART (dART) (Carpenter, 1996, 1997; Carpenter, Milenova, & Noeske, 1998) features RSE at the local synaptic level as a consequence of implementing system design goals at the pattern processing level. Achieving these global capabilities, not the fitting of local physiological data, was the original modeling goal. This “top-down” approach to understanding the functional role of learned changes in synaptic potentiation suggests by example how the apparently paradoxical phenomenon of RSE may actually be precisely the element needed to solve a critical pattern coding problem at a higher processing level. The dART network seeks to combine the advantages of multilayer perceptrons, including noise tolerance and code compression, with the complementary advantages of adaptive resonance theory (ART) networks (Carpenter & Grossberg, 1987, 1993; Carpenter, Grossberg, & Reynolds, 1991; Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992). ART and dART models employ competitive learning schemes for code selection, and both are designed to stabilize learning. However, because ART networks use a classical steepest-descent paradigm called instar learning (Grossberg, 1972), these systems require winner-take-all coding to maintain memory stability with fast learning. A new learning law called the distributed instar (dInstar) (see section 2.1) allows dART code representations to be distributed across any number of network nodes.
The dynamic behavior of an individual dART synapse is seen in the context of its role in stabilizing distributed pattern learning rather than as a primary hypothesis. RSE here reflects a trade-off between changes in frequency-dependent and frequency-independent postsynaptic signal components, which support a trade-off between pattern selectivity and nonspecific path strengthening at the network level (see Figure 2). Models that implement distributed coding via gain adaptation alone tend to suffer catastrophic forgetting and require slow or limited learning. In dART, each increase in frequency-independent synaptic efficacy is balanced by a proportional decrease in frequency-dependent efficacy. With each frequency-dependent element assumed to be stronger than its paired frequency-independent element, the net result of learning is redistribution rather than nonspecific enhancement of synaptic efficacy. The system uses this mechanism to achieve the goal of a typical competitive learning scheme, enhancing network response to a given pattern while suppressing the response to mismatched patterns. At the same time, the dART network learning laws are designed to preserve prior codes. They do so by formally replacing the multiplicative weight with a dynamic weight (Carpenter, 1994), equal to the rectified difference between target node activation and an adaptive threshold, which embodies the long-term memory of the system. The dynamic weight permits adaptation only at the most active coding nodes, which are limited in number due to competition at the target field. Replacing the multiplicative weight with an adaptive threshold as the unit of long-term memory thus produces a coding system that may be characterized as quasi-localist (Carpenter, 2001) rather than localist (winner-take-all) or fully distributed. Adaptive thresholds, which are initially zero, become increasingly resistant to change as they become larger, a property that is essential for code stability. Both ART and dART also employ a preprocessing step called complement coding (Carpenter, Grossberg, & Rosen, 1991), which presents to the learning system both the original external input pattern and its complement. The system thus allocates two thresholds for coding each component of the original input, a device that is analogous to on-cell/off-cell coding in the early visual system. Each threshold can only increase, and, as in the Markram-Tsodyks RSE experiments, each model neuron can learn to enhance only low-frequency signals. Nevertheless, by treating the high-frequency and low-frequency components of the original input pattern symmetrically, complement coding allows the network to encode a full range of input features. Elements of the dART network that are directly relevant to the discussion of Markram-Tsodyks RSE during pairing experiments will now be defined quantitatively.

2 Results

2.1 Distributed ART Model Equations. A simple, plausible model of synaptic transmission might hypothesize a postsynaptic depolarization $T$ in response to a presynaptic firing rate $I$ as $T = w_{\mathrm{eff}}\, I$, where the effective weight $w_{\mathrm{eff}}$ might decrease as the frequency $I$ increases. Specifically, if
T = [f(I) · w] · I, where w is constant on a short timescale, then the ratio of T before versus after pairing would be independent of I. An LTP experiment that employs only single-shock test pulses relies on such a hypothesis for in vivo extrapolation and therefore implicitly predicts the upper dashed line in Figure 1. However, this synaptic computation is completely at odds with the Markram-Tsodyks measurements of adaptive change in frequency dependence. The net postsynaptic depolarization signal T at a dART model synapse is a function of two formal components with dual computational properties: a frequency-dependent component S, which is a function of the current presynaptic input I, and a frequency-independent component H, which is independent of I. Both components depend on the postsynaptic voltage y
878
Gail A. Carpenter and Boriana L. Milenova
[Figure 2 schematic: the dual computational properties of a dART synapse, a frequency-dependent component ("shrink to fit"; atrophy due to disuse) and a frequency-independent component (amplify; gain), held in synaptic dynamic balance.]

Figure 2: During dART learning, active coding nodes tend simultaneously to become more selective with respect to a specific pattern and to become more excitable with respect to all patterns. This network-level trade-off is realized by a synaptic-level dynamic balance between frequency-dependent and frequency-independent signal components. During learning, "disused" frequency-dependent elements, at synapses where the dynamic weight exceeds the input, are converted to frequency-independent elements. This conversion will strengthen the signal transmitted by the same path input (or by a smaller input), which will subsequently have the same frequency-dependent component but a larger frequency-independent component. Network dynamics also require that an active frequency-dependent (pattern-specific) component contribute more than the equivalent frequency-independent (nonspecific) component, which is realized as the hypothesis that parameter a is less than 1 in equation 2.1. This hypothesis ensures that among those coding nodes that would produce no new learning for a given input pattern, nodes with learned patterns that most closely match the input are most strongly activated.
and the adaptive threshold τ:

Frequency dependent: S = I ∧ [y − τ]⁺
Frequency independent: H = y ∧ τ
Total postsynaptic signal: T = S + (1 − a)H = I ∧ [y − τ]⁺ + (1 − a)(y ∧ τ). (2.1)
In equation 2.1, a ∧ b ≡ min{a, b} and [a]⁺ ≡ max{a, 0}. Parameter a is assumed to be between 0 and 1, corresponding to the network hypothesis that the pattern-specific component contributes more to postsynaptic activation than the nonspecific component, all other things being equal. The dynamic weight, defined formally as [y − τ]⁺, specifies an upper bound on the size of S; for smaller I, the frequency-dependent component is directly proportional to I. Note that this model does not assign a specific physiological interpretation to the postsynaptic signal T. In particular, T cannot simply be proportional to the transmitted signal, since T does not equal 0 when I = 0. The adaptive threshold τ, initially 0, increases monotonically during learning, according to the dInstar learning law:

dInstar: dτ/dt = [[y − τ]⁺ − I]⁺ = [y − τ − I]⁺. (2.2)
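As a concrete check on equations 2.1 and 2.2, the following sketch (our own illustration; the function names are not from the original) implements the dual-component signal and the dInstar rate, and verifies numerically that the two rectified forms of the learning law agree:

```python
def signal_components(I, y, tau, a):
    """Equation 2.1: frequency-dependent S, frequency-independent H,
    and total postsynaptic signal T = S + (1 - a) * H."""
    S = min(I, max(y - tau, 0.0))   # S = I ∧ [y − τ]⁺
    H = min(y, tau)                 # H = y ∧ τ
    return S, H, S + (1.0 - a) * H

def dinstar_rate(I, y, tau):
    """Equation 2.2: dτ/dt = [[y − τ]⁺ − I]⁺ = [y − τ − I]⁺."""
    return max(max(y - tau, 0.0) - I, 0.0)

# The two rectified forms of equation 2.2 agree; they differ only in
# where the rectification is applied.
for y in (0.0, 0.3, 1.0):
    for tau in (0.0, 0.225, 0.9):
        for I in (0.0, 0.4, 1.2):
            assert dinstar_rate(I, y, tau) == max(y - tau - I, 0.0)
```

With y = 1 (the reset condition used for Figure 3), τ = 0.225, and a = 0.6, an input I = 0.5 gives S = 0.5, H = 0.225, and T = 0.59.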
The distributed instar represents a principle of atrophy due to disuse, whereby a dynamic weight that exceeds the current input "shrinks to fit" that input (see Figure 2). When the coding node is embedded in a competitive network, the bound on total network activation across the target field causes dynamic weights to impose an inherited bound on the total learned change any given input can induce, with fast as well as slow learning. Note that τ remains constant if y is small or τ is large, and that

dτ/dt = [y − τ]⁺ − [y − τ]⁺ ∧ I
      = y − y ∧ τ − [y − τ]⁺ ∧ I
      = y − H − S.
When a threshold increases, the frequency-independent, or nonspecific, component H (see equation 2.1) becomes larger for all subsequent inputs, but the input-specific component S becomes more selective. For a high-frequency input, a nonspecifically increased component is neutralized by a decreased frequency-dependent component. The net computational effect of a threshold increase (e.g., due to pairing) is an enhancement of the total signal T subsequently produced by small presynaptic inputs, but a smaller enhancement, or even a reduction, of the total signal produced by large inputs.

2.2 Distributed ART Model Predicts Redistribution of Synaptic Efficacy. Figure 3 illustrates the frequency-dependent and frequency-independent components of the postsynaptic signal T and shows how these two competing elements combine to produce the change in frequency dependence observed during pairing experiments. In this example, model elements, defined by equation 2.1, are taken to be piecewise linear, although
this choice is not unique. In fact, the general dART model allows a broad range of form factors that satisfy qualitative hypotheses. The model presented here has been chosen for minimality, including only those components needed to produce computational capabilities, and for simplicity of functional form. Throughout, the superscript b (before) denotes values measured before the pairing experiment, and the superscript a (after) denotes values measured after the pairing experiment. The graphs show each system variable as a function of the presynaptic test frequency (I). Variable I is scaled by a factor (Ī Hz), which converts the dimensionless input (see equation 2.1) to frequency in the experimental range. The dimensionless model input corresponds to the experimental test frequency divided by Ī.

[Figure 3, facing-page plot: panel A, frequency-dependent components Sᵇ and Sᵃ (with Sᵃ = Sᵇ at low I); panel B, frequency-independent components Hᵇ and Hᵃ; panel C, combined signals Tᵇ and Tᵃ, where T = S + (1 − a)H; panel D, ratio Tᵃ/Tᵇ (80–200%) versus presynaptic spike frequency (0–40 Hz), with the saturation frequencies 20.3 Hz and 25.8 Hz marked. See the Figure 3 caption.]
In the dART network, postsynaptic nodes are embedded in a field where strong competition typically holds a pattern of activation as a working memory code that is largely insensitive to fluctuations of the external inputs. When a new input arrives, an external reset signal briefly overrides internal competitive interactions, which allows the new pattern to determine its own unbiased code. This reset process is modeled by momentarily setting all postsynaptic activations y = 1. The resulting initial signals T then lock in the subsequent activation pattern, as a function of the internal dynamics of the competitive field. Thereafter, signal components S and H depend on y, which is small at most nodes due to normalization of total activation across the field. The Markram-Tsodyks experiments use isolated cells, so network
Figure 3: Facing page. dART model and Markram-Tsodyks data. (A) The dART postsynaptic frequency-dependent component S increases linearly with the presynaptic test spike frequency I, up to a saturation point. During pairing, the model adaptive threshold τ increases, and the saturation point of the graph of S is proportional to (1 − τ). The saturation point therefore declines as the coding node becomes more selective. Pairing does not alter the frequency-dependent response to low-frequency inputs: Sᵃ = Sᵇ for small I. For high-frequency inputs, Sᵃ is smaller than Sᵇ by a quantity Δ, which is proportional to the amount by which τ has increased during pairing. (B) The dART frequency-independent component H, which is a constant function of the presynaptic input I, increases by Δ during pairing. (C) Combined postsynaptic signal T = S + (1 − a)H, where 0 < a < 1. At low presynaptic frequencies, pairing causes T to increase (Tᵃ = Tᵇ + (1 − a)Δ), because of the increase in the frequency-independent signal component H. At high presynaptic frequencies, pairing causes T to decrease (Tᵃ = Tᵇ − aΔ). (D) For presynaptic spike frequencies below the post-pairing saturation point of Sᵃ, Tᵃ is greater than Tᵇ. For frequencies above the pre-pairing saturation point of Sᵇ, Tᵃ is less than Tᵇ. The interval of intermediate frequencies contains the neutral point where Tᵃ = Tᵇ. Parameters for the dART model were estimated by minimizing the chi-squared statistic (Press, Teukolsky, Vetterling, & Flannery, 1994):

χ² = Σ_{i=1}^{N} [(yᵢ − ŷᵢ) / σᵢ]²,

where yᵢ and σᵢ are the mean value and standard deviation of the ith measurement point, respectively, while ŷᵢ is the model's prediction for that point. Four parameters were used: threshold before pairing (τᵇ = 0.225), threshold after pairing (τᵃ = 0.39), a presynaptic input scale (Ī = 33.28 Hz), and the weighting coefficient (a = 0.6), which determines the contribution of the frequency-dependent component S relative to the frequency-independent component H. The components of the dimensionless postsynaptic signal T = S + (1 − a)H, for a system with a single node in the target field (y = 1), are S = (I/Ī) ∧ (1 − τ) and H = τ. The dART model provides a good fit of the experimental data on changes in synaptic frequency dependence due to pairing (χ²(3) = 1.085, p = 0.78).
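The fitted parameters quoted above can be checked directly. The sketch below (ours, using the piecewise-linear form factor stated in the caption) reproduces the two saturation frequencies and the sign of the pairing-induced change on either side of them:

```python
TAU_B, TAU_A, A, I_BAR = 0.225, 0.39, 0.6, 33.28  # fitted values from Figure 3

def total_signal(I_hz, tau, a=A, i_bar=I_BAR):
    """Dimensionless T = S + (1 - a) * H with y = 1, so that
    S = (I/I_bar) ∧ (1 − τ) and H = τ."""
    return min(I_hz / i_bar, 1.0 - tau) + (1.0 - a) * tau

sat_before = I_BAR * (1.0 - TAU_B)  # pre-pairing saturation point, about 25.8 Hz
sat_after = I_BAR * (1.0 - TAU_A)   # post-pairing saturation point, about 20.3 Hz

# Below the post-pairing saturation point, pairing enhances the signal;
# above the pre-pairing saturation point, pairing depresses it.
ratio_low = total_signal(5.0, TAU_A) / total_signal(5.0, TAU_B)    # > 1
ratio_high = total_signal(35.0, TAU_A) / total_signal(35.0, TAU_B) # < 1
```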
properties are not tested, and Figures 3 and 4 plot dART model equation 2.1 with y = 1. Figure 3A shows the frequency-dependent component of the postsynaptic signal before pairing (Sᵇ) and after pairing (Sᵃ). The frequency-dependent component is directly proportional to I, up to a saturation point, which is proportional to (1 − τ). Tsodyks and Markram (1997) have observed a similar phenomenon: "The limiting frequencies were between 10 and 25 Hz. . . . Above the limiting frequency the average postsynaptic depolarization from resting membrane potential saturates as presynaptic firing rates increase" (p. 720). The existence of such a limiting frequency confirms a prediction of the phenomenological model of synaptic transmission proposed by Tsodyks and Markram (1997), as well as the prediction of distributed ART (Carpenter, 1996, 1997). The dART model also predicts that pairing lowers the saturation point as the frequency-dependent component becomes more selective. Figure 3B illustrates that the frequency-independent component in the dART model is independent of I and that it increases during training. Moreover, the increase in this component (Δ ≡ Hᵃ − Hᵇ = τᵃ − τᵇ) balances the decrease in the frequency-dependent component at large I, where Sᵇ − Sᵃ = Δ. Figure 3C shows how the frequency-dependent and frequency-independent components combine in the dART model to form the net postsynaptic signal T. Using the simplest form factor, the model synaptic signal is taken to be a linear combination of the two components: T = S + (1 − a)H (see equation 2.1). For small I (below the post-pairing saturation point of Sᵃ), pairing causes T to increase, since S remains constant and H increases. For large I (above the pre-pairing saturation point of Sᵇ), pairing causes T to decrease: because (1 − a) < 1, the frequency-independent increase is more than offset by the frequency-dependent decrease.
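These increments can be made exact under the piecewise-linear form factor with y = 1: every input below both saturation points gains (1 − a)Δ from pairing, and every input above both loses aΔ. A brief numerical confirmation (our own, using the parameters fitted in Figure 3):

```python
tau_b, tau_a, a = 0.225, 0.39, 0.6
delta = tau_a - tau_b  # Δ = τᵃ − τᵇ

def T(i, tau):
    """Dimensionless postsynaptic signal at y = 1, with i = I / Ī."""
    return min(i, 1.0 - tau) + (1.0 - a) * tau

i_low, i_high = 0.1, 0.9  # below / above both saturation points (1 − τᵃ, 1 − τᵇ)

# Pairing adds exactly (1 − a)Δ at low frequencies...
assert abs((T(i_low, tau_a) - T(i_low, tau_b)) - (1.0 - a) * delta) < 1e-12
# ...and subtracts exactly aΔ at high frequencies.
assert abs((T(i_high, tau_a) - T(i_high, tau_b)) + a * delta) < 1e-12
```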
The neutral frequency, at which the test pulse I produces the same postsynaptic depolarization before and after pairing, lies between these two intervals. Figure 3D combines the graphs in Figure 3C to replicate the Markram-Tsodyks data on changes in frequency dependence, which are redrawn on this plot. The graph of Tᵃ/Tᵇ is divided into three intervals, determined by the saturation points of S before pairing (I = Ī(1 − τᵇ) = 25.8 Hz) and after pairing (I = Ī(1 − τᵃ) = 20.3 Hz) (see Figure 3A). The neutral frequency lies between these two values. System parameters of the dART model were chosen, in Figure 3, to obtain a quantitative fit to the Markram-Tsodyks (1996) results concerning changes in synaptic potentiation, before pairing versus after pairing. In that preparation, the data exhibit the reversal phenomenon where, for high-frequency test pulses, post-pairing synaptic efficacy falls below its pre-pairing value. Note that dART system parameters could also be chosen to fit data that might show a reduction, but not a reversal, of synaptic efficacy. This might occur, for example, if the test pulse frequency of the theoretical reversal point
[Figure 4 plot: Tᵃ/Tᵇ (%), 80–200%, versus presynaptic spike frequency, 0–40 Hz.]

Figure 4: Transitional RSE ratios. The dART model predicts that if postsynaptic responses were measured at intermediate numbers of pairing intervals, the location of the neutral point, where pairing leaves the ratio Tᵃ/Tᵇ unchanged, would move to the left on the graph. That is, the cross-over point would occur at lower frequencies I.
were beyond the physiological range. Across a wide parameter range, the qualitative properties illustrated here are robust and intrinsic to the internal mechanisms of the dART model. Analysis of the function T = S + (1 − a)H suggests how this postsynaptic signal would vary with presynaptic spike frequency if responses to test pulses were measured at transitional points in the adaptation process (see Figure 4), after fewer than the 30 pairing intervals used to produce the original data. In particular, the saturation point where the curve modeling Tᵃ/Tᵇ flattens out at high presynaptic spike frequency depends only on the state of the system before pairing, so this location remains constant as adaptation proceeds. On the other hand, as the number of pairing intervals increases, the dART model predicts that the neutral point, where the curve crosses the 100% line and Tᵃ = Tᵇ, moves progressively to the left. That is, as the degree of LTP amplification of low-frequency inputs grows, the set of presynaptic frequencies that produce any increased synaptic efficacy shrinks.
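On the intermediate interval, where Sᵃ has saturated but Sᵇ has not, setting Tᵃ = Tᵇ yields a closed-form neutral frequency, I = Ī(1 − τᵃ + (1 − a)(τᵃ − τᵇ)); this expression is our own algebra from the piecewise-linear model, not given explicitly in the text. A short check confirms that the crossover moves to lower frequencies as the post-pairing threshold grows:

```python
def neutral_freq(tau_a, tau_b=0.225, a=0.6, i_bar=33.28):
    """Frequency at which T_after = T_before, assuming the neutral point
    lies between the two saturation points (as in Figure 3D)."""
    return i_bar * (1.0 - tau_a + (1.0 - a) * (tau_a - tau_b))

# With the fitted post-pairing threshold, the neutral point lies between
# the saturation frequencies 20.3 Hz and 25.8 Hz.
assert 20.3 < neutral_freq(0.39) < 25.8

# As pairing proceeds and tau_a grows, the neutral point moves left.
thresholds = [0.25, 0.30, 0.35, 0.39]
freqs = [neutral_freq(t) for t in thresholds]
assert freqs == sorted(freqs, reverse=True)
```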
3 Discussion

3.1 Redistribution of Synaptic Efficacy Supports Stable Pattern Learning. Markram and Tsodyks (1996) report measurements of the initial, transient, and stationary components of the excitatory postsynaptic potential in neocortical pyramidal neurons, bringing to a traditional LTP pairing paradigm a set of nontraditional test stimuli that measure postsynaptic responses at various presynaptic input frequencies. The dART model analysis of these experiments focuses on how the stationary component of the postsynaptic response is modified by learning. This analysis places aspects of the single-cell observations in the context of a large-scale neural network for stable pattern learning. While classical multiplicative models are considered highly plausible, having succeeded in organizing and promoting the understanding of volumes of physiological data, nearly all such models failed to predict adaptive changes in frequency dependence. Learning laws in the dART model operate on a principle of atrophy due to disuse, which allows the network to mold parallel distributed pattern representations while protecting stored memories. The dynamic balance of competing postsynaptic computational components at each synapse dynamically limits memory change, enabling stable fast learning with distributed code representations in a real-time neural model. To date, other competitive learning systems have not realized this combination of computational capabilities. Although dART model elements do not attempt to fit detailed physiological measurements of synaptic signal components, RSE is the computational element that sustains stable distributed pattern coding in the network. As described in section 2.2, the network synapse balances an adaptive increase in a frequency-independent component of the postsynaptic signal against a corresponding frequency-dependent decrease. Local models of synaptic transmission designed to fit the Markram-Tsodyks data are reviewed in section 3.2. These models do not show, however, how adaptive changes in frequency dependence might be implemented in a network with useful computational functions.
In the dART model, the synaptic location of a frequency-independent bias term, realized as an adaptive threshold, leads to dual postsynaptic computations that mimic observed changes in postsynaptic frequency dependence, before versus after pairing. However, producing this effect was not a primary design goal; in fact, model specification preceded the data report. Rather, replication of certain aspects of the Markram-Tsodyks experiments was a secondary result of seeking to design a distributed neural network that does not suffer catastrophic forgetting. The dInstar learning law (see equation 2.2) allows thresholds to change only at highly active coding nodes. This rule stabilizes memory because total activation across the target field is assumed to be bounded, so most of the system's memory traces remain constant in response to a typical input pattern. Defining
long-term change in terms of dynamic weights thus allows significant new information to be encoded quickly at any future time, but also protects the network's previous memories at any given time. In contrast, most neural networks with distributed codes suffer unselective forgetting unless they operate with restrictions such as slow learning. The first goal of the dART network is the coding process itself. In particular, as in a typical coding system, two functionally distinct input patterns need to be able to activate distinct patterns at the coding field. The network accomplishes this by shrinking large dynamic weights just enough to fit the current pattern (see Figure 2). Increased thresholds enhance the net excitatory signal transmitted by this input pattern to currently active coding nodes because learning leaves all frequency-dependent responses to this input unchanged while causing frequency-independent components to increase wherever thresholds increase. On the other hand, increased thresholds can depress the postsynaptic signal produced by a different input pattern, since a higher threshold in a high-frequency path would now cause the frequency-dependent component to be depressed relative to its previous size. If this depression is great enough, it can outweigh the nonspecific enhancement of the frequency-independent component. Local RSE, as illustrated in Figure 3, is an epiphenomenon of this global pattern learning dynamic. A learning process represented as a simple gain increase would only enhance network responses. Recognizing the need for balance, models dating back at least to the McCulloch-Pitts neuron (McCulloch & Pitts, 1943) have included a nodal bias term. In multilayer perceptrons such as backpropagation (Rosenblatt, 1958, 1962; Werbos, 1974; Rumelhart, Hinton, & Williams, 1986), a single bias weight is trained along with all the pattern-specific weights converging on a network node.
The dART model differs from these systems in that each synapse includes both frequency-dependent (pattern-specific) and frequency-independent (nonspecific bias) processes. All synapses then contribute to a net nodal bias. The total increased frequency-independent bias is locally tied to increased pattern selectivity. Although the adaptation process is unidirectional, complement coding, by representing both the original input pattern and its complement, provides a full dynamic range of coding computations.
3.2 Local Models of the Markram-Tsodyks Data. During dInstar learning, the decrease in the frequency-dependent postsynaptic component S balances the increase in the frequency-independent component H. These qualitative properties subserve necessary network computations. However, model perturbations may have similar computational properties, and system components do not uniquely imply a physical model. Models that focus more on the Markram-Tsodyks paradigm with respect to the detailed biophysics of the local synapse, including transient dynamics, are now reviewed.
In the Tsodyks-Markram (1997) model, the limiting frequency, beyond which EPSP_stat saturates, decreases as a depletion rate parameter U_SE (utilization of synaptic efficacy) increases. In this model, as in dART, pairing lowers the saturation point (see Figure 3C). Tsodyks and Markram discuss changes in presynaptic release probabilities as one possible interpretation of system parameters such as U_SE. Abbott et al. (1997) also model some of the same experimental phenomena discussed by Tsodyks and Markram, focusing on short-term synaptic depression. In other model analyses of synaptic efficacy, Markram, Pikus, Gupta, and Tsodyks (1998) and Markram, Wang, and Tsodyks (1998) add a facilitating term to their 1997 model in order to investigate differential signaling arising from a single axonal source. Tsodyks, Pawelzik, and Markram (1998) investigate the implications of these synaptic model variations for a large-scale neural network. Using a mean-field approximation, they "show that the dynamics of synaptic transmission results in complex sets of regular and irregular regimes of network activity" (p. 821). However, their network is not constructed to carry out any specified function; neither is it adaptive. Tsodyks et al. (1998) conclude: "An important challenge for the proposed formulation remains in analyzing the influence of the synaptic dynamics on the performance of other, computationally more instructive neural network models. Work in this direction is in progress" (pp. 831–832). Because the Markram-Tsodyks RSE data follow from the intrinsic functional design goals of a complete system, the dART neural network model begins to meet this challenge.

Acknowledgments
This research was supported by grants from the Air Force Office of Scientific Research (AFOSR F49620-01-1-0397), the Office of Naval Research and the Defense Advanced Research Projects Agency (ONR N00014-95-1-0409 and ONR N00014-1-95-0657), and the National Institutes of Health (NIH 20-3164304-5).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Carpenter, G. A. (1994). A distributed outstar network for spatial pattern learning. Neural Networks, 7, 159–168.
Carpenter, G. A. (1996). Distributed activation, search, and learning by ART and ARTMAP neural networks. In Proceedings of the International Conference on Neural Networks (ICNN'96): Plenary, Panel and Special Sessions (pp. 244–249). Piscataway, NJ: IEEE Press.
Carpenter, G. A. (1997). Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks, 10, 1473–1494.
Carpenter, G. A. (2001). Neural network models of learning and memory: Leading questions and an emerging framework. Trends in Cognitive Sciences, 5, 114–118.
Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115.
Carpenter, G. A., & Grossberg, S. (1990). ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks, 3, 129–152.
Carpenter, G. A., & Grossberg, S. (1993). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neurosciences, 16, 131–137.
Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., & Rosen, D. B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 3, 698–713.
Carpenter, G. A., Grossberg, S., & Reynolds, J. H. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565–588.
Carpenter, G. A., Grossberg, S., & Rosen, D. B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759–771.
Carpenter, G. A., Milenova, B. L., & Noeske, B. W. (1998). Distributed ARTMAP: A neural network for fast distributed supervised learning. Neural Networks, 11, 793–813.
Grossberg, S. (1968). Some physiological and biochemical consequences of psychological postulates. Proc. Natl. Acad. Sci. USA, 60, 758–765.
Grossberg, S. (1972). Neural expectation: Cerebellar and retinal analogs of cells fired by learnable or unlearned pattern classes. Kybernetik, 10, 49–57.
Markram, H., Pikus, D., Gupta, A., & Tsodyks, M. (1998). Potential for multiple mechanisms, phenomena and algorithms for synaptic plasticity at single synapses.
Neuropharmacology, 37, 489–500.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci. USA, 95, 5323–5328.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1994). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Rosenblatt, F. (1962). Principles of neurodynamics. Washington, DC: Spartan Books.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Computation, 10, 821–835.
Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.

Received July 28, 1999; accepted July 5, 2001.
LETTER
Communicated by Hagai Attias
Mean-Field Approaches to Independent Component Analysis

Pedro A.d.F.R. Højen-Sørensen
[email protected] Department of Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark Ole Winther
[email protected] Department of Mathematical Modelling and Center for Biological Sequence Analysis, Department of Biotechnology, Technical University of Denmark, DK-2800 Lyngby, Denmark Lars Kai Hansen
[email protected] Department of Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark We develop mean-eld approaches for probabilistic independent component analysis (ICA). The sources are estimated from the mean of their posterior distribution and the mixing matrix (and noise level) is estimated by maximum a posteriori (MAP). The latter requires the computation of (a good approximation to) the correlations between sources. For this purpose, we investigate three increasingly advanced mean-eld methods: the variational (also known as naive mean eld) approach, linear response corrections, and an adaptive version of the Thouless, Anderson and Palmer (1977) (TAP) mean-eld approach, which is due to Opper and Winther (2001). The resulting algorithms are tested on a number of problems. On synthetic data, the advanced mean-eld approaches are able to recover the correct mixing matrix in cases where the variational meaneld theory fails. For handwritten digits, sparse encoding is achieved using nonnegative source and mixing priors. For speech, the mean-eld method is able to separate in the underdetermined (overcomplete) case of two sensors and three sources. One major advantage of the proposed method is its generality and algorithmic simplicity. Finally, we point out several possible extensions of the approaches developed here. 1 Introduction
Reconstruction of statistically independent source signals from linear mixtures is an active research field with numerous important applications (for

Neural Computation 14, 889–918 (2002)
© 2002 Massachusetts Institute of Technology
890
P. Højen-Sørensen, O. Winther, and L. Hansen
background and references, see Lee, 1998; Girolami, 2000). Blind signal separation in the face of additive noise typically involves four estimation problems: estimation of source signals, source distribution, mixing coefficients, and noise distribution. A full Bayesian treatment of the combined estimation problem is possible but requires extensive Monte Carlo sampling (Belouchrani & Cardoso, 1995); therefore, several authors have proposed variational (also known as mean field or ensemble) approaches in which the posterior distributions are either approximated by factorized gaussians or integrals over the posteriors are evaluated by saddle-point approximations (Attias, 1999; Belouchrani & Cardoso, 1995; Lewicki & Sejnowski, 2000; Lappalainen & Miskin, 2000; Hansen, 2000; Knuth, 1999). The resulting algorithm is an expectation-maximization (EM)–like procedure with the four estimations performed sequentially. One important problem with these approximations arises from the assumed posterior independence of sources. In particular, variational mean-field theory using factorized trial distributions treats only "self-interactions" correctly, while producing trivial second moments, that is, ⟨S_i S_j⟩ = ⟨S_i⟩⟨S_j⟩ for i ≠ j. This is a poor approximation when estimating the mixing matrix and noise distribution, since these estimates will typically depend on correlations. Recently, Kappen and Rodríguez (1998) pointed out that for Boltzmann machines this naive mean-field (NMF) approximation, introduced in this context by Peterson and Anderson (1987), may fail completely in some cases. They went on to propose an efficient learning algorithm based on linear response (LR) theory. LR theory gives a recipe for computing an improved approximation to the covariances directly from the solution to the NMF equations (Parisi, 1988). In this article, we give a general presentation of LR theory and apply it to the probabilistic ICA problem.
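The failure of the factorized approximation can be seen already in the simplest tractable case, a two-dimensional gaussian posterior (a toy example of ours, not from the article): naive mean field reports zero cross-covariance and understates the variances, while the exact covariance, which linear response recovers in the gaussian case, does not factorize:

```python
# Posterior ∝ exp(-x^T J x / 2), with a coupling between the two "sources".
J = [[2.0, 1.0],
     [1.0, 2.0]]

# Naive mean field: a factorized gaussian trial distribution gives
# variance 1/J_ii per source and forces the cross-covariance to zero.
nmf_cov = [[1.0 / J[0][0], 0.0],
           [0.0, 1.0 / J[1][1]]]

# Exact covariance = J^{-1} (computed here by the 2x2 inverse formula).
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
true_cov = [[J[1][1] / det, -J[0][1] / det],
            [-J[1][0] / det, J[0][0] / det]]

# NMF misses the correlation entirely and understates the variances.
assert nmf_cov[0][1] == 0.0
assert true_cov[0][1] != 0.0            # exact cross-covariance is -1/3
assert true_cov[0][0] > nmf_cov[0][0]   # 2/3 > 1/2
```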
We also briefly outline the supposedly more accurate adaptive TAP mean-field theory (Opper & Winther, 2001) and compare this method to the NMF and LR approaches. Whereas estimates of correlations obtained from variational mean-field theory and its linear response correction in general differ, adaptive TAP is constructed such that it is consistent with linear response theory. We expect that advanced mean-field methods such as LR and TAP can be useful in the many contexts within neural computation where variational mean-field theory has already proven to be useful, for instance, for sigmoid belief networks (Saul, Jaakkola, & Jordan, 1996). In our experience, the main difference between variational mean field and the advanced methods lies in the estimates of correlations (often needed in algorithms of the EM type) and the calculation of the likelihood of the data. We will not discuss the latter here, however (see Opper & Winther, 2001, for a general method for computing the likelihood from the covariance matrix). In ICA simulations, we find that the variational approach can fail, typically by ignoring some of the sources and consequently overestimating the noise
Mean-Field Approaches to Independent Component Analysis
covariance. The LR and TAP approaches, on the other hand, succeed in all cases studied. However, we do not find a significant improvement using TAP (which is also somewhat more computationally intensive), suggesting that LR is close to being the optimal mean-field approach for the probabilistic ICA model. The derivations of the mean-field equations are valid for a general source prior (without temporal correlation) and tractable for priors that can be folded analytically with a gaussian distribution. This includes mixtures of gaussians, Laplacian, and binary distributions. For other priors, one has to evaluate an extensive number of one-dimensional integrals numerically. Alternatively, one can construct computationally tractable ICA algorithms using priors that are defined only implicitly. To illustrate this point, we define one such algorithm, which approximately corresponds to the prior having a power law tail. To underline the flexibility and computational power of the probabilistic ICA framework and its mean-field implementation, we give two quite different real-world examples of recent interest that can be solved straightforwardly in this framework. The first example is that of separating speech in the overcomplete setting of two sensors and three sources (Lewicki & Sejnowski, 2000) using a heavy-tailed source prior such as a Laplacian or the (approximate) power law prior described above. The second real-world problem considered is that of feature extraction in images. For images, it is natural to work with a nonnegativity constraint for the mixing matrix and sources, as in Lee and Seung (1999) and Miskin and MacKay (2000). In the probabilistic framework, this type of prior knowledge is readily built into the mixing matrix and source priors.
Throughout this article, we confine ourselves to fixed source priors. There are, however, no theoretical problems in extending the EM algorithm to estimating hyperparameters (see Attias, 1999, for an example of such source prior parameter estimation). In section 2 the basic probabilistic ICA model and the associated learning problem are stated. Section 3 concerns the inference part of the learning problem; we will see that variational mean-field theory, linear response theory, and the adaptive TAP approach can be seen as stepwise more refined ways of estimating correlations. Applying the advanced mean-field methods to independent component analysis is the main contribution of this article. Another contribution is the generality of the framework. In section 4, we examine various types of explicitly given source priors, which leads us to define an implicitly given source prior. The impatient or application-minded reader might consult section 4.1, which shows a table summarizing all priors considered. Section 5 shows some simulation results on both synthetic and real-world data. Finally, obvious ways to extend this work are outlined in the conclusion in section 6. The pseudocode for the algorithm is outlined in appendix A, and some additional priors not directly used are given in appendix B.
2 Probabilistic ICA
We formulate the ICA problem as follows (Hansen, 2000). The measurements are a collection of $N$ temporal $D$-dimensional signals $X = \{X_{dt}\}$, $d = 1, \ldots, D$ and $t = 1, \ldots, N$, where $X_{dt}$ denotes the measurement at the $d$th sensor at time $t$. Similarly, let $S = \{S_{mt}\}$, $m = 1, \ldots, M$, denote a collection of $M$ mutually statistically independent sources, where $S_{mt}$ is the $m$th source at time $t$. The measured signal $X$ is assumed to be an instantaneous linear mixing of the sources corrupted with additive white gaussian noise $\Gamma$, that is,

$$X = A S + \Gamma, \qquad (2.1)$$

where $A$ is a (time-independent) mixing matrix and the noise is assumed to be without temporal correlations and with time-independent covariance matrix $\Sigma$, that is, we have $\langle \Gamma_{dt} \Gamma_{d't'} \rangle = \delta_{tt'} \Sigma_{dd'}$. We thus have the following likelihood for parameters and sources:

$$P(X \mid A, \Sigma, S) = (\det 2\pi\Sigma)^{-N/2} \, e^{-\frac{1}{2} \operatorname{Tr} (X - AS)^T \Sigma^{-1} (X - AS)}. \qquad (2.2)$$

The aim of ICA is to recover the unknown quantities, namely the sources $S$, the mixing matrix $A$, and the noise covariance $\Sigma$, from the observed data. The main difficulty is associated with estimating the source signals. The estimation problems for the mixing matrix and the noise covariance matrix are relatively simple, given the sufficient source statistics. Hence, our primary objective is to improve on the estimate of sufficient statistics from the posterior distribution of the sources. The mixing matrix $A$ and the noise covariance $\Sigma$ are then estimated by maximum a posteriori (MAP) (or maximum likelihood II, ML-II). This naturally leads to an EM-type algorithm where the expectation step amounts to finding the posterior mean and covariances of the sources and the maximization step is the MAP/ML-II estimation. Mean-field methods, especially the advanced ones, are well suited for the nontrivial expectation step. Given the likelihood equation 2.2, the posterior distribution of the sources is readily given by

$$P(S \mid X, A, \Sigma) = \frac{P(X \mid A, \Sigma, S) \, P(S)}{P(X \mid A, \Sigma)}, \qquad (2.3)$$

where $P(S)$ is a prior on the sources, which might include temporal correlations (although we will postpone this problem to a future contribution).

2.1 Estimation of Mixing Matrix and Noise Covariance. The likelihood of the parameters is given by

$$P(X \mid A, \Sigma) = \int dS \, P(X \mid A, \Sigma, S) \, P(S). \qquad (2.4)$$
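As a concrete illustration, the generative model of equations 2.1 and 2.2 can be sampled directly. This is a minimal sketch of ours; the toy dimensions and the binary source choice are hypothetical and chosen only to mirror the experiment of section 5.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: D sensors, M sources, N time samples.
D, M, N = 2, 2, 1000

A = rng.normal(size=(D, M))                 # mixing matrix
S = rng.choice([-1.0, 1.0], size=(M, N))    # binary sources, P(S = +/-1) = 1/2
Sigma = 0.3 * np.eye(D)                     # noise covariance, Sigma = s^2 I

# Instantaneous linear mixing with additive white gaussian noise: X = A S + Gamma
Gamma = rng.multivariate_normal(np.zeros(D), Sigma, size=N).T
X = A @ S + Gamma
```

The inference problem of the paper is then to recover $A$, $\Sigma$, and the source statistics from $X$ alone.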
The problem of estimating the mixing matrix and noise covariance now amounts to finding the saddle points of the likelihood equation 2.4 with respect to the mixing matrix and noise covariance. We note that the saddle points will be given in terms of averages over the source posterior. These calculations of mean sufficient statistics with respect to the posterior are the main challenge for mean-field approaches since the sources will be coupled through the observations. The mixing matrix $A$ will be estimated by MAP and the noise by ML-II for convenience,

$$A_{\mathrm{MAP}} = \operatorname*{argmax}_{A} P(A \mid X, \Sigma) \qquad (2.5)$$
$$\Sigma_{\mathrm{MLII}} = \operatorname*{argmax}_{\Sigma} P(X \mid A, \Sigma), \qquad (2.6)$$

where the posterior of $A$ is given by $P(A \mid X, \Sigma) \propto P(X \mid A, \Sigma) P(A)$, where $P(A)$ is the prior on $A$. For the optimization in equations 2.5 and 2.6, we need the derivatives of the likelihood term,

$$\frac{\partial}{\partial A} \log P(X \mid A, \Sigma) = \Sigma^{-1} \left( X \langle S \rangle^T - A \langle S S^T \rangle \right) \qquad (2.7)$$
$$\frac{\partial}{\partial \Sigma} \log P(X \mid A, \Sigma) = \frac{1}{2} \Sigma^{-1} \langle (X - AS)(X - AS)^T \rangle \Sigma^{-1} - \frac{N}{2} \Sigma^{-1}, \qquad (2.8)$$

where $\langle \cdot \rangle = \langle \cdot \rangle_{S \mid A, \Sigma, X}$ denotes the posterior average with respect to the sources given the mixing matrix and noise covariance. Equating equation 2.8 to zero leads to the well-known result for $\Sigma$,

$$\Sigma_{\mathrm{MLII}} = \frac{1}{N} \langle (X - AS)(X - AS)^T \rangle. \qquad (2.9)$$

In the particular case of measurements with independently and identically distributed noise, we can simplify the covariance to $\Sigma = s^2 I$; hence, $s^2 = \operatorname{Tr} \Sigma_{\mathrm{MLII}} / D$, where $D$ is the number of sensors.

For $A$, we consider two factorized priors, $P(A) = \prod_{dm} P(A_{dm})$: a zero mean gaussian $P(A_{dm}) \propto \exp(-\alpha_{dm} A_{dm}^2 / 2)$ and the Laplace distribution $P(A_{dm}) \propto \exp(-\beta_{dm} |A_{dm}|)$. Furthermore, we consider optimizing $A_{dm}$ both unconstrained and constrained to be nonnegative. Clearly, the MAP approach offers a flexibility for encoding prior knowledge about $A$ that is not available in the maximum likelihood II approach; one can encode sparseness (Hyvärinen & Karthikesh, 2000) and nonnegativeness (for images and text; see section 5 and Lee & Seung, 1999; Miskin & MacKay, 2000).

2.1.1 Unconstrained Mixing Matrices. A straightforward calculation gives the following iterative equation for the MAP estimate of $A$,

$$A^{(k+1)} = \left( X \langle S \rangle^T - \Sigma \left( \alpha A^{(k)} + \beta \operatorname{sign}(A^{(k)}) \right) \right) \langle S S^T \rangle^{-1}, \qquad (2.10)$$
where we have included both priors and set $\alpha_{dm} = \alpha$ and $\beta_{dm} = \beta$. This equation can be solved explicitly for the gaussian prior with equal noise variance on all sensors, $\beta = 0$ and $\Sigma = s^2 I$:

$$A = X \langle S \rangle^T \left( \langle S S^T \rangle + \alpha s^2 I \right)^{-1}. \qquad (2.11)$$

The ML-II estimate is the special case obtained by setting $\alpha = 0$.

2.1.2 Nonnegative Mixing Matrices. To enforce nonnegative $A$, we introduce a set of nonnegative Lagrange multipliers $\Lambda_{dm} \geq 0$ and maximize the modified cost: $\log P(A \mid X, \Sigma) + \operatorname{Tr} \Lambda^T A$. Solving for the Lagrange multipliers, we get

$$\Lambda = \Sigma^{-1} \left( A \langle S S^T \rangle - X \langle S \rangle^T \right) + \alpha A + \beta. \qquad (2.12)$$

We can write down an iterative update rule for $A_{dm} > 0$ using the Kuhn-Tucker condition $\Lambda_{dm} A_{dm} = 0$ (Luenberger, 1984), together with the result for the Lagrange multipliers:

$$A_{dm}^{(k+1)} = \frac{\left[ \Sigma^{-1} X \langle S \rangle^T \right]_{dm}}{\left[ \Sigma^{-1} A^{(k)} \langle S S^T \rangle \right]_{dm} + \alpha A_{dm}^{(k)} + \beta} \, A_{dm}^{(k)}. \qquad (2.13)$$
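Both mixing-matrix updates are a few lines of linear algebra. The sketch below is ours (function names are hypothetical); it takes the posterior source statistics $\langle S \rangle$ and $\langle S S^T \rangle$ as inputs, since producing those is the mean-field part of the algorithm:

```python
import numpy as np

def update_A_map(X, S_mean, SS_mean, alpha=0.0, s2=1.0):
    """Closed-form MAP update of equation 2.11:
    A = X <S>^T (<SS^T> + alpha s^2 I)^{-1}; alpha = 0 gives the ML-II estimate."""
    M = SS_mean.shape[0]
    return X @ S_mean.T @ np.linalg.inv(SS_mean + alpha * s2 * np.eye(M))

def update_A_nonneg(A, X, S_mean, SS_mean, Sigma_inv, alpha=0.0, beta=0.0):
    """One multiplicative step of equation 2.13; keeps A elementwise
    nonnegative when initialized positive (Kuhn-Tucker conditions)."""
    num = Sigma_inv @ X @ S_mean.T
    den = Sigma_inv @ A @ SS_mean + alpha * A + beta
    return A * num / den
```

With exact source statistics ($\langle S \rangle = S$, $\langle S S^T \rangle = S S^T$) and $\alpha = 0$, the MAP update reduces to ordinary least squares, and the true mixing matrix is a fixed point of the multiplicative rule.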
In the case of no prior knowledge, $\alpha = 0$ and $\beta = 0$, we get an update rule similar to the image space reconstruction algorithm used in positron emission tomography (see, e.g., Pierro, 1993, for references) or the more recently proposed nonnegative matrix factorization procedure of Lee and Seung (1999).

3 Mean-Field Theory
We present three different mean-field approaches that give us estimates of the source second moment matrix of increasing quality. First, we derive mean-field equations using the standard variational mean-field theory. Next, using linear response theory, we obtain directly from the variational solution improved estimates of $\langle S S^T \rangle$ needed for estimating $A$ and $\Sigma$. Finally, we present the adaptive TAP approach of Opper and Winther (2001), which goes beyond the simple factorized trial distribution of variational mean-field theory to give a theory that is self-consistent to within linear response corrections. From mean-field theory, we also get an approximation to the likelihood $P(X \mid A, \Sigma)$, which can be used for model selection (Hansen, 2000).¹ In appendix A, we summarize all mean-field equations and give an EM-type recipe for solving them.

¹ The variational approximation is a lower bound to the exact likelihood, whereas the TAP and LR approximations (not given here) are not bounds, but hopefully are more accurate.
The following derivation is valid for any source prior without temporal correlations. Specific source priors are discussed in section 4. Although equations for the mean-field estimates of the mean and covariance of the sources are written with equality in this section, it is to be understood that they are only approximations.

3.1 Variational Approach. We adopt a standard variational mean-field theoretic approach and approximate the posterior distribution, $P(S \mid X, A, \Sigma)$, in a family of product distributions $Q(S) = \prod_{mt} Q(S_{mt})$.² For a gaussian likelihood $P(X \mid A, \Sigma, S)$, the optimal choice of $Q(S_{mt})$ is given by a gaussian times the prior (Csató, Fokoué, Opper, Schottky, & Winther, 2000):

$$Q(S_{mt}) \propto P(S_{mt}) \, e^{-\frac{1}{2} \lambda_{mt} S_{mt}^2 + \gamma_{mt} S_{mt}}. \qquad (3.1)$$

To simplify the notation in the following, we will parameterize the likelihood as

$$P(X \mid A, \Sigma, S) = P(X \mid J, h, S) = \frac{1}{C} \, e^{-\frac{1}{2} \operatorname{Tr}(S^T J S) + \operatorname{Tr}(h^T S)}, \qquad (3.2)$$

where $\log C = \frac{N}{2} \log \det 2\pi\Sigma + \frac{1}{2} \operatorname{Tr} X^T \Sigma^{-1} X$, and the $M \times M$ interaction matrix $J$ and the field $h$ (having the same dimension as $S$) are given by

$$J = A^T \Sigma^{-1} A \qquad (3.3)$$
$$h = A^T \Sigma^{-1} X. \qquad (3.4)$$
Note that $h$ acts as an external field from which all moments of the sources can be obtained. This is the key property that we will make use of in the next section when we derive the linear response corrections. The starting point of the variational derivation of mean-field equations is the Kullback-Leibler divergence between the product distribution $Q(S)$ and the true source posterior:

$$\mathrm{KL} = \int dS \, Q(S) \log \frac{Q(S)}{P(S \mid X, A, \Sigma)} = \log P(X \mid A, \Sigma) - \log P(X \mid A, \Sigma, \mathrm{NMF}) \qquad (3.5)$$

$$\log P(X \mid A, \Sigma, \mathrm{NMF}) = \sum_{mt} \log \int dS_{mt} \, P(S_{mt}) \, e^{-\frac{1}{2} \lambda_{mt} S_{mt}^2 + \gamma_{mt} S_{mt}} + \frac{1}{2} \sum_{mt} (\lambda_{mt} - J_{mm}) \langle S_{mt}^2 \rangle + \operatorname{Tr} (h - \gamma)^T \langle S \rangle + \frac{1}{2} \operatorname{Tr} \langle S^T \rangle (\operatorname{diag}(J) - J) \langle S \rangle - \ln C, \qquad (3.6)$$

where $P(X \mid A, \Sigma, \mathrm{NMF})$ is the naive mean-field approximation to the likelihood and $\operatorname{diag}(J)$ is the diagonal matrix of $J$. The Kullback-Leibler divergence is zero when $P = Q$ and positive otherwise. The parameters of $Q$ should consequently be chosen so as to minimize KL. The saddle points define the mean-field equations:³

$$\frac{\partial \mathrm{KL}}{\partial \langle S \rangle} = 0: \quad \gamma = h - (J - \operatorname{diag}(J)) \langle S \rangle \qquad (3.7)$$
$$\frac{\partial \mathrm{KL}}{\partial \langle S_{mt}^2 \rangle} = 0: \quad \lambda_{mt} = J_{mm}. \qquad (3.8)$$

² Note that $Q(S_{mt})$ is also the variational mean-field approximation to the marginal distribution $\int \prod_{m' \neq m, \, t' \neq t} dS_{m't'} \, P(S \mid X, A, \Sigma)$.
The remaining two equations depend explicitly on the source prior, $P(S)$:

$$\frac{\partial \mathrm{KL}}{\partial \gamma_{mt}} = 0: \quad \langle S_{mt} \rangle = \frac{\partial}{\partial \gamma_{mt}} \log \int dS_{mt} \, P(S_{mt}) \, e^{-\frac{1}{2} \lambda_{mt} S_{mt}^2 + \gamma_{mt} S_{mt}} \equiv f(\gamma_{mt}, \lambda_{mt}) \qquad (3.9)$$

$$\frac{\partial \mathrm{KL}}{\partial \lambda_{mt}} = 0: \quad \langle S_{mt}^2 \rangle = -2 \frac{\partial}{\partial \lambda_{mt}} \log \int dS_{mt} \, P(S_{mt}) \, e^{-\frac{1}{2} \lambda_{mt} S_{mt}^2 + \gamma_{mt} S_{mt}}. \qquad (3.10)$$

The variational mean $f(\gamma_{mt}, \lambda_{mt})$ plays a crucial role in defining the mean-field algorithm since all dependence on the prior is implicit in $f$ (and in $\partial f / \partial \gamma$ as well for the advanced methods). Combining equations 3.8, 3.7, and 3.9, we see that the variational mean is given in terms of a set of coupled fixed-point equations, which depend on the interaction matrix $J$ and external field $h$. (For details on how to solve this set of fixed-point equations, see appendix A.) An analysis of different strategies for iterating the mean-field equations in the context of Potts spin glasses can be found in Peterson and Söderberg (1989). In section 4, we calculate $f(\gamma_{mt}, \lambda_{mt})$ for some of the prior distributions found in the ICA literature. We will primarily consider source priors that can be integrated analytically against the gaussian kernel and hence avoid numerical integration.

³ The requirement that we should be at a local minimum of $\log P(X \mid A, \Sigma, \mathrm{NMF})$ is fulfilled when the covariance matrix equation 3.12 is positive definite. To test whether we are at the global minimum is harder. However, when the model is well matched to the data, we expect the problem to be convex.
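For the binary prior, Table 1 gives $f(\gamma, \lambda) = \tanh(\gamma)$, and the coupled equations 3.7 and 3.9 can be iterated directly. The sketch below is ours: the damping factor is our own convergence aid, not part of the paper's recipe (appendix A of the paper gives the authors' own iteration scheme):

```python
import numpy as np

def nmf_binary_means(J, h, n_iter=200, damping=0.5):
    """Naive mean-field fixed point for a binary prior:
    iterate gamma = h - (J - diag(J)) <S>   (eq. 3.7) and
            <S>   = f(gamma) = tanh(gamma)  (eq. 3.9, binary prior)."""
    m = np.zeros_like(h)
    J_off = J - np.diag(np.diag(J))
    for _ in range(n_iter):
        gamma = h - J_off @ m
        # damped update: convex mix of new and old estimate
        m = (1.0 - damping) * np.tanh(gamma) + damping * m
    return m
```

At convergence the returned means satisfy the self-consistency condition $\langle S \rangle = \tanh(h - (J - \operatorname{diag}(J)) \langle S \rangle)$.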
3.2 Linear Response Theory. So far we have not discussed how to obtain mean-field approximations to the covariances

$$\chi_{mm'}^{tt'} \equiv \langle S_{mt} S_{m't'} \rangle - \langle S_{mt} \rangle \langle S_{m't'} \rangle.$$

Since variational mean-field theory uses a factorized trial distribution, the covariances among variables are trivially predicted to be zero. However, using linear response theory, we can improve the variational mean-field solution. As mentioned earlier, $h$ acts as an external field. This makes it possible to calculate the means and covariances as derivatives of $\log P(X \mid J, h)$:

$$\langle S_{mt} \rangle = \frac{\partial \log P(X \mid J, h)}{\partial h_{mt}} \qquad (3.11)$$
$$\chi_{mm'}^{tt'} = \frac{\partial^2 \log P(X \mid J, h)}{\partial h_{m't'} \, \partial h_{mt}} = \frac{\partial \langle S_{mt} \rangle}{\partial h_{m't'}}. \qquad (3.12)$$

These relations are exact when using the exact likelihood. However, we can also use the NMF likelihood through the mean-field equations 3.7 through 3.9 to derive an approximate equation for $\chi_{mm'}^{tt'}$:

$$\chi_{mm'}^{tt'} = \frac{\partial f(\gamma_{mt}, \lambda_{mt})}{\partial \gamma_{mt}} \frac{\partial \gamma_{mt}}{\partial h_{m't'}} = \frac{\partial f(\gamma_{mt}, \lambda_{mt})}{\partial \gamma_{mt}} \left( - \sum_{m'' \neq m} J_{mm''} \, \chi_{m''m'}^{tt'} + \delta_{mm'} \delta_{tt'} \right). \qquad (3.13)$$
As a direct consequence of the lack of temporal correlations in the present tt0 t setting, the Â-matrix factorizes in time: Âmm 0 D d tt 0 Âmm0 . We can straightfort wardly solve for Âmm0 , h t (¤ Âmm 0 D
t C
J ) ¡1
i mm0
,
(3.14)
where we have dened the diagonal matrix ³
¤
t
D diag (L 1t , . . . , L Mt ) ,
L mt ´
@f (c mt , lmt ) @c mt
´ ¡1
¡ Jmm . (3.15) @hS
i
t,NMF For comparison, the naive mean-eld result is Âmm D dmm0 @h mt , which 0 mt follows directly from equation 3.10. Why is the covariance matrix obtained by linear response more accurate? Here, we give an argument that can be found in Parisi’s book on statistical eld theory (Parisi, 1988). Let us assume (as always implicit in any meaneld theory) that the approximate and exact distribution is close in some sense, that is, P (S | X , A , § ) D Q ( S ) C e where e is small. Then by direct
898
P. Højen-Sørensen, O. Winther, and L. Hansen
t,exact t,NMF application of the factorized distribution, we have Âmm D Âmm0 C O ( e ) . 0 By exploiting the nonnegativity of KL, equation 3.5, we can also prove that 2
( |
@ log P X J, h,NMF t,LR the linear response estimate Âmm has an error of O ( e 2 ) . 0 D @hm0 t @hmt Since KL ¸ 0, the NMF theory log-likelihood gives a lower bound on the log-likelihood, and consequently, the linear term vanishes in the expansion of the log-likelihood: log P ( X | J, h ) D log P (X | J, h , NMF) C O (e 2 ) . This shows that as long as e is small, estimates of moments obtained by linear response are more accurate than using the trial distribution directly. For a specic case, it is possible to demonstrate the improvement directly. Consider the gaussian prior4 P ( Smt ) / exp(¡S2mt / 2). In this case, the variational mean eld, equation 3.9, is given by f (c , l) D c / (1 C l) . Thus, the @hS i t,NMF variational mean-eld theory predicts Âmm D dmm0 @hmt D 1 / (1 C lmt ) D 0 mt 1 / (1 C Jmm ). However, the linear response estimate, equation 3.14, gives £ ¤ t,LR Âmm D ( I C J ) ¡1 mm0 and hence reconstructs the full covariance matrix 0 identical with the exact result obtained by direct integration.
)
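The gaussian-prior comparison can be checked numerically in a few lines. The random interaction matrix below is an arbitrary test case of ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
J = B @ B.T                        # a positive semi-definite interaction matrix

# Gaussian prior: f(gamma, lam) = gamma/(1 + lam), so df/dgamma = 1/(1 + lam).
# With lam_m = J_mm (eq. 3.8), Lambda_m = (df/dgamma)^{-1} - J_mm = 1 (eq. 3.15).
chi_lr = np.linalg.inv(np.eye(3) + J)           # linear response, eq. 3.14
chi_nmf = np.diag(1.0 / (1.0 + np.diag(J)))     # naive mean field (diagonal only)

# The posterior itself is gaussian with precision matrix I + J, so the exact
# covariance is (I + J)^{-1}: LR reproduces it; NMF misses the off-diagonals.
chi_exact = np.linalg.inv(np.eye(3) + J)
```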
3.3 Adaptive TAP Approach. So far we have derived two different estimates of the covariance matrix from variational mean-field theory: $\chi_{mm'}^{t,\mathrm{NMF}} = \delta_{mm'} \frac{\partial \langle S_{mt} \rangle}{\partial h_{mt}}$ and $\chi_{mm'}^{t,\mathrm{LR}} = [(\Lambda^t + J)^{-1}]_{mm'}$. Obviously there is no guarantee that the two estimates are identical. Variational mean-field theory is thus not self-consistent to within linear response corrections. The adaptive TAP approach (Opper & Winther, 2001), on the other hand, goes beyond the factorized trial distribution and requires self-consistency for the covariances estimated by linear response. It is beyond the scope of this article to rederive the adaptive TAP mean-field theory. Consult Opper and Winther (2001) for a derivation valid for a model with quadratic interactions and general variable prior, the model considered in this article. We have chosen to present and test adaptive TAP because it offers the most advanced (and, we hope, the most precise) mean-field approximation for this type of model.

The self-consistency is achieved by introducing a set of $MN$ additional mean-field (or variational) parameters, the variances $\lambda_{mt}$ in the marginal distribution equation 3.1, such that the diagonal term $\chi_{mm}^{t,\mathrm{TAP}}$ obeys

$$\frac{\partial \langle S_{mt} \rangle}{\partial h_{mt}} = \left[ (\Lambda^t + J)^{-1} \right]_{mm}, \qquad (3.16)$$

where $\Lambda_{mt}$ and $\gamma_{mt}$ now depend on $\lambda_{mt}$:

$$\Lambda_{mt} = \left( \chi_{mm}^{t} \right)^{-1} - \lambda_{mt} \qquad (3.17)$$
$$\gamma_{mt} = h_{mt} - \sum_{m'} \left( J_{mm'} - \lambda_{m't} \delta_{mm'} \right) \langle S_{m't} \rangle. \qquad (3.18)$$

To recover the variational mean-field equations, 3.15 and 3.7, we let $\lambda_{mt} = J_{mm}$.

⁴ A gaussian source prior is not suitable for doing source separation. We merely use it here to show that the linear response correction in this case recovers the exact result.

4 Source Models
In this section we calculate, for various source priors, the variational mean $f$, equation 3.9, and the derivative $\partial f / \partial \gamma$ needed for the linear response correction and adaptive TAP. The priors that we are considering are all chosen such that the variational mean can be calculated using tables of standard integrals (Gradshteyn & Ryzhik, 1980). It turns out to be convenient to introduce the gaussian kernel $\varphi$ with unit variance and its associated cumulative distribution function (cdf) $\Phi$ in order to keep the following expressions of a manageable size:

$$\varphi(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} x^2 \right), \qquad \varphi'(x) = -x \varphi(x) \qquad (4.1)$$
$$\Phi(x) = \int_{-\infty}^{x} \varphi(t) \, dt, \qquad \Phi'(x) = \varphi(x). \qquad (4.2)$$
4.1 Summary of Source Priors. Table 1 summarizes the variational means and response functions corresponding to the priors described in this article. This is by no means a complete list of all priors for which it is possible to calculate these quantities (e.g., the Rayleigh distribution is one such prior). 4.2 Mixture of Gaussians Source Prior. In this section, we consider a
general mixture of gaussians,

$$p(S \mid \mu, \sigma) = \sum_{i=1}^{N_m} p_i \, p(S \mid \mu_i, \sigma_i), \qquad S \in \mathbb{R}, \qquad (4.3)$$

where each of the $N_m$ individual mixture components is parameterized by

$$p(S \mid \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \, e^{-\frac{1}{2} (S - \mu_i)^2 / \sigma_i^2}. \qquad (4.4)$$

Using this source prior, the generative ICA model becomes the independent factor analysis model proposed in Attias (1999). Since the main scope of this article is concerned with reliably inferring mean sufficient statistics with respect to the sources, we will, contrary to Attias (1999), always regard the source parameters as fixed; we are at no time adapting the source priors to
Table 1: Variational Mean and Response Function Corresponding to Various Source Priors.

Negative kurtosis:
- Binary: $P(S) = \frac{1}{2}\delta(S-1) + \frac{1}{2}\delta(S+1)$; mean function $f(\gamma, \lambda) = \langle S \rangle = \tanh(\gamma)$; response function $\partial \langle S \rangle / \partial \gamma = 1 - \langle S \rangle^2$.
- Gaussian mixture: $P(S)$ given by equation 4.3; mean function given by equations 4.5 and 4.7; response function obtained by differentiating equation 4.5 (omitted for space; see section 4.2).
- Uniform: $P(S) = \frac{1}{b-a} H(S-a) H(b-S)$; mean function given by equation B.14; response function by equation B.15.

Zero kurtosis:
- Gaussian: $P(S) = \frac{1}{\sqrt{2\pi}} \exp(-S^2/2)$; mean function $\gamma / (1 + \lambda)$; response function $1 / (1 + \lambda)$.

Positive kurtosis:
- Laplace: $P(S) = \frac{1}{2} \exp(-|S|)$; mean function given by equation 4.9; response function by equation 4.11.
- Heavy tail: $P(S)$ not analytic; mean function $\gamma/\lambda - \alpha\gamma/(\alpha\lambda + \gamma^2)$; response function $1/\lambda + \alpha(\gamma^2 - \alpha\lambda)/(\alpha\lambda + \gamma^2)^2$.

Nonnegative:
- Positive gaussian: $P(S) = \sqrt{2/\pi} \exp(-S^2/2) H(S)$; mean function given by equation B.9; response function by equation B.11.
- Exponential: $P(S) = \exp(-S) H(S)$; mean function given by equation 4.13; response function by equation 4.14.

Notes: The first three groups contain source priors having negative, zero, and positive kurtosis, respectively; the last group contains nonnegative priors. The step function is defined as $H(S) = 1$ for $S > 0$ and zero otherwise.
data. However, it is straightforward to extend the proposed methodology to allow for this possibility, for example, in an EM setting where the improved mean-field solutions are being used in the posterior expectation of the complete log-likelihood.

Trivial but tedious calculations show that the variational mean $f$ of a mixture of gaussians is given by

$$f = \frac{\sum_{i=1}^{N_m} \kappa_i \, e^{\zeta_i} \, \frac{\gamma \sigma_i^2 + \mu_i}{\lambda \sigma_i^2 + 1}}{\sum_{i=1}^{N_m} \kappa_i \, e^{\zeta_i}}, \qquad (4.5)$$

where we have introduced

$$\kappa_i = \frac{p_i}{\sqrt{\lambda \sigma_i^2 + 1}}, \quad \text{and} \quad \zeta_i = \frac{1}{2} \left( \frac{(\gamma \sigma_i + \mu_i / \sigma_i)^2}{\lambda \sigma_i^2 + 1} - (\mu_i / \sigma_i)^2 \right). \qquad (4.6)$$

The derivatives with respect to $\gamma$ are easy to obtain but are omitted in the interest of space. For the special case of a mixture of two gaussians ($N_m = 2$) with common variance $\sigma^2$ and means $\mu_i = \pm\mu$, we get

$$f = \frac{1}{\lambda \sigma^2 + 1} \left( \gamma \sigma^2 + \mu \tanh\left( \frac{\gamma \mu}{\lambda \sigma^2 + 1} \right) \right). \qquad (4.7)$$
For $\sigma^2 = 0$ and $\mu = 1$, we recover the variational mean for the binary source $P(S) = \frac{1}{2}\delta(S - 1) + \frac{1}{2}\delta(S + 1)$: $f = \tanh(\gamma)$. This particular choice of the bigaussian source distribution, equation 4.7, which is also known as the symmetric Pearson mixture density, was proposed in Girolami (1998) as a simple way of achieving a negative kurtosis (subgaussian) density function. To become familiar with the $f$-function and its derivative, consider the variational mean of the bigaussian with $\sigma^2 = 1$ shown in Figures 1a and 1b for two values of $\mu$: $\mu = 1$, for which the density function is unimodal, and $\mu = 4$, for which the density function is significantly bimodal. We see that the more bimodal the source distribution is, the more compact the region of high curvature becomes. By introducing additional mixture components, it is possible to shape the region of high curvature, which is illustrated in Figure 1g in the case of a mixture of five gaussians.

4.3 Laplace Source Prior. Although a subgaussian distribution may be a reasonable source prior for some applications, such as telecommunications (discrete priors; see van der Veen, 1997) or processing of functional magnetic resonance images (Petersen, Hansen, Kolenda, Rostrup, & Strother, 2000), there is a large class of interesting real-world signals, such as speech, with heavier tails than the gaussian distribution. We therefore need to consider source priors that have positive kurtosis (supergaussian). One such choice, which has been widely used in the ICA community, is $P(S) = 1 / (\pi \cosh S)$ (Bell & Sejnowski, 1995; MacKay, 1996). Using this prior, however, it is not possible to calculate the variational mean analytically. Instead, we consider the Laplace or double exponential distribution, which is very similar. The Laplace density is given by
$$p(S) = \frac{\eta}{2} \, e^{-\eta |S|}, \qquad S \in \mathbb{R}, \; \eta > 0. \qquad (4.8)$$

The variational mean can be calculated as

$$f = \frac{1}{\sqrt{\lambda}} \, \frac{\xi_+ \kappa_+ + \xi_- \kappa_-}{\kappa_+ + \kappa_-}, \qquad (4.9)$$

where we have introduced

$$\xi_\pm = \frac{\gamma \mp \eta}{\sqrt{\lambda}}, \quad \text{and} \quad \kappa_\pm = \Phi(\pm \xi_\pm) \, \varphi(\xi_\mp). \qquad (4.10)$$

Using equations 4.1 and 4.2, the derivative is found to be

$$\frac{\partial f}{\partial \gamma} = \frac{1}{\lambda} \left( 1 - \xi_+ \xi_- + \frac{(\xi_+ - \xi_-) \, \varphi(\xi_+) \varphi(\xi_-)}{\kappa_+ + \kappa_-} \right) + \frac{f}{\sqrt{\lambda}} \, \frac{\xi_+ \kappa_- + \xi_- \kappa_+}{\kappa_+ + \kappa_-}. \qquad (4.11)$$
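Equations 4.9 and 4.10 can be implemented in a few lines and cross-checked against a brute-force quadrature of the defining integral, equation 3.9. This is a sketch of ours; the trapezoid check and its grid limits are our own choices:

```python
import math

def phi(x):
    """Standard gaussian pdf, eq. 4.1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard gaussian cdf, eq. 4.2."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def f_laplace(gamma, lam, eta=1.0):
    """Variational mean for the Laplace prior, eqs. 4.9-4.10:
    xi_pm = (gamma -/+ eta)/sqrt(lam), kappa_pm = Phi(+/-xi_pm) phi(xi_mp)."""
    xi_p = (gamma - eta) / math.sqrt(lam)
    xi_m = (gamma + eta) / math.sqrt(lam)
    k_p = Phi(xi_p) * phi(xi_m)
    k_m = Phi(-xi_m) * phi(xi_p)
    return (xi_p * k_p + xi_m * k_m) / (math.sqrt(lam) * (k_p + k_m))

def mean_by_quadrature(gamma, lam, eta=1.0, lo=-20.0, hi=20.0, n=40001):
    """<S> under P(S) exp(-lam S^2/2 + gamma S) with the Laplace prior,
    by trapezoidal integration; used only to validate the closed form."""
    h = (hi - lo) / (n - 1)
    z = m1 = 0.0
    for i in range(n):
        s = lo + i * h
        w = math.exp(-eta * abs(s) - 0.5 * lam * s * s + gamma * s)
        if i in (0, n - 1):
            w *= 0.5
        z += w
        m1 += s * w
    return m1 / z
```

At $\gamma = 0$ the posterior is symmetric and both routes return a mean of zero, which is a quick sanity check on the $\kappa_\pm$ bookkeeping.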
Figure 1: The variational mean $f$ (left columns) and its derivative $f'$ (right columns) as a function of $\gamma$ and $\lambda$. (a, b) The bigaussian case with $\sigma^2 = 1$ for $\mu_i = \pm 1$ and $\mu_i = \pm 4$, respectively. (c, d) The Laplacian prior for decay rates $\eta = 1/2$ and $\eta = 2$, respectively. (e, f) The exponential prior for decay rates $\eta = 1/2$ and $\eta = 2$, respectively. (g) The variational mean $f$ and the derivative $f'$ of a mixture of five gaussians with mixing proportions $p_i = 1/5$, means $\mu_i = \{-4, -1, 0, 1, 4\}$, and standard deviations $\sigma_i = \{1, 2, 4, 2, 1\}$. (h) The heavy-tailed prior, equation 4.17, with $\alpha = 1$.
Figures 1c and 1d show the variational mean and its derivative for a slowly decaying ($\eta = 0.5$) and a quickly decaying ($\eta = 2$) Laplacian prior. The Laplacian prior has, contrary to the bigaussian source, its region of high curvature at numerically large values of $\gamma$.

4.4 Exponential Source Prior. Some application domains naturally restrict the possible range of the hidden sources and the mixing matrix due
to the physical interpretation of these quantities in the generative model. This is, for instance, the case when the measured signal is known to be a positive superposition of latent counting numbers or intensities. Positivity constraints are relevant in parts-based representations of natural images, deconvolution of the power spectrum of nuclear magnetic resonance (NMR) spectrometers, and latent semantic analysis in text mining (Lee & Seung, 1999). In this section, we consider the exponential source prior parameterized by

$$p(S) = \eta \, e^{-\eta S}, \qquad S \in \mathbb{R}_+, \; \eta > 0, \qquad (4.12)$$

which gives

$$f = \frac{1}{\sqrt{\lambda}} \, \frac{\xi \Phi(\xi) + \varphi(\xi)}{\Phi(\xi)} \qquad (4.13)$$
$$\frac{\partial f}{\partial \gamma} = \frac{1}{\lambda} - \frac{\varphi(\xi)}{\sqrt{\lambda} \, \Phi(\xi)} \, f, \qquad (4.14)$$

with

$$\xi = \frac{\gamma - \eta}{\sqrt{\lambda}}. \qquad (4.15)$$
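A numerically careful implementation of equation 4.13 needs a guard for strongly negative fields, where both $\varphi(\xi)$ and $\Phi(\xi)$ underflow; the cutoff value for switching to the asymptote is our own choice, while the limit itself is the l'Hospital result stated in the text:

```python
import math

def f_exponential(gamma, lam, eta=1.0):
    """Variational mean for the exponential prior p(S) = eta exp(-eta S), S >= 0
    (eq. 4.13): f = (xi Phi(xi) + phi(xi)) / (sqrt(lam) Phi(xi)),
    with xi = (gamma - eta)/sqrt(lam) (eq. 4.15)."""
    xi = (gamma - eta) / math.sqrt(lam)
    if xi < -25.0:
        # phi(xi)/Phi(xi) -> -xi as xi -> -inf (eq. 4.16); avoids 0/0 underflow
        ratio = -xi
    else:
        phi = math.exp(-0.5 * xi * xi) / math.sqrt(2.0 * math.pi)
        Phi = 0.5 * math.erfc(-xi / math.sqrt(2.0))
        ratio = phi / Phi
    return (xi + ratio) / math.sqrt(lam)
```

The mean stays nonnegative and finite for any field value, and it increases monotonically with $\gamma$, as a posterior mean must.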
Figures 1e and 1f show the variational mean and its derivative for the exponential source prior. It is verified that the exponential variational mean is nonnegative. At this point, we will make some brief comments on some algorithmic issues when the normal cdf $\Phi$ appears in the denominator of the variational mean. Special care has to be taken when $\xi \to -\infty$, for example, when $\gamma - \eta < 0$ and $\lambda$ is small, that is, for small self-interactions. Using l'Hospital's rule together with equations 4.1 and 4.2, it is seen that

$$\frac{\varphi(\xi)}{\Phi(\xi)} \to -\xi \quad \text{for } \xi \to -\infty, \qquad (4.16)$$

which implies that the variational mean $f \to 0$ and its derivative $\partial f / \partial \gamma \to 1/\lambda$ for $\xi \to -\infty$. In section 5.4, we will use this prior to learn a set of sparse localized basis functions in images.

The source priors considered thus far are just some examples of priors where the variational mean can be computed analytically. In appendix B, we state some additional examples of priors for which this calculation can be carried out analytically.

4.5 Power Law Tail Prior. In the previous sections, we have considered only source priors for which it was possible to carry out the integration in equation 3.9 analytically. For arbitrary source priors, however, the one-dimensional integral may be solved using standard approaches for numerical integration. Alternatively, we could simply use the insight gained in
the previous sections, where we considered the functional form of the variational mean of various source priors, to come up with computationally tractable $f$ functions directly. To give an example of this, we will construct an $f$ that for large $|\gamma| / \sqrt{\lambda}$ corresponds to a distribution with a power law tail $P(S) \propto |S|^{-\alpha}$. In this limit, the integral in equation 3.9 is dominated by its saddle point. The saddle-point value of $S$ is

$$S_0 = \frac{\gamma}{2\lambda} \left( 1 + \sqrt{1 - \frac{4\alpha\lambda}{\gamma^2}} \right) \approx \frac{\gamma}{\lambda} - \frac{\alpha}{\gamma}.$$

This gives the behavior of the mean function for large $\gamma$. We can now straightforwardly construct a mean function that has this asymptotic behavior and is well defined for small values of $\gamma$:

$$f = \frac{\gamma}{\lambda} - \frac{\alpha \gamma}{\alpha \lambda + \gamma^2}. \qquad (4.17)$$
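This pair of functions (the mean of equation 4.17 and its derivative as listed in Table 1) is trivial to code; the finite-difference consistency check below is our own addition:

```python
import math

def f_powerlaw(gamma, lam, alpha=1.0):
    """Heuristic variational mean with power-law-tail behavior, eq. 4.17."""
    return gamma / lam - alpha * gamma / (alpha * lam + gamma * gamma)

def df_powerlaw(gamma, lam, alpha=1.0):
    """Response function df/dgamma for eq. 4.17 (Table 1):
    1/lam + alpha (gamma^2 - alpha lam) / (alpha lam + gamma^2)^2."""
    q = alpha * lam + gamma * gamma
    return 1.0 / lam + alpha * (gamma * gamma - alpha * lam) / (q * q)
```

For large $|\gamma|$ this behaves like $\gamma/\lambda - \alpha/\gamma$, matching the saddle-point mean of a prior with tail $P(S) \propto |S|^{-\alpha}$, while remaining smooth through $\gamma = 0$.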
Figure 1h shows the heavy-tail $f$-function as a function of $\gamma$ and $\lambda$. Figure 2 shows, for a fixed $\lambda = 1$, the variational mean and derivative for some of the unconstrained source priors considered so far. For $\gamma \to \infty$, the gaussian and the uniform (improper) prior give, respectively, the lower and upper value of $f$ for the priors considered. The variational means and derivatives for the priors considered in this paper are summarized in Table 1.

5 Simulations
In this section, we compare the performance of the different mean-field approaches described in the previous sections: NMF, LR correction, and TAP. To begin, we conduct two experiments with artificially generated data. The source priors used in these experiments are equal to the source prior that generated the data set. We consider both the complete case in which two binary sources are mixed into two sensors and the overcomplete case of three continuous sources mixed into two sensors. Finally, we apply the linear response corrected mean-field approach to perform ICA on two real-world data sets: speech signals and parts of the MNIST handwritten digit database.

5.1 Synthetic Binary Sources in a Complete Setting. Independent component analysis of binary sources has been considered in data transmission using binary modulation schemes such as MSK or biphase codes (van der Veen, 1997). Here, we consider a binary source $S = \{\pm 1\}$ with prior distribution $\frac{1}{2}[\delta(S - 1) + \delta(S + 1)]$. In this case we recover the well-known mean-field equations $\langle S \rangle = \tanh(\gamma)$. Figure 3a shows the column vectors of the mixing matrix and 1000 samples generated from the ICA generative model using a fairly low noise variance, $s^2 = 0.3$. Ideally, the noiseless measurements would consist of the four combinations (with sign) of the columns in the mixing matrix. However, due to the noise, the measurements will be scattered around these prototype observations (shown as + in Figure 3a). Figure 3b
Figure 2: (a) Variational mean and (b) derivative as a function of $\gamma$ for various source priors and fixed $\lambda = 1$. Legends: gaussian with unit mean and variance; mixture of two gaussians with zero mean and standard deviations 1 and 2; mixture of two gaussians with unit variance and means at $\pm 3$; Laplacian with $\eta = 1$ and $\eta = 1/4$, respectively; heavy tail with $\alpha = 1/2$; uniform (improper) distribution.
shows, for each of the mean-field approaches, the variance as a function of iteration number. At these moderate noise variances, an improvement in the convergence rate is obtained by using the linear response corrected mean-field solution. The adaptive TAP approach, on the other hand, is seen to have a slower convergence rate, and only a marginal improvement in the estimated noise variance and mixing matrix is obtained. This is because the approach is critically sensitive to how well the variational parameters have been determined.
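As a concrete illustration of the naive mean-field fixed point for the binary prior, ⟨S⟩ = tanh(γ), here is a minimal numpy sketch (the function name, the damping scheme, and the example mixing matrix are our own choices, not taken from the experiments above):

```python
import numpy as np

def nmf_binary_sources(X, A, Sigma, n_iter=50, eta=0.5):
    """Naive mean-field estimate of binary (+/-1) sources for known A and Sigma.

    Iterates <S_mt> := tanh(gamma_mt), with
    gamma_mt = h_mt - sum_{m' != m} J_mm' <S_m't>,
    where J = A^T Sigma^-1 A and h = A^T Sigma^-1 X.
    """
    Si = np.linalg.inv(Sigma)
    J = A.T @ Si @ A                 # M x M source coupling matrix
    h = A.T @ Si @ X                 # M x N external field
    J_off = J - np.diag(np.diag(J))  # self-coupling is absorbed into lambda
    S = np.zeros_like(h)             # initial means <S> = 0
    for _ in range(n_iter):
        S = (1 - eta) * S + eta * np.tanh(h - J_off @ S)  # damped update
    return S

# Two binary sources mixed into two sensors, noise variance 0.3
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5], [0.2, 1.0]])
S_true = rng.choice([-1.0, 1.0], size=(2, 1000))
X = A @ S_true + np.sqrt(0.3) * rng.standard_normal((2, 1000))
S_est = nmf_binary_sources(X, A, 0.3 * np.eye(2))
accuracy = np.mean(np.sign(S_est) == S_true)
print(accuracy)  # fraction of source signs recovered
```

At this noise level, most source signs are recovered; with higher noise, the naive approach degrades, as discussed below.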
P. Højen-Sørensen, O. Winther, and L. Hansen
Figure 3: Binary source recovery for a low noise variance, σ² = 0.3. (a) 1000 measurements (scatter plot), ± the column vectors of the true mixing matrix (the solid axes), and the measurement prototypes (+) for the noiseless case. (b) Estimated variance for NMF, LR, and TAP as a function of iteration. The thick solid line is the true empirical noise variance, that is, the variance of the 1000 random noise contributions. The trajectories of the fixed-point iteration using (c) NMF, (d) LR, and (e) adaptive TAP. The initial condition is marked × and the final point ◦. The dashed lines are the true mixing matrix.
Figures 3c–e show, for the different mean-field approaches, the trajectories of the fixed-point iterations. All the methods use the same initial conditions (×), and the final point in each trajectory is marked ◦. The dashed lines are ± the column vectors of the true mixing matrix. In this case, there is no significant difference in the mixing matrix estimated using the different mean-field approaches. We now increase the noise variance to σ² = 1. In this case, it is hard to identify the prototype signals from the measured data (see Figure 4a). The naive mean-field approach fails to recover the mixing matrix. Figure 4c
Figure 4: Binary source recovery for a high noise variance, σ² = 1. (a) 1000 measurements (scatter plot), ± the column vectors of the true mixing matrix (the solid axes), and the measurement prototypes (+) for the noiseless case. (b) The estimated variance for NMF, LR, and TAP as a function of iteration. The thick solid line is the true empirical noise variance. The trajectories of the fixed-point iteration using (c) NMF, (d) LR, and (e) adaptive TAP. The initial condition is marked × and the final point ◦. The dashed lines are the true mixing matrix.
shows that one of the directions in the mixing matrix vanishes during the fixed-point iterations, which results in the noise variance being overestimated (see Figure 4b). However, the linear response corrected mean-field approach and adaptive TAP recover the true mixing matrix.

5.2 Continuous Sources in an Overcomplete Setting. In this section, the problem is to recover more sources than sensors; in particular, we consider mixing three sources into two sensors. The source prior used in this experiment is the symmetric Pearson mixture, equation 4.7, with μ = 1. A total of
2000 samples was generated from the generative model (see Figure 5a), and the three mean-field approaches were used to learn the mixing matrix. The trajectories plotted in Figure 5c show that the naive mean-field approach fails to recover the mixing matrix. As in the binary case with high variance, one of the directions in the mixing matrix vanishes (see Figure 5). Only the dominant direction in the data space is captured, whereas the two remaining directions collapse into one "mean" direction. However, both the linear response corrected and the adaptive TAP mean-field approaches succeed in estimating the mixing matrix. The adaptive TAP result appears slightly worse than the LR solution. The reason is that adaptive TAP stops after approximately 2200 iterations (for the particular value of ftol used in this experiment; see appendix A), whereas LR continues for a little more than 4000 iterations. The slow convergence of the adaptive TAP approach is essentially due to the estimation of the additional variational parameters. We restrict ourselves to the LR approach in the following real-world examples, since NMF has turned out to fail in some cases and TAP is considerably more computationally expensive while giving comparable performance.

5.3 Separating Three Speakers from Two Microphones. In this section, we consider the problem of separating three speakers from two microphones. This experimental example was originally reported in Lee, Lewicki, Girolami, and Sejnowski (1999). We have at hand the three original speech signals, each having a duration of 1 second and sampled at 8 kHz. The speech signals are instantaneously linearly mixed into two microphones. Figure 6a shows a scatter plot of the 8000 samples in the measurement (microphone) space.
The fact that natural speech has a heavy-tailed distribution makes this overcomplete problem somewhat easier in the sense that the hidden directions of the mixing matrix reveal themselves clearly in the scatter plot. The linear response corrected mean-field approach was used to perform ICA with the computationally tractable variational mean, equation 4.17, with α = 1. The initial mixing matrix was picked randomly (shown as the dotted axes in Figure 6a). Figure 6b shows the convergence of the algorithm in terms of the angle between the estimated directions and the true directions (the dashed lines in Figure 6a). Figure 6a shows that the algorithm converges rapidly to a mixing matrix that is very close to the one that actually mixed the speech signals. Figure 7 shows each of the inferred sources plotted against each of the true sources. We see that each of the three recovered sources is nicely correlated with exactly one of the true sources and (more or less) uncorrelated with the remaining sources. Notice that any relabeling of the sources and the corresponding permutation of the columns of the mixing matrix leaves the solution of the ICA problem invariant.

5.4 Local Feature Extraction with Sparse Positive Encoding. In this section, we apply the linear response corrected ICA algorithm to the problem of
Figure 5: Overcomplete continuous source recovery with σ² = 1. (a) 2000 measurements (scatter plot) and ± the column vectors (scaled four times) of the true mixing matrix (the solid axes). (b) The estimated variance for NMF, LR, and TAP as a function of iteration. The thick solid line is the true empirical noise variance. The trajectories of the fixed-point iteration using (c) NMF, (d) LR, and (e) adaptive TAP. The initial condition is marked × and the final point ◦. The dashed lines are the true mixing matrix.
finding a small set of localized images representing parts of the digit images in the MNIST handwritten digit database. For illustration purposes, we consider only a small subset of the database: the first 500 cases of the handwritten digit 3. This experiment is identical to that reported in Miskin and MacKay (2000). It is natural to consider positivity constraints on latent variables (say, pixels) when dealing with images. However, such constraints are usually ignored by most of the commonly used preprocessing models; for example, the principal component analysis (PCA) generative model amounts simply to sequentially finding orthogonal directions (components) with maximum variance in the data space. Ignoring such constraints is problematic since
Figure 6: Overcomplete speech separation (3-in-2) using the heavy-tailed f function, equation 4.17, with α = 1; see Figure 1h. (a) Scatter plot of 1 sec of the mixed speech (8 kHz), the true A (dashed lines), the initial A (black dotted), and the estimated A. (b) Estimated angle as a function of iteration. The horizontal lines illustrate the true angles at 0 and ±45 degrees.
Figure 7: Overcomplete speech separation (3-in-2) using the heavy-tailed f function, equation 4.17, with α = 1. Scatter plots of the ICA-estimated sources S_est^(i) versus the true sources S_true^(i), i = 1, 2, 3.
for an unconstrained model to yield positive digit images, there has to be an interaction between positive and negative regions in different components, and it is therefore not obvious what the set of components represents visually. To illustrate these points, we conduct two ICA experiments using the exponential prior P(S) = e^(−S), S ∈ R₊. In the first experiment, we do not constrain the mixing matrix, whereas in the second experiment, the mixing matrix is constrained to be positive. For both experiments, we assume that there are 25 hidden images. Figure 8a shows the 25 hidden images obtained using ICA with positively constrained sources but an unconstrained mixing matrix. Although the sources in this case are positively constrained, the fact that hidden images are allowed to be subtracted from one another in order to obtain a positive image leads to nonlocal hidden images, which are hard to interpret visually. Figure 8b shows the 25 hidden images obtained by performing ICA with the positivity constraint enforced on the mixing matrix. In this case, the hidden images clearly represent local features, in particular, the different handwriting styles and strokes in the various parts of the written digit.
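For the exponential prior P(S) = e^(−S), the variational mean also has a closed form: multiplying e^(−S) onto the gaussian kernel e^(−λS²/2 + γS) simply shifts γ to γ − 1, so the truncated-gaussian formula of appendix B applies with that shift. The following scipy sketch (our own derivation and naming, cross-checked numerically) illustrates this:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def f_exponential(gamma, lam, eta=1.0):
    """Variational mean <S> for the exponential prior P(S) = eta*exp(-eta*S), S >= 0.

    The prior shifts gamma to gamma - eta, giving the truncated-gaussian form
    f = (gamma - eta)/lam + D(kappa) / (sqrt(lam) * Phi(kappa)),
    with kappa = (gamma - eta) / sqrt(lam).
    """
    g = gamma - eta
    kappa = g / np.sqrt(lam)
    return g / lam + norm.pdf(kappa) / (np.sqrt(lam) * norm.cdf(kappa))

# Cross-check against direct numerical integration of the tilted density
gamma, lam = 0.7, 2.0
w = lambda S: np.exp(-S) * np.exp(-0.5 * lam * S**2 + gamma * S)
Z, _ = quad(w, 0, np.inf)
m, _ = quad(lambda S: S * w(S), 0, np.inf)
print(f_exponential(gamma, lam), m / Z)  # the two values agree
```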
6 Conclusion
In this article, we have presented a probabilistic (Bayesian) approach to ICA. Sources are estimated by their posterior mean, and maximum a posteriori estimates are used for the mixing matrix and the noise covariance. By this procedure, we derived an EM-type algorithm. The expectation step is carried out using different mean-field (MF) approaches: variational (also known as ensemble learning or naive MF), linear response, and adaptive TAP. The MF theories produce estimates of the posterior source correlations of increasing quality. These are needed in the maximization step for the estimates of the mixing matrix and the noise. The importance of a good estimate of the correlations is seen in the specific examples where the simplest variational approach fails. The general applicability of the formalism and its MF implementation is demonstrated on local feature extraction in images (using a nonnegative mixing matrix and source priors) and on overcomplete separation of speech (using heavy-tailed source priors). The good performance of the mean-field approach supports the belief that we get fair estimates of the posterior means and covariances. However, a rigorous test requires either explicit numerical integration, which is possible only for low-dimensional problems, or Monte Carlo sampling, which may also be inaccurate in complex cases. In the following, we discuss a number of possible extensions of this work. One obvious extension is the modeling of temporal correlations. The most general formulation of the model with temporal correlations leads to the consideration of the junction tree algorithm.
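The EM-type algorithm just described can be sketched compactly for the binary-source, naive mean-field case. The following numpy sketch is our own simplification (we keep Σ diagonal and use the standard closed-form unconstrained M-step A := X⟨S⟩ᵀ⟨SSᵀ⟩⁻¹, which we take to correspond to the unconstrained update of the full recipe in appendix A):

```python
import numpy as np

def mf_ica_em(X, M, n_em=100, n_e=30, eta=0.5, seed=0):
    """EM for noisy ICA with binary (+/-1) sources and a naive mean-field E-step.

    E-step: damped fixed-point iteration of <S> = tanh(gamma), with
    gamma = h - (J - diag(J)) <S>, J = A^T Sigma^-1 A, h = A^T Sigma^-1 X.
    M-step: A := X <S>^T <SS^T>^-1, Sigma := (1/N) <(X - AS)(X - AS)^T>,
    where <SS^T> includes the per-unit variances 1 - <S>^2 of the tanh means.
    """
    rng = np.random.default_rng(seed)
    D, N = X.shape
    A = rng.standard_normal((D, M))
    Sigma = np.eye(D)
    S = np.zeros((M, N))
    for _ in range(n_em):
        Si = np.linalg.inv(Sigma)
        J = A.T @ Si @ A
        h = A.T @ Si @ X
        J_off = J - np.diag(np.diag(J))
        for _ in range(n_e):
            S = (1 - eta) * S + eta * np.tanh(h - J_off @ S)
        # <SS^T> summed over t: outer products of means plus diagonal variances
        SS = S @ S.T + np.diag((1.0 - S**2).sum(axis=1))
        A = X @ S.T @ np.linalg.inv(SS)
        # Sigma update; with the freshly updated A this reduces to diag(XX^T - ASX^T)/N
        Sigma = np.diag(np.diag(X @ X.T - A @ S @ X.T)) / N
    return A, Sigma, S

# Binary sources, 2 x 2 mixing, true noise variance 0.3
rng = np.random.default_rng(1)
A_true = np.array([[1.0, 0.5], [0.2, 1.0]])
S_true = rng.choice([-1.0, 1.0], size=(2, 500))
X = A_true @ S_true + np.sqrt(0.3) * rng.standard_normal((2, 500))
A_est, Sigma_est, S_est = mf_ica_em(X, M=2)
print(np.diag(Sigma_est))  # estimated sensor noise variances
```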
Figure 8: Feature extraction of the handwritten digit 3 using an exponential prior with η = 1 and (a) an unconstrained mixing matrix and (b) a positively constrained mixing matrix.
Optimization of the hyperparameters of the prior can be performed by extending the current EM algorithm. The mean-field approach can also be used to derive leave-one-out estimators (Opper & Winther, 2000, 2001), which can be used both for optimization of hyperparameters and for model selection. Model selection can also be performed using the (approximate mean-field) likelihood of either an independent test set or the training set, together with an asymptotic model selection criterion such as the Bayesian information criterion (BIC) (Schwarz, 1978). Finally, it could be interesting to relax some of the basic requirements of the model. The first is the statistical independence of the sources. Our formalism can be extended to treat a priori gaussian correlations between the (nongaussian) sources. We should be able to estimate these correlations effectively by, for example, the linear response technique. Second, the model can be extended to nonlinear mixing by, for example, introducing a sigmoidal squashing of the mixed signal. With some increase in computational complexity, this situation can also be included in the mean-field framework (Opper & Winther, 2001).
Appendix A: Algorithmic Recipe

Table 2 gives an EM recipe for solving the mean-field equations and the equations for the mixing matrix and the noise covariance. The table specifies which equations have been used. Here, we give the equations for adaptive TAP. Linear response theory is obtained by omitting the updating step for λ_mt, that is, by setting N_λ := 0. Furthermore, setting χᵗ_mm′ := δ_mm′ f′(γ_mt, λ_mt) instead of χᵗ := (Λ_t + J)⁻¹ leads to the naive mean-field algorithm. In the table, we have given the update rule for the nonnegative mixing matrix, equation 2.13. To get the unconstrained mixing matrix, the unconstrained update rule, equation 2.10, should be used. Note that we use a greedy update step for all variables but the means ⟨S⟩. Adaptive TAP is especially sensitive to the choice of the learning rate η. It is therefore made adaptive such that it is increased by a factor of 1.1 if the sum of the squared deviations Σ_mt |δ⟨S_mt⟩|² decreases compared to the previous update. Otherwise, it is decreased by a factor of 2. Our experience with the TAP equations also indicates that running with a variable number of updates of ⟨S⟩ could be helpful. However, in the simulations described here, we kept the number of iterations fixed.

Table 2: Pseudocode for the Mean-Field ICA Algorithms.

Initialization (equations 3.3, 3.4, and 3.8):
  J := Aᵀ Σ⁻¹ A
  h := Aᵀ Σ⁻¹ X
  ⟨S⟩ := 0 (or small random values if 0 is a fixed point)
  for m := 1, ..., M and t := 1, ..., N:
    λ_mt := J_mm
  endfor
  N_⟨S⟩ := 20, N_λ := 10, N_A := 10, N_Σ := 1, ftol := 10⁻⁵
do:
  Expectation step:
  for N_⟨S⟩ iterations (equations 3.18 and 3.9):
    for m := 1, ..., M and t := 1, ..., N:
      γ_mt := h_mt − Σ_m′ (J_mm′ − λ_m′t δ_mm′) ⟨S_m′t⟩
      δ⟨S_mt⟩ := f(γ_mt, λ_mt) − ⟨S_mt⟩
    endfor
    ⟨S⟩ := ⟨S⟩ + η δ⟨S⟩
  endfor
  for N_λ iterations (equations 3.17 and 3.16):
    for m := 1, ..., M and t := 1, ..., N:
      Λ_mt := λ_mt + 1 / f′(γ_mt, λ_mt)
      δλ_mt := 1 / [(Λ_t + J)⁻¹]_mm − 1 / f′(γ_mt, λ_mt)
    endfor
    for m := 1, ..., M and t := 1, ..., N:
      λ_mt := λ_mt + δλ_mt
    endfor
  endfor
  for t := 1, ..., N (equation 3.14):
    χ_t := (Λ_t + J)⁻¹
  endfor
  Maximization step:
  for N_A iterations (equation 2.13 or 2.10):
    for d := 1, ..., D and m := 1, ..., M:
      δA_dm := [Σ⁻¹ X ⟨S⟩ᵀ]_dm / ([Σ⁻¹ A ⟨SSᵀ⟩]_dm + α A_dm + β) · A_dm − A_dm
      A_dm := A_dm + δA_dm
    endfor
  endfor
  for N_Σ iterations (equation 2.9):
    δΣ := (1/N) ⟨(X − AS)(X − AS)ᵀ⟩ − Σ
    Σ := Σ + δΣ
  endfor
  J := Aᵀ Σ⁻¹ A
  h := Aᵀ Σ⁻¹ X
while max(|δ⟨S_mt⟩|², |δλ_mt|², |δA_dm|², |δΣ_dd′|²) > ftol

Appendix B: Some Additional Analytical Source Priors
In this appendix, we derive the variational mean and response function for some additional analytical source priors that have not been used directly in this article. We show these calculations in some detail since they are of the same type as the one carried out in deriving the variational mean of the sources in section 4.

B.1 Positively Constrained Gaussian Source Prior. Calculating the variational mean, equation 3.9, in general involves an integral of the form

∫ dS P(S) e^(−λS²/2 + γS),    (B.1)

where P(S) is the source prior. The source priors considered in this article are all of such a form that this integral can be reparameterized into an integral over a gaussian kernel. For this reason, it is useful to have at hand an expression for the integral of a gaussian kernel,

∫_{−∞}^{x} dS e^(−λS²/2 + γS) = (√λ D(γ/√λ))⁻¹ √(λ/2π) ∫_{−∞}^{x} dS e^(−λ(S − γ/λ)²/2)    (B.2)
  = (√λ D(γ/√λ))⁻¹ (2π)^(−1/2) ∫_{−∞}^{ξ} dS e^(−S²/2)    (B.3)
  = Φ(ξ) / (√λ D(γ/√λ)),    (B.4)

where ξ = √λ (x − γ/λ). The first equality follows from completing the square and introducing the gaussian pdf D, equation 4.1. The second equality follows by changing the integration variable, whereas the final equality follows by introducing the gaussian cdf Φ, equation 4.2. We can now calculate the following integral:

∫_{0}^{+∞} dS e^(−λS²/2 + γS) = ∫_{−∞}^{+∞} (·) − ∫_{−∞}^{0} (·) = (1 − Φ(−γ/√λ)) / (√λ D(γ/√λ)) = Φ(γ/√λ) / (√λ D(γ/√λ)).    (B.5)

Suppose we are interested in calculating the variational mean of a density having equation B.5 as the partition function. Recall that any factor of proportionality independent of γ is not needed in calculating the variational mean; that is, with D and Φ evaluated at γ/√λ,

f(γ, λ) = (Φ/D)⁻¹ (Φ′_γ D − Φ D′_γ) / D² = (Φ/D)⁻¹ (D²/√λ + (γ/λ) Φ D) / D²    (B.6)
  = γ/λ + D(γ/√λ) / (√λ Φ(γ/√λ)).    (B.7)

We can now return to the problem of calculating the variational mean of a positively constrained gaussian parameterized by

p(S | μ, σ) ∝ e^(−(S − μ)²/(2σ²)),   S ∈ R₊,    (B.8)

where μ and σ² are the mean and variance, respectively. Multiplying the source prior onto the gaussian kernel and identifying terms, it is seen that the product can be written as a gaussian kernel with λ := λ + 1/σ² and γ := γ + μ/σ². Substituting back into equation B.7, we directly obtain the variational mean,

f(γ, λ) = (γ + μ/σ²) / (λ + 1/σ²) + D(κ) / (√(λ + 1/σ²) Φ(κ)),    (B.9)

where we have introduced

κ = (γ + μ/σ²) / √(λ + 1/σ²),    (B.10)

and the response function can be readily derived,

∂f/∂γ = (1 / (λ + 1/σ²)) (1 − κ D(κ)/Φ(κ) − (D(κ)/Φ(κ))²).    (B.11)
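Equations B.9–B.11 can be checked numerically. The following scipy sketch (our own function names) implements the variational mean and response function for the positively constrained gaussian prior and compares the mean against brute-force integration of the tilted density:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def f_trunc_gauss(gamma, lam, mu, sigma):
    """Variational mean, equation B.9, for the positively constrained gaussian prior."""
    lam_eff = lam + 1.0 / sigma**2
    gam_eff = gamma + mu / sigma**2
    kappa = gam_eff / np.sqrt(lam_eff)          # equation B.10
    return gam_eff / lam_eff + norm.pdf(kappa) / (np.sqrt(lam_eff) * norm.cdf(kappa))

def df_trunc_gauss(gamma, lam, mu, sigma):
    """Response function, equation B.11."""
    lam_eff = lam + 1.0 / sigma**2
    kappa = (gamma + mu / sigma**2) / np.sqrt(lam_eff)
    r = norm.pdf(kappa) / norm.cdf(kappa)       # inverse Mills ratio
    return (1.0 - kappa * r - r**2) / lam_eff

# Cross-check B.9 against direct numerical integration on [0, inf)
gamma, lam, mu, sigma = 0.4, 1.5, -0.3, 0.8
w = lambda S: np.exp(-0.5 * (S - mu)**2 / sigma**2 - 0.5 * lam * S**2 + gamma * S)
Z, _ = quad(w, 0, np.inf)
m, _ = quad(lambda S: S * w(S), 0, np.inf)
print(f_trunc_gauss(gamma, lam, mu, sigma), m / Z)  # the two values agree
```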
We now consider the variational mean and the response function associated with the uniform source prior.
B.2 Uniform Source Prior. In this section, we consider the uniform prior,

P(S) = 1/(b − a),   S ∈ [a, b],    (B.12)

where b ≥ a. By reusing the calculations from section B.1, we directly obtain

∫_{a}^{b} dS e^(−λS²/2 + γS) = ∫_{−∞}^{b} (·) − ∫_{−∞}^{a} (·) = (Φ(κ_b) − Φ(κ_a)) / (√λ D(γ/√λ)),    (B.13)

where κ_x = √λ (x − γ/λ) = √λ x − γ/√λ. Here, we have again left out the normalizing constant since it is of no importance in calculating the variational mean,

f(γ, λ) = γ/λ + (1/√λ) (D(κ_a) − D(κ_b)) / (Φ(κ_b) − Φ(κ_a)),    (B.14)

and the response function,

∂f/∂γ = (1/λ) (1 + (κ_a D(κ_a) − κ_b D(κ_b)) / (Φ(κ_b) − Φ(κ_a)) − ((D(κ_a) − D(κ_b)) / (Φ(κ_b) − Φ(κ_a)))²).    (B.15)
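As with the positively constrained gaussian, equations B.14 and B.15 are easy to verify numerically. A scipy sketch (our own naming, checked against direct integration on [a, b]):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def f_uniform(gamma, lam, a, b):
    """Variational mean, equation B.14, for the uniform prior on [a, b]."""
    ka = np.sqrt(lam) * a - gamma / np.sqrt(lam)
    kb = np.sqrt(lam) * b - gamma / np.sqrt(lam)
    num = norm.pdf(ka) - norm.pdf(kb)
    den = norm.cdf(kb) - norm.cdf(ka)
    return gamma / lam + num / (np.sqrt(lam) * den)

def df_uniform(gamma, lam, a, b):
    """Response function, equation B.15."""
    ka = np.sqrt(lam) * a - gamma / np.sqrt(lam)
    kb = np.sqrt(lam) * b - gamma / np.sqrt(lam)
    den = norm.cdf(kb) - norm.cdf(ka)
    t1 = (ka * norm.pdf(ka) - kb * norm.pdf(kb)) / den
    t2 = (norm.pdf(ka) - norm.pdf(kb)) / den
    return (1.0 + t1 - t2**2) / lam

# Cross-check against direct numerical integration on [a, b]
gamma, lam, a, b = 0.6, 2.0, -1.0, 2.0
w = lambda S: np.exp(-0.5 * lam * S**2 + gamma * S)
Z, _ = quad(w, a, b)
m, _ = quad(lambda S: S * w(S), a, b)
print(f_uniform(gamma, lam, a, b), m / Z)  # the two values agree
```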
This appendix showed some examples of the calculations needed to derive the variational mean and response functions for the source priors considered in this article.

Acknowledgments
We thank Michael Jordan and Manfred Opper for helpful discussions. This research is supported by the Danish Research Councils through the THOR Center for Neuroinformatics and by the Center for Biological Sequence Analysis.

References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Belouchrani, A., & Cardoso, J.-F. (1995). Maximum likelihood source separation by the expectation-maximization technique: Deterministic and stochastic implementation. In Proc. NOLTA (pp. 49–53).
Csató, L., Fokoué, E., Opper, M., Schottky, B., & Winther, O. (2000). Efficient approaches to gaussian process classification. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 251–257). Cambridge, MA: MIT Press.
Girolami, M. (1998). An alternative perspective on adaptive independent component analysis algorithms. Neural Computation, 10(8), 2103–2114.
Girolami, M. (Ed.). (2000). Advances in independent components analysis. Berlin: Springer-Verlag.
Gradshteyn, I. S., & Ryzhik, I. M. (1980). Table of integrals, series, and products (enl. ed.). New York: Academic Press.
Hansen, L. K. (2000). Blind separation of noisy image mixtures. In M. Girolami (Ed.), Advances in independent components analysis (pp. 165–187). Berlin: Springer-Verlag.
Hyvärinen, A., & Karthikesh, R. (2000). Sparse priors on the mixing matrix in independent component analysis. In Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000) (pp. 477–452). Helsinki, Finland.
Kappen, H. J., & Rodríguez, F. B. (1998). Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10, 1137–1156.
Knuth, K. (1999). A Bayesian approach to source separation. In J.-F. Cardoso, C. Jutten, & P. Loubaton (Eds.), Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation (ICA'99) (pp. 283–288). Aussois, France.
Lappalainen, H., & Miskin, J. W. (2000). Ensemble learning. In M. Girolami (Ed.), Advances in independent components analysis. Berlin: Springer-Verlag.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788–791.
Lee, T.-W. (1998). Independent component analysis: Theory and applications. Boston: Kluwer.
Lee, T.-W., Lewicki, M. S., Girolami, M., & Sejnowski, T. J. (1999). Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 4(4).
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Luenberger, D. G. (1984). Linear and nonlinear programming (2nd ed.). Reading, MA: Addison-Wesley.
MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis (Tech. Rep.). Cambridge University, Cavendish Laboratory.
Miskin, J. W., & MacKay, D. J. C. (2000). Ensemble learning for blind image separation and deconvolution. In M. Girolami (Ed.), Advances in independent components analysis. Berlin: Springer-Verlag.
Opper, M., & Winther, O. (2000). Gaussian processes for classification: Mean field algorithms. Neural Computation, 12(11), 2655–2684.
Opper, M., & Winther, O. (2001). Tractable approximations for probabilistic models: The adaptive Thouless-Anderson-Palmer mean field approach. Phys. Rev. Lett., 86, 3695–3699.
Parisi, G. (1988). Statistical field theory. Reading, MA: Addison-Wesley.
Petersen, K. S., Hansen, L. K., Kolenda, T., Rostrup, E., & Strother, S. (2000). On the independent components in functional neuroimages. In P. Pajunen & J. Karhunen (Eds.), Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000) (pp. 615–620). Helsinki, Finland.
Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019.
Peterson, C., & Söderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3–21.
Pierro, A. R. (1993). On the relation between the ISRA and the EM algorithm for positron emission tomography. IEEE Transactions on Medical Imaging, 12(2), 328–333.
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Thouless, D. J., Anderson, P. W., & Palmer, R. G. (1977). Solution of "solvable model of a spin glass." Philosophical Magazine, 35(3), 593–601.
van der Veen, A.-J. (1997). Analytical method for blind binary signal separation. IEEE Trans. on Signal Processing, 45(4), 1078–1082.

Received December 27, 2000; accepted June 25, 2001.
LETTER
Communicated by Bob Williamson
Neural Networks with Local Receptive Fields and Superlinear VC Dimension

Michael Schmitt
[email protected]
Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany

Local receptive field neurons comprise such well-known and widely used unit types as radial basis function (RBF) neurons and neurons with center-surround receptive field. We study the Vapnik-Chervonenkis (VC) dimension of feedforward neural networks with one hidden layer of these units. For several variants of local receptive field neurons, we show that the VC dimension of these networks is superlinear. In particular, we establish the bound Ω(W log k) for any reasonably sized network with W parameters and k hidden nodes. This bound is shown to hold for discrete center-surround receptive field neurons, which are physiologically relevant models of cells in the mammalian visual system, for neurons computing a difference of gaussians, which are popular in computational vision, and for standard RBF neurons, a major alternative to sigmoidal neurons in artificial neural networks. The result for RBF neural networks is of particular interest since it answers a question that has been open for several years. The results also give rise to lower bounds for networks with fixed input dimension. Regarding constants, all bounds are larger than those known thus far for similar architectures with sigmoidal neurons. The superlinear lower bounds contrast with linear upper bounds for single local receptive field neurons also derived here.

1 Introduction
The receptive field of a neuron is the region of the input domain giving rise to stimuli to which the neuron responds by changing its behavior. Neuron models can be classified according to whether these stimuli are contained in some bounded region or may come from afar; in other words, whether their receptive field is local or not. Prominent examples of local receptive field models are neurons with center-surround receptive field and neurons computing radial basis functions (RBFs), whereas sigmoidal neurons represent a widely used model type having a nonlocal receptive field. The impressive computational and learning capabilities of neural networks, being significantly higher than those of single neurons, are well established by experimental findings in biology, innumerable successful applications in

Neural Computation 14, 919–956 (2002)
© 2002 Massachusetts Institute of Technology
practice, and substantial formal arguments in the theories of computation, approximation, and learning. An evident question is to what extent these network capabilities depend on the receptive field type of the neurons. An extensively studied measure for quantifying the computational and learning capabilities of formal systems is the Vapnik-Chervonenkis (VC) dimension. It characterizes the expressiveness of a neural network and is mostly given in terms of the number of network parameters and the network size. A well-known fact is that the VC dimension of sigmoidal neural networks is significantly more than linear. This has been shown for networks growing in depth (Koiran & Sontag, 1997; Bartlett, Maiorov, & Meir, 1998) and for constant-depth networks with the number of hidden layers being two (Maass, 1994) and one (Sakurai, 1993). The two latter results for networks of constant depth are of particular significance since they deal with neural architectures as they are used in practice, where one rarely allows the number of hidden layers to grow indefinitely. Moreover, the case of one hidden layer is of even greater interest, because single neurons almost always have a VC dimension that is linear in the input dimension and, hence, linear in the number of model parameters.1 This has been found for sigmoidal neurons (Cover, 1965; Haussler, 1992; see also Anthony & Bartlett, 1999) and for several other models, such as higher-order sigmoidal neurons with restricted degree (Anthony, 1995), the neuron computing Boolean monomials (Natschläger & Schmitt, 1996), and the product unit (Schmitt, 2000). Thus, the fact that networks with the minimal number of one hidden layer have superlinear VC dimension corroborates the enormous computational capabilities arising when sigmoidal neurons cooperate in networks. In this article, we study networks with one hidden layer of local receptive field neurons.
We show for several types of receptive fields that the VC dimension of these networks is superlinear. First, we consider discrete models of cells with center-surround receptive field (CSRF). The first real neurons to be identified as having a receptive field with center-surround organization were the ganglion cells in the visual system of the cat. The recording experiments of Kuffler (1953) from the optic nerve revealed the pure on- and off-type responses within specific areas of ganglion cell receptive fields and their concentric, antagonistic center-surround organization. Also, other neurons of the mammalian visual system, such as the bipolar cells of the retina and cells in the lateral geniculate nucleus, are known to have center-surround receptive fields (see, e.g., Tessier-Lavigne, 1991; Nicholls, Martin, & Wallace, 1992). CSRF neurons play an important role in algorithmic experiments with self-organizing networks. The question of how center-surround receptive fields can emerge in artificial networks by adjusting their parameters has been investigated using unsupervised (Linsker, 1986, 1988; Atick
1 A notable exemplar of a single neuron having superlinear VC dimension is the model of a spiking neuron studied by Maass and Schmitt (1999).
& Redlich, 1993; Schmidhuber, Eldracher, & Foltin, 1996) as well as supervised learning mechanisms (Joshi & Lee, 1993; Yasui, Furukawa, Yamada, & Saito, 1996). It is found that cells similar to those of the first few stages of the mammalian visual system develop when applying simple learning rules and using training data from realistic visual scenes. Neural networks with center-surround receptive fields have also been fabricated in analog VLSI hardware. The silicon retinas constructed by Mead and Mahowald (1988) (see also Mead, 1989), Ward and Syrzycki (1995), and Liu and Boahen (1996) consist of neuromorphic cells performing operations of biological receptive fields. The second type of receptive field neuron we consider is called the difference-of-gaussians (DOG) neuron and is a continuous version of the above model. It also has its origin in neurobiology. Extending the work of Kuffler (1953), Rodieck (1965) was probably the first to introduce the DOG as a quantitative model for the functional responses of ganglion cells. The importance and the physiological plausibility of the DOG for satisfactorily fitting functions to experimental data from retinal ganglion cell recordings are also demonstrated by the work of Enroth-Cugell and Robson (1966). The DOG is generally accepted as a mathematical description of the behavior of several cell types in the retino-cortical pathway,2 such as the abovementioned bipolar cells, ganglion cells, and cells in the lateral geniculate nucleus (Marr & Hildreth, 1980; Marr, 1982; Glezer, 1995). Models based on DOG functions may even provide better descriptions of experimental data than other common models of visual processing, as shown by Hawken and Parker (1987) in their study of the monkey primary visual cortex. The third and last type of local receptive field neuron in this study is the RBF neuron, specifically the standard, that is, gaussian RBF neuron.
RBF networks are among the major neural network types used in practice (see, e.g., Bishop, 1995; Ripley, 1996). They are appreciated because of their powerful capabilities in function approximation and learning, which are also theoretically well founded. A series of articles specifically deals with showing that under rather mild conditions, RBF networks can uniformly approximate continuous functions on compact domains arbitrarily closely (Hartman, Keeler, & Kowalski, 1990; Park & Sandberg, 1991, 1993; Mhaskar, 1996). Even before RBF networks were considered as artificial neural networks, they were a well-established method for multivariable interpolation and function approximation. A comprehensive account of the approximation theory of RBFs up to 1990 is given by Powell (1992). The connections between approximation theory and learning in adaptive networks of RBF neurons were initially explored by Broomhead and Lowe (1988) and Poggio

2 In particular, Marr and Hildreth (1980) prove that under certain conditions, the DOG operator closely approximates the Laplacian-of-gaussians, also known as the Marr filter, which they show to be well suited to detect intensity changes and, especially, edges in images.
Michael Schmitt
and Girosi (1990). Moody and Darken (1989) studied learning algorithms for RBF networks that can be implemented as real-time adaptive systems. They showed that combined supervised and unsupervised learning methods can be computationally faster in RBF networks than the gradient-based methods devised for sigmoidal networks. (See Howlett and Jain, 2001a, 2001b, and Yee & Haykin, 2001, for recent developments in RBF neural networks.) There has been previous work on the VC dimension of RBF networks. Bartlett and Williamson (1996) show that the VC dimension and the related pseudodimension of RBF networks with discrete inputs is O(W log(WD)), where W is the number of network parameters and the inputs take on values from {−D, …, D}. The best-known upper bound for RBF networks with unconstrained inputs is due to Karpinski and Macintyre (1997) and is O(W²k²), where k denotes the number of network nodes. Holden and Rayner (1995) address the generalization capabilities of networks having RBF units with fixed parameters and establish a linear upper bound on the VC dimension. Anthony and Holden (1994) consider fully adaptable RBF networks with adjustable hidden and output node parameters. Referring to the lower bound Ω(W log W) for sigmoidal networks due to Maass (1994), they write “we leave as an open question whether it is possible to obtain a lower bound similar to that recently proved by Maass (1993, 1994) for certain feedforward networks” (p. 104). The work of Erlich, Chazan, Petrack, and Levy (1997) together with a result of Lee, Bartlett, and Williamson (1995) gives a linear lower bound (see also Lee, Bartlett, & Williamson, 1997). Although there exists already a large collection of VC dimension bounds for neural networks, it has not been known thus far whether the VC dimension of RBF neural networks is superlinear.
Major reasons for this might be that previous results establishing superlinear bounds are based on methods geared to sigmoidal3 neurons or consider networks having an unrestricted number of layers4 (Sakurai, 1993; Maass, 1994; Koiran & Sontag, 1997; Bartlett et al., 1998). In this article, we prove that the VC dimension of RBF networks is indeed superlinear, thus answering the question of Anthony and Holden (1994) quoted above. Precisely, we show that every network with n input nodes, W parameters, and one hidden layer of k RBF neurons, where k ≤ 2^{(n+2)/2}, has VC dimension5 Ω(W log k). Thus, the cooperative network effect ob-
3 For a quite general definition of a sigmoidal neuron that does not capture RBF neurons, see Koiran and Sontag (1997).
4 We point out that it might be possible to obtain superlinear lower bounds for local receptive field networks pursuing the approaches of Koiran and Sontag (1997) and Bartlett et al. (1998), but only at the expense of allowing arbitrary depth. In particular, this has no relevance for standard RBF networks.
5 Note that this result also gives rise to the lower bound Ω(W log W) by choosing a network with k = n hidden units.
Neural Networks with Local Receptive Fields
served in sigmoidal networks is also present in RBF networks. This result also has implications for the complexity of learning with RBF networks, all the more since it entails the same lower bound for the related notions of pseudodimension and fat-shattering dimension. We do not state these consequences explicitly here but refer readers to Anthony and Bartlett (1999) instead. Before establishing the lower bound for RBF networks, however, we show that the bound Ω(W log k) holds for the VC dimension of DOG networks. From this, the bound for RBF networks is then immediately obtained. The result for DOG networks, in turn, is derived from the superlinear lower bound for discrete CSRF networks that we establish first. Thus, this work creates a link between these three neuron models not only by focusing on their common receptive field property but also by the logical requisite of the successive proofs of the VC dimension bounds. We introduce definitions and notation in section 2. The two subsequent sections contain the derivations of the superlinear lower bounds. In section 3, we consider networks of discrete local receptive field neurons, specifically the binary and ternary CSRF neuron and a discrete variant of the RBF neuron, the binary RBF neuron. In section 4, we study networks of DOG neurons and standard RBF neural networks. We note that all bounds derived in both these sections have larger constant factors than those known for sigmoidal networks of constant depth thus far. In particular, we obtain the bound (W/5) log(k/4) for binary CSRF neurons and for DOG neurons, and the bound (W/12) log(k/8) for ternary CSRF neurons, binary RBF neurons, and gaussian RBF neurons. For comparison, sigmoidal networks are known with one hidden layer and VC dimension at least (W/32) log(k/4), and with two hidden layers and VC dimension at least (W/132) log(k/16) (see Anthony & Bartlett, 1999, section 6.3).
The results in sections 3 and 4 also give rise to lower bounds for local receptive field networks when the input dimension is fixed. In section 5, we present upper and lower bounds for single neurons. In particular, we show that the VC dimension of binary RBF and CSRF neurons and of ternary CSRF neurons is linear. Further, we derive such a result for the pseudodimension of the gaussian RBF neuron. Finally, in section 6, we return to networks of discrete local receptive field neurons and establish the upper bound O(W log k) for all discrete variants of local receptive field neurons considered here. This implies that the lower bounds for these networks are asymptotically optimal. We conclude with section 7, discussing the results and presenting some open questions. An appendix gives the derivation of a bound for a specific class of functions defined in terms of halfspaces and required for a result in section 5.1.
2 Definitions
We first introduce the types of neurons and networks with local receptive fields that we study. Then we give the definitions of the VC dimension and other basic concepts.
2.1 Networks of Local Receptive Field Neurons. We start with discrete neurons. Let ‖u‖ denote the Euclidean norm of vector u. A binary CSRF neuron computes the function g_bCSRF: R^{2n+2} → {0, 1} defined as

g_bCSRF(c, a, b, x) = 1 if a ≤ ‖x − c‖ ≤ b, and 0 otherwise,
with input variables x_1, …, x_n and parameters c_1, …, c_n, a, b, where b > a > 0. The vector (c_1, …, c_n) is called the center of the neuron, and a, b are its center radius and surround radius, respectively. We also refer to this neuron as a binary off-center on-surround neuron and call, for given parameters c, a, b, the set {x: g_bCSRF(c, a, b, x) = 1} the surround region of the neuron. A ternary CSRF neuron is defined by means of the function g_tCSRF: R^{2n+2} → {−1, 0, 1} with

g_tCSRF(c, a, b, x) = 1 if a ≤ ‖x − c‖ ≤ b, −1 if ‖x − c‖ < a, and 0 otherwise.

This neuron is also called a ternary off-center on-surround neuron. Finally, a binary RBF neuron computes the function g_bRBF: R^{2n+1} → {0, 1} satisfying

g_bRBF(c, b, x) = 1 if ‖x − c‖ ≤ b, and 0 otherwise.

Figure 1 shows the receptive field of these neurons for the case n = 2. We emphasize that the output values in these definitions are meant to be symbolic and represent discrete levels of neural activity. So a 1 corresponds to a state where the neuron is highly active, whereas −1 indicates low activity. The value 0 signifies that the neuron is silent. Furthermore, the specific assignment of values to the activity levels is not relevant for the results derived in this article. For instance, any two distinct nonzero values A < B instead of −1, 1 can be chosen for the ternary CSRF neuron without affecting the validity of the lower bounds. The same holds for the binary CSRF and binary RBF neuron, where the value 1 can be replaced by any other nonzero value. The assignment of output values to points lying on a radius also allows some freedom. For instance, we could alternatively require that for ‖x − c‖ = a, we have g_bCSRF(c, a, b, x) = 0 or g_tCSRF(c, a, b, x) = −1. The same is true for ‖x − c‖ = b. The VC dimension bounds do not rely on the values for the radii and hence still hold for these and other cases. We have defined here only off-center on-surround variants of CSRF neurons.
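The three discrete neurons can be transcribed directly from the definitions above. The following is a minimal Python sketch of ours (the function names are not from the article):

```python
import math

def norm(x, c):
    # Euclidean distance between input x and center c
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def g_bcsrf(c, a, b, x):
    # binary CSRF neuron: 1 on the surround region a <= ||x - c|| <= b
    return 1 if a <= norm(x, c) <= b else 0

def g_tcsrf(c, a, b, x):
    # ternary CSRF neuron: 1 on the surround, -1 inside the center, 0 outside
    d = norm(x, c)
    if a <= d <= b:
        return 1
    if d < a:
        return -1
    return 0

def g_brbf(c, b, x):
    # binary RBF neuron: indicator of the ball ||x - c|| <= b
    return 1 if norm(x, c) <= b else 0

# Center radius 0.5, surround radius 2.0, center at the origin:
print(g_bcsrf((0, 0), 0.5, 2.0, (1, 0)))    # 1: point lies in the surround region
print(g_tcsrf((0, 0), 0.5, 2.0, (0.1, 0)))  # -1: point lies in the center region
print(g_brbf((0, 0), 2.0, (1, 0)))          # 1: point lies in the ball
```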
In neurobiology, models of on-center off-surround cells are equally important (see, e.g., Tessier-Lavigne, 1991; Nicholls et al., 1992). In these neurons, the activity in the surround region is lower than in the center.
Figure 1: Receptive field of discrete neurons with center c, center radius a, and surround radius b. Output values A = 0, B = 1 correspond to a binary CSRF neuron; A = −1, B = 1 yields a ternary CSRF neuron; and a binary RBF neuron is given by A = B = 1. Outside the regions labeled A or B, the output is always 0.
Since we are considering networks that are weighted combinations of neurons, such a definition is redundant here. An on-center off-surround neuron in a network can be replaced by an off-center on-surround neuron, and vice versa, by simply multiplying the weight outgoing from it with a negative number. The two types of continuous local receptive field neurons considered in this article are defined as follows. A gaussian RBF neuron computes the function g_RBF: R^{2n+1} → R defined as

g_RBF(c, σ, x) = exp(−‖x − c‖² / σ²),

with input variables x_1, …, x_n and parameters c_1, …, c_n and σ > 0. Here (c_1, …, c_n) is the center and σ the width. A DOG neuron is defined as a function g_DOG: R^{2n+4} → R computed by the weighted difference of two RBF neurons with equal centers, that is,

g_DOG(c, σ, τ, α, β, x) = α g_RBF(c, σ, x) − β g_RBF(c, τ, x),

where σ, τ > 0. Examples of g_RBF and g_DOG for input dimension 2 are shown in Figure 2.
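The two continuous neurons admit an equally direct transcription (again a Python sketch of ours, not code from the article). Note that at the common center the DOG value reduces to α − β:

```python
import math

def g_rbf(c, sigma, x):
    # gaussian RBF neuron: exp(-||x - c||^2 / sigma^2)
    sq = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-sq / sigma ** 2)

def g_dog(c, sigma, tau, alpha, beta, x):
    # DOG neuron: weighted difference of two gaussians with equal centers
    return alpha * g_rbf(c, sigma, x) - beta * g_rbf(c, tau, x)

print(g_rbf((0.0, 0.0), 1.0, (0.0, 0.0)))                  # 1.0
print(g_dog((0.0, 0.0), 1.0, 2.0, 1.0, 0.5, (0.0, 0.0)))   # 0.5, i.e., alpha - beta
```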
Figure 2: Receptive field functions computed by a gaussian radial basis function neuron (left) and a difference-of-gaussians neuron (right).
The neural networks we are studying are of the feedforward type and have one hidden layer. They compute functions of the form f: R^{W+n} → R, where W is the number of network parameters, n the number of input nodes, and f is defined as

f(w, y, x) = w_0 + w_1 h_1(y, x) + ⋯ + w_k h_k(y, x).

The k hidden nodes may compute any of the functions defined above, that is, h_1, …, h_k ∈ {g_bCSRF, g_tCSRF, g_bRBF, g_RBF, g_DOG}. The parameters of the hidden nodes are gathered in y, from which each node selects its own parameters. The network has a linear output node with parameters w_0, …, w_k, also known as the output weights. The parameter −w_0 is also called the output threshold. For simplicity, we sometimes refer to all network parameters as weights. If h_i = g_RBF for i = 1, …, k, we have the standard form of a gaussian RBF neural network.
2.2 Vapnik-Chervonenkis Dimension of Neural Networks. A dichotomy of a set S ⊆ R^n is a pair (S_0, S_1) of subsets such that S_0 ∩ S_1 = ∅ and S_0 ∪ S_1 = S. A class F of functions mapping R^n to {0, 1} is said to shatter S if every dichotomy (S_0, S_1) of S is induced by some f ∈ F, in the sense that f satisfies f(S_0) ⊆ {0} and f(S_1) ⊆ {1}. The function sgn: R → {0, 1} satisfies sgn(x) = 1 if x ≥ 0, and sgn(x) = 0 otherwise.
Definition 1. Let N be a neural network and F be the class of functions computed by N. The Vapnik-Chervonenkis (VC) dimension of N is the cardinality of the largest set shattered by the class {sgn ∘ f: f ∈ F}.
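To make the definition concrete, a small sketch of ours (not from the article) enumerates the dichotomies that a single binary RBF neuron on the real line induces on the three points 0, 1, 2, searching over a coarse parameter grid. Since the neuron's ball is an interval, the dichotomy with S_1 = {0, 2} and S_0 = {1} is never induced, so this three-point set is not shattered:

```python
from itertools import product

def g_brbf(c, b, x):
    # binary RBF neuron on the line: indicator of the interval [c - b, c + b]
    return 1 if abs(x - c) <= b else 0

points = [0.0, 1.0, 2.0]
# Candidate parameters on a coarse grid (sufficient for this illustration)
centers = [i * 0.25 for i in range(-4, 13)]
radii = [i * 0.25 for i in range(1, 17)]

def inducible(labels):
    # Is this dichotomy of the three points induced by some single neuron?
    return any(all(g_brbf(c, b, x) == y for x, y in zip(points, labels))
               for c, b in product(centers, radii))

induced = [labels for labels in product([0, 1], repeat=3) if inducible(labels)]
print(len(induced))          # 7 of the 8 dichotomies are induced
print(inducible((1, 0, 1)))  # False: an interval containing 0 and 2 contains 1
```

The same reasoning shows that a single interval indicator shatters any two points but no three points on a line.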
The pseudodimension and the fat-shattering dimension are generalizations of the VC dimension that apply in particular to real-valued function classes. The lower bounds for local receptive field networks presented in this article are stated for the VC dimension, but they also hold for the pseudodimension and the fat-shattering dimension. The bound on the pseudodimension follows from the fact that the VC dimension of a neural network is by definition not larger than its pseudodimension. The bound on the fat-shattering dimension is implied because the output weights of the neural networks considered here can be scaled arbitrarily. The definition of the pseudodimension will be given in section 5.2, where we establish a linear upper bound for the single gaussian RBF neuron. We refer the reader to Anthony and Bartlett (1999) for a definition of the fat-shattering dimension and results about the relationship between these three notions of dimension.
2.3 Further Concepts and Notation. An (n − 1)-dimensional hyperplane in R^n is represented by a vector (w_0, …, w_n) ∈ R^{n+1} and defined as the set
{x ∈ R^n: w_0 + w_1 x_1 + ⋯ + w_n x_n = 0}.

An (n − 1)-dimensional hypersphere in R^n is given by a center c ∈ R^n and a radius r > 0, and defined as the set {x ∈ R^n: ‖x − c‖ = r}. We clearly distinguish the hypersphere from a ball, which is defined as the set {x ∈ R^n: ‖x − c‖ ≤ r}. We also consider hyperplanes and hyperspheres in R^n with a dimension k < n − 1. In this case, a k-dimensional hyperplane is the intersection of two (k + 1)-dimensional hyperplanes, assuming that the intersection is nonempty. Similarly, the nonempty intersection of two (k + 1)-dimensional hyperspheres yields a k-dimensional hypersphere, provided that the intersection is not a single point. We use “ln” to denote the natural logarithm and “log” for the logarithm to base 2.
3 Superlinear Lower Bounds for Networks of Discrete Neurons
In this section we establish superlinear lower bounds for networks consisting of discrete versions of local receptive field neurons. Crucial is the result for binary CSRF networks, presented in section 3.2, from which the bounds
Figure 3: Positive and negative examples for sets in spherically general position (see definition 2). The set {x1, x2, x3} has a line passing through its points, and the set {x1, x2, x4, x5} lies on a circle. Hence, any set that includes one of these sets (or both) is not in spherically general position, since they violate conditions 1 and 2, respectively. A positive example is the set {x2, x3, x4, x5, x6}.
for ternary CSRF networks and binary RBF networks in sections 3.3 and 3.4, respectively, follow. First, however, we introduce a geometric property of certain finite sets of points.
3.1 Geometric Preliminaries
Definition 2. A set S of m points in R^n is said to be in spherically general position if the following two conditions are satisfied:
1. For every k ≤ min(n, m − 1) and every (k + 1)-element subset P ⊆ S, there is no (k − 1)-dimensional hyperplane containing all points in P.
2. For every l ≤ min(n, m − 2) and every (l + 2)-element subset Q ⊆ S, there is no (l − 1)-dimensional hypersphere containing all points in Q.
The definition is illustrated by Figure 3, showing six points in R². The entire set is not in spherically general position, as witnessed by the line and the circle. It is easy, but may take a while, to verify that the set {x2, x3, x4, x5, x6} is indeed in spherically general position. Sets satisfying only condition 1 are commonly referred to as being in general position (see, e.g., Cover, 1965; Nilsson, 1990). Thus, a set in spherically general position is in particular in general position. (The converse does not always hold, as can be seen from Figure 3: the set {x1, x2, x4, x5} meets condition 1 but not condition 2.) For establishing the superlinear lower bounds
on the VC dimension, we require sets in spherically general position with sufficiently many elements. It is easy to show that for any dimension n, there exist arbitrarily large such sets. The proof of the following proposition provides a method for constructing them.
Proposition 1. For every n, m ≥ 1 there exists a set S ⊆ R^n of m points in spherically general position.
Proof. We perform induction on m. Clearly, every point trivially satisfies conditions 1 and 2. Assume that some set S ⊆ R^n of cardinality m has been constructed. Then by the induction hypothesis, for every k ≤ min(n, m), every k-element subset P ⊆ S does not lie on a hyperplane of dimension less than k − 1. Hence, every P ⊆ S, |P| = k ≤ min(n, m), uniquely specifies a (k − 1)-dimensional hyperplane H_P that includes P. The induction hypothesis implies further that no point in S∖P lies on H_P. Analogously, for every l ≤ min(n, m − 1), every (l + 1)-element subset Q ⊆ S does not lie on a hypersphere of dimension less than l − 1. Thus, every Q ⊆ S, |Q| = l + 1 ≤ min(n, m − 1) + 1, uniquely determines an (l − 1)-dimensional hypersphere B_Q containing all points in Q and none of the points in S∖Q. To obtain a set of cardinality m + 1 in spherically general position, we observe that the union of all hyperplanes and hyperspheres considered above, that is, the union of all H_P and all B_Q over all subsets P and Q, has Lebesgue measure 0. Hence, there is some point s ∈ R^n not contained in any hyperplane H_P and not contained in any hypersphere B_Q. By adding s to S, we then obtain a set of cardinality m + 1 in spherically general position.
3.2 Networks of Binary CSRF Neurons. The following theorem is the main step in establishing the superlinear lower bound.
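As an aside, for distinct points in the plane the two conditions of definition 2 reduce to checks for collinear triples and cocircular quadruples, which can be verified numerically. The following sketch is ours, with coordinates of our own choosing patterned after Figure 3 (x1, x2, x3 collinear; x1, x2, x4, x5 cocircular):

```python
from itertools import combinations

def collinear(p, q, r):
    # three points in the plane lie on a line iff this 2x2 determinant vanishes
    return (q[0]-p[0])*(r[1]-p[1]) - (q[1]-p[1])*(r[0]-p[0]) == 0

def cocircular(p, q, r, s):
    # four points lie on a circle iff the 4x4 "lifting" determinant vanishes
    def row(t):
        return [t[0]**2 + t[1]**2, t[0], t[1], 1.0]
    m = [row(p), row(q), row(r), row(s)]
    def det3(a):
        return (a[0][0]*(a[1][1]*a[2][2]-a[1][2]*a[2][1])
              - a[0][1]*(a[1][0]*a[2][2]-a[1][2]*a[2][0])
              + a[0][2]*(a[1][0]*a[2][1]-a[1][1]*a[2][0]))
    d = 0.0
    for i in range(4):  # Laplace expansion along the first column
        minor = [m[j][1:] for j in range(4) if j != i]
        d += (-1)**i * m[i][0] * det3(minor)
    return abs(d) < 1e-9

def in_sgp(points):
    # for distinct points in the plane, definition 2 reduces to these two checks
    return (not any(collinear(*t) for t in combinations(points, 3))
        and not any(cocircular(*q) for q in combinations(points, 4)))

x1, x2, x3, x4, x5, x6 = (0, 0), (2, 0), (4, 0), (0, 2), (2, 2), (5, 3)
print(in_sgp([x1, x2, x3, x4, x5, x6]))  # False: the full set violates both conditions
print(in_sgp([x2, x3, x4, x5, x6]))      # True: this subset is in spherically general position
```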
Theorem 1. Let h, q, m ≥ 1 be arbitrary natural numbers. Suppose N is a network with one hidden layer consisting of binary CSRF neurons, where the number of hidden nodes is h + 2^q and the number of input nodes is m + q. Assume further that the output node is linear. Then there exists a set of cardinality hq(m + 1) shattered by N. This holds even if the output weights of N are fixed to 1.
Proof. Before starting with the details, we give a brief outline. The main
idea is to imagine the set we want to shatter as being composed of groups of vectors, where the groups are distinguished by means of the first m components and the remaining q components identify the group members. We catch these groups by hyperspheres such that each hypersphere is responsible for up to m + 1 groups. The condition of spherically general position will ensure that this operation works. The hyperspheres are then expanded to become surround regions of off-center on-surround neurons. To induce a dichotomy of the given set, we split the groups. We do this for each group
using the q last components in such a way that the points with designated output 1 stay within the surround region of the respective neuron and the points with designated output 0 are expelled from it. In order for this to succeed, we have to make sure that the displaced points do not fall into the surround region of some other neuron. The verification of the split operation will constitute the major part of the proof.
Let us first choose the vectors. By means of proposition 1, we select a set {s_1, …, s_{h(m+1)}} ⊆ R^m in spherically general position. Let e_1, …, e_q denote the unit vectors in R^q, that is, those with a 1 in exactly one component and 0 elsewhere. We define the set S by

S = {s_i: i = 1, …, h(m + 1)} × {e_j: j = 1, …, q}.

Clearly, S is a subset of R^{m+q} and has cardinality hq(m + 1). It remains to show that S is shattered by N. Let (S_0, S_1) be some arbitrary dichotomy of S. Consider an enumeration M_1, …, M_{2^q} of all subsets of the set {1, …, q}. Let the function f: {1, …, h(m + 1)} → {1, …, 2^q} be defined by

M_{f(i)} = {j: s_i e_j ∈ S_1},

where s_i e_j denotes the vector resulting from the concatenation of s_i and e_j. We use f to define a partition of {s_1, …, s_{h(m+1)}} into sets T_k for k = 1, …, 2^q by

T_k = {s_i: f(i) = k}.
We further partition each set T_k into subsets T_{k,p} for p = 1, …, ⌈|T_k| / (m + 1)⌉, where each subset T_{k,p} has cardinality m + 1 except if m + 1 does not divide |T_k|, in which case there is exactly one subset of cardinality less than m + 1. Since there are at most h(m + 1) elements s_i, the partitioning of all T_k results in no more than h subsets of cardinality m + 1. Further, the fact k ≤ 2^q permits at most 2^q subsets of cardinality less than m + 1. Thus, there are no more than h + 2^q subsets T_{k,p}. We employ one hidden node H_{k,p} for each subset T_{k,p}. Thus, we get by with h + 2^q hidden nodes in N as claimed. Since {s_1, …, s_{h(m+1)}} is in spherically general position, there exists for each T_{k,p} an (m − 1)-dimensional hypersphere containing all points in T_{k,p} and no other point. If |T_{k,p}| = m + 1, this hypersphere is unique; if |T_{k,p}| < m + 1, there is a unique (|T_{k,p}| − 2)-dimensional hypersphere that can be extended to an (m − 1)-dimensional hypersphere that does not contain any further point. (Note that we require condition 1 of definition 2; otherwise, no hypersphere of dimension |T_{k,p}| − 2 including all points of T_{k,p} might exist.) Clearly, if |T_{k,p}| = 1, we can also extend this single point to an (m − 1)-dimensional hypersphere not including any further point. Suppose that (c_{k,p}, r_{k,p}) with center c_{k,p} and radius r_{k,p} represents the hypersphere associated with subset T_{k,p}. It is obvious from the construction above that all radii satisfy r_{k,p} > 0. Further, since the subsets T_{k,p} are pairwise
disjoint, there is some ε > 0 such that every point s_i ∈ {s_1, …, s_{h(m+1)}} and every just defined hypersphere (c_{k,p}, r_{k,p}) satisfy

if s_i ∉ T_{k,p}, then | ‖s_i − c_{k,p}‖ − r_{k,p} | > ε.  (3.1)

In other words, ε is smaller than the distance between any s_i and any hypersphere (c_{k,p}, r_{k,p}) that does not contain s_i. Without loss of generality, we assume that ε is sufficiently small such that

ε ≤ min_{k,p} r_{k,p}.  (3.2)

The parameters of the hidden nodes are adjusted as follows. We define the center ĉ_{k,p} = (ĉ_{k,p,1}, …, ĉ_{k,p,m+q}) of hidden node H_{k,p} by assigning the vector c_{k,p} to the first m components and specifying the remaining ones by

ĉ_{k,p,m+j} = 0 if j ∈ M_k, and ĉ_{k,p,m+j} = −ε²/4 otherwise,

for j = 1, …, q. We further define new radii r̂_{k,p} by

r̂_{k,p} = sqrt( r_{k,p}² + (q − |M_k|)(ε/2)⁴ + 1 )

and choose some γ > 0 satisfying

γ ≤ min_{k,p} ε² / (8 r̂_{k,p}).  (3.3)
The center and surround radii â_{k,p}, b̂_{k,p} of the hidden nodes are then specified as

â_{k,p} = r̂_{k,p} − γ,  b̂_{k,p} = r̂_{k,p} + γ.

Note that â_{k,p} > 0 holds, because ε² < r̂_{k,p}² implies γ < r̂_{k,p}. This completes the assignment of parameters to the hidden nodes H_{k,p}.
We now derive two inequalities concerning the relationship between ε and γ that we need in the following. First, we estimate ε²/2 from below by

ε²/2 > ε²/4 + ε²/64 > ε²/4 + ε⁴/(8 r̂_{k,p})²  for all k, p,

where the last inequality is obtained from ε² < r̂_{k,p}². Using equation 3.3 for both terms on the right-hand side, we get

ε²/2 > 2 r̂_{k,p} γ + γ²  for all k, p.  (3.4)

Second, from equation 3.2, we get

−r_{k,p} ε + ε²/2 < −ε²/4  for all k, p,

and equation 3.3 yields

−ε²/4 ≤ −2 r̂_{k,p} γ  for all k, p.

Putting the last two inequalities together and adding γ² to the right-hand side, we obtain

−r_{k,p} ε + ε²/2 < −2 r̂_{k,p} γ + γ²  for all k, p.  (3.5)
We next establish three facts about the hidden nodes.
Claim i. Let s_i e_j be some point and T_{k,p} some subset where s_i ∈ T_{k,p} and j ∈ M_k. Then hidden node H_{k,p} outputs 1 on s_i e_j.
According to the definition of ĉ_{k,p}, if j ∈ M_k, we have

‖s_i e_j − ĉ_{k,p}‖² = ‖s_i − c_{k,p}‖² + (q − |M_k|)(ε/2)⁴ + 1.

The condition s_i ∈ T_{k,p} implies ‖s_i − c_{k,p}‖² = r_{k,p}², and thus

‖s_i e_j − ĉ_{k,p}‖² = r_{k,p}² + (q − |M_k|)(ε/2)⁴ + 1 = r̂_{k,p}².

It follows that ‖s_i e_j − ĉ_{k,p}‖ = r̂_{k,p}, and since â_{k,p} < r̂_{k,p} < b̂_{k,p}, point s_i e_j lies within the surround region of node H_{k,p}. Hence, claim i is shown.
Claim ii. Let s_i e_j and T_{k,p} satisfy s_i ∈ T_{k,p} and j ∉ M_k. Then hidden node H_{k,p} outputs 0 on s_i e_j.
From the assumptions, we get here

‖s_i e_j − ĉ_{k,p}‖² = ‖s_i − c_{k,p}‖² + (q − |M_k| − 1)(ε/2)⁴ + (1 + ε²/4)²
= r_{k,p}² + (q − |M_k|)(ε/2)⁴ + 1 + ε²/2
= r̂_{k,p}² + ε²/2.

Employing equation 3.4 on the right-hand side results in

‖s_i e_j − ĉ_{k,p}‖² > r̂_{k,p}² + 2 r̂_{k,p} γ + γ².
Hence, taking square roots, we have ‖s_i e_j − ĉ_{k,p}‖ > r̂_{k,p} + γ, implying that s_i e_j lies outside the surround region of H_{k,p}. Thus, claim ii follows.
Claim iii. Let s_i e_j be some point and T_{k,p} some subset such that s_i ∈ T_{k,p}. Then every hidden node H_{k′,p′} with (k′, p′) ≠ (k, p) outputs 0 on s_i e_j.
Since s_i ∈ T_{k,p} and s_i is not contained in any other subset T_{k′,p′}, condition 3.1 implies

‖s_i − c_{k′,p′}‖² > (r_{k′,p′} + ε)²  or  ‖s_i − c_{k′,p′}‖² < (r_{k′,p′} − ε)².  (3.6)
We distinguish between two cases: whether j ∈ M_{k′} or not.
Case 1. If j ∈ M_{k′}, then by the definition of ĉ_{k′,p′} we have

‖s_i e_j − ĉ_{k′,p′}‖² = ‖s_i − c_{k′,p′}‖² + (q − |M_{k′}|)(ε/2)⁴ + 1.

From this, using equation 3.6 and the definition of r̂_{k′,p′}, we obtain

‖s_i e_j − ĉ_{k′,p′}‖² > r̂_{k′,p′}² + 2 r_{k′,p′} ε + ε²  or  ‖s_i e_j − ĉ_{k′,p′}‖² < r̂_{k′,p′}² − 2 r_{k′,p′} ε + ε².  (3.7)

Doubling equation 3.4 yields ε² > 4 r̂_{k′,p′} γ + 2γ², which, after adding 2 r_{k′,p′} ε to the left-hand side and halving the right-hand side, gives

2 r_{k′,p′} ε + ε² > 2 r̂_{k′,p′} γ + γ².  (3.8)
From equation 3.2, we get ε²/2 < r_{k′,p′} ε, that is, the left-hand side of equation 3.5 is negative. Hence, we may double it to obtain from equation 3.5,

−2 r_{k′,p′} ε + ε² < −2 r̂_{k′,p′} γ + γ².

Using this and equation 3.8 in 3.7 leads to

‖s_i e_j − ĉ_{k′,p′}‖² > (r̂_{k′,p′} + γ)²  or  ‖s_i e_j − ĉ_{k′,p′}‖² < (r̂_{k′,p′} − γ)².

And this is equivalent to

‖s_i e_j − ĉ_{k′,p′}‖ > b̂_{k′,p′}  or  ‖s_i e_j − ĉ_{k′,p′}‖ < â_{k′,p′},

meaning that H_{k′,p′} outputs 0.
Case 2. If j ∉ M_{k′}, then

‖s_i e_j − ĉ_{k′,p′}‖² = ‖s_i − c_{k′,p′}‖² + (q − |M_{k′}|)(ε/2)⁴ + 1 + ε²/2.

As a consequence of this, together with equation 3.6 and the definition of r̂_{k′,p′}, we get

‖s_i e_j − ĉ_{k′,p′}‖² > r̂_{k′,p′}² + 2 r_{k′,p′} ε + ε² + ε²/2  or  ‖s_i e_j − ĉ_{k′,p′}‖² < r̂_{k′,p′}² − 2 r_{k′,p′} ε + ε² + ε²/2.  (3.9)

Since equation 3.2 implies ε² ≤ r_{k′,p′} ε, this yields

‖s_i e_j − ĉ_{k′,p′}‖² > r̂_{k′,p′}² + r_{k′,p′} ε + ε²/2  or  ‖s_i e_j − ĉ_{k′,p′}‖² < r̂_{k′,p′}² − r_{k′,p′} ε + ε²/2.

From equation 3.4 we get r_{k′,p′} ε + ε²/2 > 2 r̂_{k′,p′} γ + γ², and, employing this together with equation 3.5, we obtain from equation 3.9

‖s_i e_j − ĉ_{k′,p′}‖² > (r̂_{k′,p′} + γ)²  or  ‖s_i e_j − ĉ_{k′,p′}‖² < (r̂_{k′,p′} − γ)²,

which holds if and only if

‖s_i e_j − ĉ_{k′,p′}‖ > b̂_{k′,p′}  or  ‖s_i e_j − ĉ_{k′,p′}‖ < â_{k′,p′}.

This shows that H_{k′,p′} outputs 0 also in this case. Thus, claim iii is established.
We complete the network N by connecting every hidden node with weight 1 to the output node, which then computes the sum of the hidden node output values. We finally show that we have indeed obtained a network that induces the dichotomy (S_0, S_1). Assume that s_i e_j ∈ S_1. Claims i, ii, and iii imply that there is exactly one hidden node H_{k,p}, namely, one satisfying k = f(i) by the definition of f, that outputs 1 on s_i e_j. Hence, the network outputs 1 as well. On the other hand, if s_i e_j ∈ S_0, it follows from claims ii and iii that none of the hidden nodes outputs 1. Therefore, the network output is 0. Thus, N shatters S with output threshold 1/2, and the proof is completed.
The construction in the previous proof was based on the assumption that the difference between center radius and surround radius, given by the value 2γ, can be made sufficiently small. This may require constraints on the precision of computation that are not available in natural or artificial systems. It is possible, however, to obtain the same result even if there is a lower bound on the difference of the radii. One simply has to scale the elements of the shattered set by a sufficiently large factor.
We apply the result now to obtain a superlinear lower bound for the VC dimension of networks with center-surround receptive field neurons. By ⌊x⌋ we denote the largest integer less than or equal to x.
Corollary 1. Suppose N is a network with one hidden layer of k binary CSRF neurons and input dimension n ≥ 2, where k ≤ 2^n, and assume that the output
node is linear. Then N has VC dimension at least

⌊k/2⌋ · ⌊log(k/2)⌋ · (n − ⌊log(k/2)⌋ + 1).

This holds even if the weights of the output node are not adjustable.
Proof. We use theorem 1 with h = ⌊k/2⌋, q = ⌊log(k/2)⌋, and m = n − ⌊log(k/2)⌋. The condition k ≤ 2^n guarantees that m ≥ 1. Then there is a set
of cardinality

hq(m + 1) = ⌊k/2⌋ · ⌊log(k/2)⌋ · (n − ⌊log(k/2)⌋ + 1)

that is shattered by the network specified in theorem 1. Since the number of hidden nodes is h + 2^q ≤ k and the input dimension is m + q = n, the network satisfies the required conditions. Furthermore, it was shown in the proof of theorem 1 that all weights of the output node can be fixed to 1. Hence, they need not be adjustable.
VC dimension bounds for neural networks are often expressed in terms of the number of weights and the network size. In the following, we give a lower bound of this kind.
Corollary 2. Consider a network N with input dimension n ≥ 2, one hidden layer of k binary CSRF neurons, where k ≤ 2^{n/2}, and a linear output node. Let W = k(n + 2) + k + 1 denote the number of weights. Then N has VC dimension at least
(W/5) log(k/4).

This holds even in the case when the weights of the output node are fixed.
Proof. According to corollary 1, N has VC dimension at least ⌊k/2⌋ · ⌊log(k/2)⌋ · (n − ⌊log(k/2)⌋ + 1). The condition k ≤ 2^{n/2} implies

n − ⌊log(k/2)⌋ + 1 ≥ (n + 4)/2.
We may assume that k ≥ 5. (The statement is trivial for k ≤ 4.) It follows, using ⌊k/2⌋ ≥ (k − 1)/2 and k/10 ≥ 1/2, that

⌊k/2⌋ ≥ 2k/5.

Finally, we have

⌊log(k/2)⌋ ≥ log(k/2) − 1 = log(k/4).

Hence, N has VC dimension at least (n + 4)(k/5) log(k/4), which is at least as large as the claimed bound (W/5) log(k/4).
In the networks considered thus far, the input dimension was assumed to be variable. It is an easy consequence of theorem 1 that even when n is constant, the VC dimension still grows linearly in terms of the network size.
Corollary 3. Assume that the input dimension is fixed and consider a network N with one hidden layer of binary CSRF neurons and a linear output node. Then the VC dimension of N is Ω(k) and Ω(W), where k is the number of hidden nodes and W the number of weights. This even holds in the case of fixed output weights.
Proof. Choose m, q ≥ 1 such that m + q ≤ n, and let h = k − 2^q. Since n is
constant, hq(m + 1) is Ω(k). Thus, according to theorem 1, there is a set of cardinality Ω(k) shattered by N. Since the number of weights is k(n + 3) + 1, which is O(k), the lower bound Ω(W) also follows.
3.3 Networks of Ternary CSRF Neurons. The results from the previous section now easily allow deriving similar bounds for ternary CSRF neurons. The following statement is the counterpart of theorem 1.
Theorem 2. Suppose N is a network with one hidden layer consisting of ternary CSRF neurons and a linear output node. Let 2(h + 2^q) be the number of hidden nodes and m + q the number of input nodes, where h, m, q ≥ 1. Then there exists a set of cardinality hq(m + 1) shattered by N. This holds even for fixed output weights.
Proof. The idea is to use the same set S as in the proof of theorem 1 and
to simulate the behavior of a binary neuron by two ternary neurons. Let N̂ be the network constructed in the proof of theorem 1. Assume first that we have off-center on-surround neurons available for the construction of N. For every hidden node Ĥ of N̂, we introduce two hidden nodes H, H′ for N, defining their parameters as follows: Node H gets the same center and
radii as Ĥ. Node H′ also gets the same center, but for the center radius we choose 0, and the surround radius is defined to be the center radius of Ĥ. Formally, H′ can be regarded as an off-center on-surround neuron without center region. It is easy to see that on any input vector from S, the sum of the output values of H and H′ is equal to the output value of Ĥ. Here we use the fact that no point in S lies on the radii of any hidden node of N̂. Hence, by what was shown in theorem 1, a dichotomy (S_0, S_1) is accomplished by the sum of the output values of the hidden nodes, being 0 for elements of S_0 and 1 for elements of S_1. In case we are dealing with on-center off-surround neurons, we use the property that they are negatives of off-center on-surround neurons. Thus, defining the corresponding output weight to be −1 instead of 1, we obtain the same result.
In analogy to the previous section, we are now able to infer three lower bounds: a superlinear bound in terms of input dimension and network size, a superlinear bound in terms of weight number and network size, and a linear bound for fixed input dimension.
Corollary 4. Suppose N is a network with input dimension n ≥ 2, one hidden layer of k ≤ 2^{n+1} ternary CSRF neurons, and a linear output node. Then N has
VC dimension at least
µ ¶ µ ³ ´¶ ³ µ ³ ´¶ ´ k k k C1 , ¢ log ¢ n ¡ log 4 4 4 even for xed output weights. Proof. From k · 2n C 1 follows that blog(k / 4) c · n ¡ 1. Hence, applying
theorem 2 with h D bk/ 4c, q D blog(k/ 4)c, m D n ¡blog(k / 4) c, and observing that 2(h C 2q ) · k, we immediately obtain the claimed result.
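For concrete network sizes, the parameter choice in this proof and the resulting shattered-set size can be evaluated directly; a minimal sketch (the function name is ours, not from the paper):

```python
import math

def corollary4_bound(n, k):
    """VC dimension lower bound of corollary 4 for a one-hidden-layer
    network of k ternary CSRF neurons on n >= 2 inputs, with k <= 2**(n+1)."""
    assert n >= 2 and k <= 2 ** (n + 1)
    q = int(math.log2(k / 4))            # q = floor(log(k/4))
    h = k // 4                           # h = floor(k/4)
    m = n - q                            # m = n - floor(log(k/4))
    # conditions used in the proof of corollary 4
    assert q <= n - 1 and 2 * (h + 2 * q) <= k
    return h * q * (m + 1)               # shattered-set size hq(m+1)

print(corollary4_bound(5, 64))           # h=16, q=4, m=1
```

For n = 5 and k = 64 the bound evaluates to 16 · 4 · 2 = 128, already well above the number of hidden nodes.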
Corollary 5. Consider a network N with input dimension n ≥ 2, one hidden layer of k ternary CSRF neurons, where k ≤ 2^{(n+2)/2}, and a linear output node. Let W = k(n+3) + 1 denote the number of weights. Then N has VC dimension at least
    (W/12) · log(k/8),

even for fixed output weights.
Proof. Since k ≤ 2^{(n+2)/2}, we have n − ⌊log(k/4)⌋ + 1 ≥ (n+4)/2. Further, ⌊k/4⌋ ≥ (k−3)/4 implies for k ≥ 9 that ⌊k/4⌋ ≥ k/6. (The statement is trivial for k ≤ 8.) Using these estimates in the bound of corollary 4 together with ⌊log(k/4)⌋ ≥ log(k/8) gives the result.
Corollary 6. Assume that the input dimension n ≥ 2 is fixed, and consider a network N with one hidden layer of ternary CSRF neurons and a linear output node. Then the VC dimension of N is Ω(k) and Ω(W), where k is the number of hidden nodes and W the number of weights. This holds even for fixed output weights.
Proof. The result can be deduced from theorem 2 by analogy with corollary 3.

3.4 Networks of Binary RBF Neurons. Finally, we consider the third variant of a discrete local receptive field neuron and show that networks of binary RBF neurons also respect the bounds established above for ternary CSRF neurons.
Theorem 3. Suppose N is a network with one hidden layer consisting of binary RBF neurons and a linear output node. Let n ≥ 2 be the input dimension, k the number of hidden nodes, and assume that k ≤ 2^{n+1}. Then N has VC dimension at least
    ⌊k/4⌋ · ⌊log(k/4)⌋ · (n − ⌊log(k/4)⌋ + 1).

Let W denote the number of weights and assume that k ≤ 2^{(n+2)/2}. Then the VC dimension of N is at least

    (W/12) · log(k/8).

For fixed input dimension n ≥ 2, the VC dimension of N satisfies the bounds Ω(k) and Ω(W). All these bounds are valid even when the output weights are fixed.

Proof. The main idea is to employ two binary RBF neurons for the simulation of one binary CSRF neuron. This is easy to achieve. Given a neuron of the latter type, we provide the two RBF neurons with its center and assign its center radius to the first and its surround radius to the second neuron. If we give output weights −1 and 1 to the first and second neuron, respectively, then it is clear that on points not lying on the radii, the summed output of the weighted RBF neurons is equivalent to the output of the CSRF neuron.
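This simulation step can be checked numerically: away from the two radii, the weighted sum of the two ball indicators reproduces the annulus response of the binary CSRF neuron. A small sketch under our reading of the definitions in section 2.1 (function names are ours):

```python
import random

def binary_rbf(center, radius, x):
    # binary RBF neuron: 1 inside the ball of the given radius, 0 outside
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return 1 if d2 <= radius ** 2 else 0

def binary_csrf(center, a, b, x):
    # binary CSRF neuron: 1 on the annulus a <= ||x - c|| <= b, 0 elsewhere
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return 1 if a ** 2 <= d2 <= b ** 2 else 0

random.seed(0)
c, a, b = (0.0, 0.0), 1.0, 2.0
for _ in range(1000):
    x = (random.uniform(-3, 3), random.uniform(-3, 3))
    # weights -1 and 1 on the inner and outer ball; random points lie on
    # the radii with probability zero, so the two sides agree
    assert -binary_rbf(c, a, x) + binary_rbf(c, b, x) == binary_csrf(c, a, b, x)
print("ok")
```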
Thus, we can do a construction similar to that in the proof of theorem 2, obtaining a network of size twice the original network such that both networks shatter the set from theorem 1. We recall that the networks have the property that the parameters can be chosen such that no point of this set lies on any radius. The consequences stated in corollaries 4 to 6 for ternary CSRF neurons then follow immediately for binary RBF neurons.

4 Superlinear Lower Bounds for Networks of Continuous Neurons
We now turn to networks of continuous local receptive field neurons. In this section, we first establish lower bounds for networks of DOG neurons. Their derivation mainly builds on constructions and results from the previous section. The bounds for RBF networks are then easily obtained.

4.1 Networks of DOG Neurons. We begin by deriving a result in analogy with theorem 1.

Theorem 4. Let h, q, m ≥ 1 be arbitrary natural numbers. Suppose N is a network with m + q input nodes, one hidden layer of h + 2q DOG neurons, and a linear output node. Then there is a set of cardinality hq(m + 1) shattered by N.
Proof. We use ideas and results from the proof of theorem 1. In particular, we show that the set constructed there can be shattered by a network of new model neurons, the so-called extended gaussian neurons, which we introduce below. Then we demonstrate that a network of these extended gaussian neurons can be simulated by a network of DOG neurons, which establishes the statement of the theorem.

We define an extended gaussian neuron with n inputs to compute the function g̃ : R^{2n+2} → R with

    g̃(c, σ, a, x) = 1 − ( a · exp(−‖x − c‖² / σ²) − 1 )²,

where x_1, …, x_n are the input variables and c_1, …, c_n, a, and σ > 0 are real-valued parameters. Thus, the computation of an extended gaussian neuron is performed by scaling the output of a gaussian RBF neuron with a, squaring the difference to 1, and subtracting the square from 1.

Let S ⊆ R^{m+q} be the set of cardinality hq(m + 1) constructed in the proof of theorem 1. In particular, S has the form S = {s_i e_j : i = 1, …, h(m+1); j = 1, …, q}. We have also defined in that proof binary CSRF neurons H_{k,p} as hidden nodes in terms of parameters ĉ_{k,p} ∈ R^{m+q}, which became the centers of the
neurons, and r̂_{k,p} ∈ R, which gave the center radii â_{k,p} = r̂_{k,p} − ε and the surround radii b̂_{k,p} = r̂_{k,p} + ε using some ε > 0. The number of hidden nodes was not larger than h + 2q. We replace the CSRF neurons by extended gaussian neurons G_{k,p} with parameters c_{k,p}, σ_{k,p}, a_{k,p} defined as follows. Assume some σ > 0 that will be specified later. Then we let

    c_{k,p} = ĉ_{k,p},    σ_{k,p} = σ,    a_{k,p} = exp( r̂_{k,p}² / σ² ).

These hidden nodes are connected to the output node with all weights being 1. We call this network N′ and claim that it shatters S. Consider some arbitrary dichotomy (S₀, S₁) of S and some s_i e_j ∈ S. Then node G_{k,p} computes

    g̃(c_{k,p}, σ_{k,p}, a_{k,p}, s_i e_j)
        = 1 − ( a_{k,p} · exp(−‖s_i e_j − c_{k,p}‖² / σ_{k,p}²) − 1 )²
        = 1 − ( exp(r̂_{k,p}² / σ²) · exp(−‖s_i e_j − ĉ_{k,p}‖² / σ²) − 1 )²
        = 1 − ( exp( −(‖s_i e_j − ĉ_{k,p}‖² − r̂_{k,p}²) / σ² ) − 1 )².        (4.1)
Suppose first that s_i e_j ∈ S₁. It was shown by claims i, ii, and iii in the proof of theorem 1 that there is exactly one hidden node H_{k,p} that outputs 1 on s_i e_j. In particular, claim i established that this node satisfies

    ‖s_i e_j − ĉ_{k,p}‖² = r̂_{k,p}².
Hence, according to equation 4.1, node G_{k,p} outputs 1. We note that this holds for all values of σ. Further, the derivations of claims ii and iii yielded that those nodes H_{k,p} that output 0 on s_i e_j satisfy

    ‖s_i e_j − ĉ_{k,p}‖² > (r̂_{k,p} + ε)²  or  ‖s_i e_j − ĉ_{k,p}‖² < (r̂_{k,p} − ε)².        (4.2)
This implies for the computation of G_{k,p} that in equation 4.1, we can make the expression

    exp( −(‖s_i e_j − ĉ_{k,p}‖² − r̂_{k,p}²) / σ² )
as close to 0 as necessary by choosing σ sufficiently small. Since this does not affect the node that outputs 1, network N′ computes a value close to 1 on s_i e_j. On the other hand, for the case s_i e_j ∈ S₀, it was shown in theorem 1 that all nodes H_{k,p} output 0. Thus, each of them satisfies condition 4.2, implying that if σ is sufficiently small, each node G_{k,p}, and hence N′, outputs a value close to 0. Altogether, S is shattered by thresholding the output of N′ at 1/2.

Finally, we show that S can be shattered by a network N of the same size with DOG neurons as hidden nodes. The computation of an extended gaussian neuron can be rewritten as
    g̃(c, σ, a, x) = 1 − ( a · exp(−‖x − c‖² / σ²) − 1 )²
                  = 1 − ( a² · exp(−2‖x − c‖² / σ²) − 2a · exp(−‖x − c‖² / σ²) + 1 )
                  = 2a · exp(−‖x − c‖² / σ²) − a² · exp(−2‖x − c‖² / σ²)
                  = g_DOG(c, σ, σ/√2, 2a, a², x).
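The chain of equalities above can be verified numerically. In the sketch below, g_dog is our rendering of the DOG computation as a weighted difference of two gaussians with a common center (parameter names are ours; section 2.1 of the paper fixes the exact notation):

```python
import math

def g_ext(c, sigma, a, x):
    # extended gaussian neuron: 1 - (a * exp(-||x-c||^2 / sigma^2) - 1)^2
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return 1 - (a * math.exp(-r2 / sigma ** 2) - 1) ** 2

def g_dog(c, s1, s2, w1, w2, x):
    # weighted difference of two gaussians with widths s1, s2 and weights w1, w2
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return w1 * math.exp(-r2 / s1 ** 2) - w2 * math.exp(-r2 / s2 ** 2)

c, sigma, a = (0.5, -1.0), 0.7, 2.3
for x in [(0.0, 0.0), (1.0, 2.0), (0.5, -1.0)]:
    lhs = g_ext(c, sigma, a, x)
    rhs = g_dog(c, sigma, sigma / math.sqrt(2), 2 * a, a ** 2, x)
    assert abs(lhs - rhs) < 1e-12
print("identity holds")
```

The key step is that squaring exp(−r²/σ²) halves σ², which is exactly the width σ/√2 of the second gaussian.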
Hence, the extended gaussian neuron is equivalent to a weighted difference of two gaussian neurons with center c, widths σ and σ/√2, and weights 2a and a², respectively. Thus, the extended gaussian neurons can be replaced by DOG neurons, which completes the proof.

We note that the network of extended gaussian neurons constructed in the previous proof has all output weights fixed, whereas the output weights of the DOG neurons, that is, the parameters a and b in the notation of section 2.1, are calculated from the parameters of the extended gaussian neurons and therefore depend on the particular dichotomy to be implemented. (It is trivial for a DOG network to have an output node with fixed weights since the DOG neurons have built-in output weights.) We are now able to deduce a superlinear lower bound on the VC dimension of DOG networks.

Corollary 7. Suppose N is a network with one hidden layer of DOG neurons and a linear output node. Let N have k hidden nodes and input dimension n ≥ 2, where k ≤ 2^n. Then N has VC dimension at least
    ⌊k/2⌋ · ⌊log(k/2)⌋ · (n − ⌊log(k/2)⌋ + 1).
Let W denote the number of weights and assume that k ≤ 2^{n/2}. Then the VC dimension of N is at least

    (W/5) · log(k/4).

For fixed input dimension, the VC dimension of N satisfies the bounds Ω(k) and Ω(W).
Proof. The results are implied by theorem 4 in the same way as corollaries 1 through 3 follow from theorem 1.

4.2 Networks of Gaussian RBF Neurons. We can now give the answer to the question of Anthony and Holden (1994) quoted in section 1.
Theorem 5. Suppose N is a network with one hidden layer of gaussian RBF neurons and a linear output node. Let k be the number of hidden nodes and n the input dimension, where n ≥ 2 and k ≤ 2^{n+1}. Then N has VC dimension at least
    ⌊k/4⌋ · ⌊log(k/4)⌋ · (n − ⌊log(k/4)⌋ + 1).

Let W denote the number of weights and assume that k ≤ 2^{(n+2)/2}. Then the VC dimension of N is at least

    (W/12) · log(k/8).

For fixed input dimension n ≥ 2, the VC dimension of N satisfies the bounds Ω(k) and Ω(W).
Proof. Clearly, a DOG neuron can be simulated by two gaussian RBF neurons. Thus, by virtue of theorem 4, there is a network N with m + q input nodes and one hidden layer of 2(h + 2q) gaussian RBF neurons that shatters some set of cardinality hq(m + 1). Choosing h = ⌊k/4⌋, q = ⌊log(k/4)⌋, and m = n − ⌊log(k/4)⌋, we obtain similarly to corollary 4 the claimed lower bound in terms of n and k. Furthermore, the stated bound in terms of W and k follows by analogy to the reasoning in corollary 5. Finally, the bound for fixed input dimension is obvious, as in the proof of corollary 3.

Some RBF networks studied theoretically or used in practice have no adjustable width parameters (for instance, Broomhead & Lowe, 1988; Powell, 1992). Therefore, a natural question is whether the previous result also holds for networks with fixed width parameters. The values of the width
parameters for theorem 5 arise from the widths of DOG neurons specified in theorem 4. The two width parameters of each DOG neuron have the form σ and σ/√2, where σ is common to all DOG neurons and is required only to be sufficiently small. Hence, we can choose a single σ that is sufficiently small for all dichotomies to be induced. Thus, for the RBF network, we not only have that the width parameters can be fixed, but even that there need to be only two different width values, depending solely on the architecture and not on the particular dichotomy.

Corollary 8. Let N be a gaussian RBF network with n input nodes and k hidden nodes satisfying the conditions of theorem 5. Then there exists a real number σ_{k,n} > 0 such that the VC dimension bounds stated in theorem 5 hold for N with each RBF neuron having fixed width σ_{k,n} or σ_{k,n}/√2.
With regard to theorem 5, we further remark that k has been previously established as a lower bound for RBF networks by Anthony and Holden (1994). Further, theorem 19 of Lee et al. (1995) in connection with the result of Erlich et al. (1997) implies the lower bound Ω(nk), and hence Ω(k) for fixed input dimension. By means of theorem 5, we are now able to present a lower bound that is even superlinear in k.

Corollary 9. Let n ≥ 2 and N be the network with k = 2^{n+1} hidden gaussian RBF neurons. Then N has VC dimension at least
    (k/3) · log(k/8).

Proof. Since k = 2^{n+1}, we may substitute n = log k − 1 in the first bound of theorem 5. Hence, the VC dimension of N is at least

    ⌊k/4⌋ · ⌊log(k/4)⌋ · ( log k − ⌊log(k/4)⌋ ) ≥ 2 · ⌊k/4⌋ · ⌊log(k/4)⌋.

As in the proof of corollary 5, we use that ⌊k/4⌋ ≥ k/6 and ⌊log(k/4)⌋ ≥ log(k/8). This yields the claimed bound.

5 Bounds for Single Neurons
In this section we consider the three discrete variants of a local receptive field neuron and the gaussian RBF neuron. We show that their VC dimension is at most linear. Furthermore, this bound is asymptotically tight.

5.1 Discrete Neurons. We assume in the following that the output of the ternary CSRF neuron is thresholded at 1/2 or any other fixed value from
the interval (0, 1], to obtain output values in {0, 1}. Thus, we can treat the binary and ternary CSRF neuron similarly. (If the threshold is chosen from the interval [−1, 0], this corresponds to a negated binary RBF neuron and, hence, has the VC dimension of the latter.)

Theorem 6. The VC dimension of a binary RBF neuron with n inputs is equal to n + 1. The VC dimension of a (binary and ternary) center-surround neuron with n inputs is at least n + 1 and at most 4n + 5.
Proof. The class of functions computed by a binary RBF neuron with n inputs can be identified with the class of balls in R^n. Dudley (1979) shows that the VC dimension of this class is equal to n + 1 (see also Wenocur and Dudley, 1981; Assouad, 1983). This gives the result for the RBF neuron. Clearly, a binary and ternary center-surround neuron can simulate the RBF neuron by adjusting the center radius to 0. This implies the lower bound n + 1.

For the upper bound, consider a center-surround neuron with n inputs and assume, without loss of generality, that its output is binary. Let c = (c_1, …, c_n) ∈ R^n be the center and a, b ∈ R the radii. Then if f : R^n → {0, 1} is the function computed by the neuron, on some input vector x = (x_1, …, x_n) ∈ R^n, it satisfies

    f(x) = 1  ⟺  ‖x − c‖ ≥ a  and  ‖x − c‖ ≤ b
           ⟺  (x_1 − c_1)² + ⋯ + (x_n − c_n)² ≥ a²  and
               (x_1 − c_1)² + ⋯ + (x_n − c_n)² ≤ b²
           ⟺  ‖x‖² − 2c_1 x_1 − ⋯ − 2c_n x_n ≥ a² − c_1² − ⋯ − c_n²  and
               −‖x‖² + 2c_1 x_1 + ⋯ + 2c_n x_n ≥ −b² + c_1² + ⋯ + c_n².
Each of the last two inequalities defines a halfspace in R^{n+1}, both with weights 1, −2c_1, …, −2c_n or the negative thereof, and with thresholds a² − c_1² − ⋯ − c_n² or −b² + c_1² + ⋯ + c_n², respectively. Thus, we have that the number of dichotomies induced by a binary center-surround neuron on some finite subset of R^n is not larger than the number of dichotomies induced by intersections of parallel halfspaces on some subset of R^{n+1} with the same cardinality, where the additional input component is obtained as ‖x‖² for every input vector. Hence, a set shattered by a center-surround neuron in R^n gives rise to a set of the same cardinality shattered by intersections of parallel halfspaces in R^{n+1}. According to theorem 9, which is given in the appendix, the VC dimension of the class of intersections of parallel halfspaces in R^{n+1} is at most 4n + 5. This entails the bound for the center-surround neuron.

In contrast to the RBF neuron, the exact values for the VC dimension of center-surround neurons are not known yet.
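The lifting used in this proof can be made concrete: appending ‖x‖² as an extra coordinate turns the two ball conditions into two parallel halfspace conditions with shared weights. A sketch with hypothetical helper names:

```python
import random

def csrf_output(c, a, b, x):
    # binary center-surround neuron: 1 iff a <= ||x - c|| <= b
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return 1 if a ** 2 <= d2 <= b ** 2 else 0

def lifted_halfspaces(c, a, b, x):
    # lift x to (x, ||x||^2) and test two parallel halfspaces that share
    # the weight vector (-2c_1, ..., -2c_n, 1), as in the proof
    z = list(x) + [sum(xi ** 2 for xi in x)]
    w = [-2 * ci for ci in c] + [1.0]
    dot = sum(wi * zi for wi, zi in zip(w, z))
    t1 = a ** 2 - sum(ci ** 2 for ci in c)   # ||x||^2 - 2c.x >= a^2 - ||c||^2
    t2 = b ** 2 - sum(ci ** 2 for ci in c)   # ||x||^2 - 2c.x <= b^2 - ||c||^2
    return 1 if t1 <= dot <= t2 else 0

random.seed(1)
c, a, b = [1.0, -0.5, 2.0], 0.8, 1.9
for _ in range(1000):
    x = [random.uniform(-3, 3) for _ in range(3)]
    assert csrf_output(c, a, b, x) == lifted_halfspaces(c, a, b, x)
print("lifting agrees")
```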
5.2 Gaussian RBF Neurons. It is easy to see that a thresholded gaussian RBF neuron, that is, one with a fixed output threshold, is equivalent to a binary RBF neuron. Hence, by theorem 6, its VC dimension is equal to the number of its parameters. The pseudodimension generalizes the VC dimension to real-valued function classes and is defined as follows.

Definition 3. Let F be a class of functions mapping R^n to R. The pseudodimension of F is the cardinality of the largest set S ⊆ R^{n+1} shattered by the class {(x, y) ↦ sgn(f(x) − y) : f ∈ F}.

The pseudodimension is a stronger notion than the VC dimension in that an upper bound on the pseudodimension of some function class also yields the same bound on the VC dimension of the thresholded class, whereas the converse need not necessarily be true. Thus, there is no general way of inferring the pseudodimension of a gaussian RBF neuron from the VC dimension of a binary RBF neuron. Nevertheless, the pseudodimension of a single gaussian RBF neuron is linear, as we show now.

Theorem 7. The pseudodimension of a gaussian RBF neuron with n inputs is at least n + 1 and at most n + 2.
Proof. The lower bound easily follows from the facts that a thresholded gaussian RBF neuron can simulate any binary RBF neuron, that a binary RBF neuron has VC dimension n + 1 (see theorem 6), and that the VC dimension is a lower bound for the pseudodimension.

We obtain the upper bound as follows: According to a well-known result (see, e.g., Haussler, 1992, theorem 5), since the function z ↦ exp(−z) is continuous and strictly decreasing, the pseudodimension of the class

    { x ↦ exp(−‖x − c‖² / σ²) : c ∈ R^n, σ ∈ R ∖ {0} }

is equal to the pseudodimension of the class

    { x ↦ ‖x − c‖² / σ² : c ∈ R^n, σ ∈ R ∖ {0} },

which is, by definitions 1 and 3, equal to the VC dimension of the class

    { (x, y) ↦ sgn( ‖x − c‖² / σ² − y ) : c ∈ R^n, σ ∈ R ∖ {0} }.
This class can also be written as

    { (x, y) ↦ sgn( ‖x‖² − 2c·x + ‖c‖² − σ² y ) : c ∈ R^n, σ ∈ R ∖ {0} }.

Each function in this class has the form (x, y) ↦ sgn(f(x) + g(x, y)) with f(x) = ‖x‖² and g being an affine function in n + 1 variables. Hence, the VC dimension of this class cannot be larger than the VC dimension of the class {sgn(f + g) : g affine, g : R^{n+1} → R}. Wenocur and Dudley (1981) show that if G is a d-dimensional vector space of real-valued functions, then {sgn(f + g) : g ∈ G} has VC dimension d (see also Anthony and Bartlett, 1999, theorem 3.5). Thus, the upper bound follows since the class of affine functions in n + 1 variables is a vector space of dimension n + 2.

6 Upper Bounds for Networks of Discrete Neurons
The following result shows that one-hidden-layer networks of discrete local receptive field neurons have a VC dimension bounded by O(W log k). This implies that the lower bounds established in section 3 are asymptotically tight. For the proof, we employ a method from a similar result for threshold networks.

Theorem 8. Suppose N is a network with one hidden layer of binary RBF neurons and a linear output node. Let k denote the number of hidden nodes and W = nk + 2k + 1 the number of weights. Then the VC dimension of N is at most 2W log((2k + 2)/ln 2). If the hidden nodes are binary or ternary CSRF neurons, the VC dimension is at most 2W log((4k + 2)/ln 2).
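For concrete sizes, the upper bound of theorem 8 can be compared with the lower bound of corollary 4, giving a feel for the remaining O(log k) gap; a sketch (helper names are ours, not from the paper):

```python
import math

def theorem8_upper(n, k, csrf=False):
    """Upper bound 2W*log2((2k+2)/ln 2) for binary RBF hidden nodes, or
    2W*log2((4k+2)/ln 2) for CSRF hidden nodes, with W = nk + 2k + 1."""
    W = n * k + 2 * k + 1
    c = 4 * k + 2 if csrf else 2 * k + 2
    return 2 * W * math.log2(c / math.log(2))

def corollary4_lower(n, k):
    # lower bound of corollary 4: floor(k/4)*floor(log(k/4))*(n-floor(log(k/4))+1)
    q = int(math.log2(k / 4))
    return (k // 4) * q * (n - q + 1)

n, k = 5, 64
print(round(theorem8_upper(n, k)), corollary4_lower(n, k))
```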
Proof. A binary RBF neuron with n inputs has VC dimension n + 1 (see theorem 6). The output node of N is a linear neuron with k inputs; thus, it has VC dimension k + 1. Since each node of N has a VC dimension equal to the number of its parameters, it follows by reasoning similar to that in theorem 6.1 of Anthony and Bartlett (1999) that the number of dichotomies induced by N on a set of cardinality m, where m ≥ W, is at most (em(k + 1)/W)^W. (Note that N has k + 1 computation nodes.) This implies that the VC dimension is at most 2W log((2k + 2)/ln 2).

Consider now the case that the hidden nodes are CSRF neurons. Clearly, a weighted binary or ternary CSRF neuron can be simulated by a weighted combination of two binary RBF neurons. Thus, N can be simulated by a network N′ with 2k binary RBF neurons as hidden nodes. Now observe that each CSRF neuron gives rise to two RBF neurons with the same center. Thus, although the number of nodes and connections in N′ has increased,
the number of parameters is the same as in N. In other words, N′ is a network with equivalences among its weights. Combining a method due to Shawe-Taylor (1995) for networks with equivalences with the above-mentioned derivation by Anthony and Bartlett (1999), we obtain that N′ induces at most (em(2k + 1)/W)^W dichotomies on a set of cardinality m. This results in a VC dimension not larger than 2W log((4k + 2)/ln 2).

7 Conclusion
Local receptive fields occur in many kinds of biological and artificial neural networks. We have studied here several models of local receptive field neurons and have established superlinear VC dimension lower bounds for networks with one hidden layer. Although compared with the previously known linear bounds, at first sight the gain by a logarithmic factor seems exiguous, there are at least two arguments showing that it constitutes a significant improvement. First, the VC dimension is a rather coarse measure. Increasing it by one amounts to doubling the number of functions computed by the network. Second, in a network with the VC dimension linearly bounded from above by the number of weights, each weight can be considered responsible for a particular input vector. Superlinearity implies that each weight manages to get hold of a number of input vectors that increases with the network size. Thus, in networks with superlinear VC dimension, the neurons have found a very effective way to cooperate and coordinate their computations.

The VC dimension yields bounds on the complexity of learning for several models of learnability. For instance, bounds on the computation time or the number of examples required for learning can often be expressed in terms of the VC dimension. If the VC dimension provides a lower bound in a model of learning, then the superlinear lower bounds given here yield new lower bounds on the complexity of learning using local receptive field neural networks. Of course, if the VC dimension serves as an upper bound in a model, there is no immediate consequence. But then one may be encouraged to find other measures that more tightly quantify the complexity of learning in these models.

For the discrete versions of local receptive field neurons, we have shown that the superlinear lower bounds for networks are asymptotically tight. The currently available methods for RBF and DOG networks give rise only to the upper bound O(W²k²) for these networks.
This bound, however, is also valid for networks of unrestricted depth and of sigmoidal neurons. The problems of narrowing the gaps between upper and lower bounds for RBF and sigmoidal networks with one hidden layer therefore seem to be closely related. We have also established tight linear bounds for the VC dimension of single discrete neurons and for the pseudodimension of the gaussian RBF neuron. The VC and pseudodimension of the DOG neuron can be shown to be at most quadratic. We conjecture also that the DOG neuron has linear VC
and pseudodimension, but the methods currently available do not seem to permit an answer.

In the constructions of the sets being shattered, we have permitted arbitrary real vectors. It is not hard to see that rational numbers suffice. It would be interesting to know what happens for even more restrictive inputs such as, for instance, Boolean vectors. We have also allowed that the centers of the local receptive field neurons can be placed anywhere in the input domain. We do not know if the results hold when the centers may not freely float around.

The superlinear bounds involve constant factors that are the largest known for any standard neural network with one hidden layer. This fact could be interpreted as evidence that the cooperative computational capabilities of local receptive field neurons are even higher than those of other neuron types. This statement, however, must be taken with a grain of salt since the constants in these bounds are not yet known to be tight.

Gaussian units are just one type of RBF neuron. The method we have developed for obtaining superlinear lower bounds is of a quite general nature. We therefore expect it to be applicable to other RBF networks as well. The main clue in the result for RBF networks was first to consider CSRF and DOG networks. With this idea, we have established a new kind of link between neurophysiological models and artificial neural networks. This link extends the paradigm of neural computation by demonstrating that models originating from neuroscience lead not only to powerful computing mechanisms but can also be essential in theory, that is, in proofs concerning the computational power of those mechanisms.

Appendix: A VC Dimension Upper Bound for Intersections and Unions of Parallel Halfspaces
We consider the function classes defined by intersections and unions of parallel halfspaces and derive an upper bound on the VC dimension of these classes. This bound was used in theorem 6. A general way of bounding the VC dimension of classes that are constructed from finite intersections and unions has been established by Blumer, Ehrenfeucht, Haussler, and Warmuth (1989). In particular, they show that if we form a new class having as members intersections of s functions from a class with VC dimension d, then the VC dimension of the new class is less than 2ds log(3s) (Blumer et al., 1989, lemma 3.2.3). The following new calculation for parallel halfspaces results in a bound with improved constants.

Theorem 9. The function class consisting of intersections of parallel halfspaces in R^n has VC dimension at most 4n + 1. The same holds for the class of unions of parallel halfspaces.
Proof. The proof is given for intersections; the result then follows for unions by duality. Clearly, it is sufficient to consider intersections of only two parallel halfspaces. We use R^{n+2} to represent the joint parameter domain for the halfspaces. The first n components defining the weights are shared by both halfspaces. The components n + 1 and n + 2 correspond to their separate thresholds.

The main step is to derive an upper bound on the number of dichotomies induced on any set S ⊆ R^n of cardinality m. We assume without loss of generality that S is in general position. (If not, then the elements can be perturbed to obtain a set in general position with a number of dichotomies no less than for the original set. See, e.g., Anthony and Bartlett, 1999, p. 34.)

First, we give an upper bound on the number of dichotomies induced by pairs of parallel halfspaces where each halfspace is nontrivial. Here, we say that a halfspace is trivial if it induces one of the dichotomies (∅, S) or (S, ∅). Such a bound is obtained in terms of the number of connected components into which the parameter domain is partitioned by certain hyperplanes arising from the elements of S. Every input vector (s_1, …, s_n) ∈ S gives rise to the two hyperplanes

    { x ∈ R^{n+2} : s_1 x_1 + ⋯ + s_n x_n − x_{n+1} = 0 },
    { x ∈ R^{n+2} : s_1 x_1 + ⋯ + s_n x_n − x_{n+2} = 0 },

that is, their representations in R^{n+2} are the vectors (s_1, …, s_n, −1, 0) and (s_1, …, s_n, 0, −1), respectively. All hyperplanes are homogeneous; that is, they pass through the origin. It is clear that for every connected component of R^{n+2} arising from this partition, the two functions induced on S by the pair of halfspaces represented in this way are the same for all vectors belonging to this component. Thus, the number of connected components provides an upper bound on the number of induced dichotomies.

A well-known result attributed to Schläfli (1901) states that m homogeneous hyperplanes in general position partition R^n into exactly

    2 · Σ_{i=0}^{n−1} C(m−1, i)        (A.1)

connected components (see also Anthony and Bartlett, 1999, lemma 3.3). Hence, the set S, giving rise to 2m hyperplanes, partitions R^{n+2} into at most

    2 · Σ_{i=0}^{n+1} C(2m−1, i)        (A.2)
connected components. Not all of them, however, represent pairs of nontrivial halfspaces. For every trivial dichotomy of the first halfspace, we can have as many dichotomies induced by the second halfspace as there are dichotomies possible by a single halfspace in R^{n+1} on a set of cardinality m; the same holds for every trivial dichotomy of the second halfspace. Hence, using Schläfli's (1901) count, equation A.1, we may subtract from equation A.2 the amount of

    4 · ( Σ_{i=0}^{n} C(m−1, i) ) + 4 · ( Σ_{i=0}^{n} C(m−1, i) ) − 4.        (A.3)
The term −4 at the end results from the fact that the four combinations of trivial halfspaces are counted by both sums. Note also that the number given in equation A.3 is not a bound but is precise since S is in general position.

Up to this point, the pairs of halfspaces also include redundant combinations where the intersection is empty or one halfspace is a subset of the other. Clearly, for every nonredundant pair, there are three redundant ones. Therefore, we can exclude the latter, dividing the number by 4. Thus, an upper bound for the number of pairs of nontrivial halfspaces with nonempty intersections is obtained by subtracting equation A.3 from A.2 and dividing the result by 4, giving

    (1/2) · Σ_{i=0}^{n+1} C(2m−1, i) − 2 · ( Σ_{i=0}^{n} C(m−1, i) ) + 1.        (A.4)
That the intersection of two halfspaces is nonempty does not imply that the dichotomy induced on S is nonempty. Therefore, we are allowed to exclude these cases. Each pair of nontrivial halfspaces with empty intersection on S gives rise to two nontrivial dichotomies that can be induced by a single halfspace. Thus, we may subtract half the number of nontrivial dichotomies induced by a single halfspace, which is

    Σ_{i=0}^{n} C(m−1, i) − 1.        (A.5)
Finally, we take those pairs into account where at least one halfspace induces a trivial dichotomy. In this case, the dichotomy can be induced by a single halfspace; that is, we may add the amount given by equation A.1. All in all, an upper bound is provided by equation A.4, minus A.5, plus A.1, yielding

    (1/2) · Σ_{i=0}^{n+1} C(2m−1, i) − ( Σ_{i=0}^{n} C(m−1, i) ) + 2.        (A.6)
Assuming 1 ≤ n + 1 ≤ 2m − 1 without loss of generality, we use the estimates

    (1/2) · Σ_{i=0}^{n+1} C(2m−1, i) < (1/2) · ( e(2m−1)/(n+1) )^{n+1}

(see, e.g., Anthony and Bartlett, 1999, theorem 3.7) and

    Σ_{i=0}^{n} C(m−1, i) ≥ 2,

whence we obtain that the number of dichotomies is less than

    (1/2) · ( e(2m−1)/(n+1) )^{n+1}.
Now suppose that S is shattered. Then all 2^m dichotomies must be induced, which implies that

    2^m < (1/2) · ( e(2m−1)/(n+1) )^{n+1}.        (A.7)

Any a, b > 0 satisfy the inequality ln a ≤ ab + ln(1/b) − 1 (see, e.g., Anthony and Bartlett, 1999, appendix A.1.1). Substituting a = (2m−1)/2 and b = (ln 2)/(2(n+1)) yields

    ln( (2m−1)/2 ) ≤ (2m−1) · ln 2 / (4(n+1)) + ln( 2(n+1)/(e ln 2) ),

implying

    (n+1) · log( (2m−1)/2 ) ≤ m/2 − 1/4 + (n+1) · log( 2(n+1)/(e ln 2) ).
Using this in inequality A.7, it follows that
m
−60 mV) do not burst, but exhibit integrate-and-fire behavior when a depolarizing current is applied. Both of these features are built into the integrate-and-fire-or-burst (IFB) model that we use to model the activity of each neuron of a population. After a brief review of the single-cell equations, we derive the kinetic equation for a population of IFB cells.

2.1 Dynamical Equations for an IFB Neuron. The IFB model introduced in Smith et al. (2000) includes a Hodgkin-Huxley type equation for the voltage V,

    C dV/dt = I − I_L − I_T.        (2.1)
Here, I is an applied current, I_L is a leakage current of the form

    I_L = g_L (V − V_L),        (2.2)
and the current I_T couples the dynamics of V to the calcium conductance variable h,

    I_T = g_T m_∞ h (V − V_T),        (2.3)
A Population Study of Integrate-and-Fire-or-Burst Neurons
961
where m_∞(V) is an activation function for the Ca²⁺ channel, and V_L and V_T are the reversal potentials for the leakage and calcium ions. For simplicity, m_∞ is represented as a Heaviside function:

    m_∞(V) = H(V − V_h) = { 1  (V > V_h);  0  (V < V_h) }.        (2.4)
The dynamics of the Ca²⁺ current, which typically varies on long timescales relative to the time course of a fast sodium spike (≤ 4 ms), is given in the IFB model by

    dh/dt = { −h/τ_h⁻  (V > V_h);  (1 − h)/τ_h⁺  (V < V_h) },        (2.5)
so h always approaches either zero or one. The parameter Vh divides the V axis into a hyperpolarizing region (V < Vh ), where the calcium current is deinactivated, and a nonhyperpolarizing region (V > Vh ) in which the calcium current is inactivated. The timescale th¡ sets the duration of a burst event, and thC controls the inactivation rate. In accordance with the literature, thC À th¡ (Smith et al., 2000). The leakage reversal potential VL (¼ ¡65 mV) sets the rest voltage in the absence of a stimulus, and for mammalian LGN cells in vitro, VL typically lies below the potential Vh (¼ ¡60mV) at which burst behavior is observed (Jahnsen & Llinas, 1982). The reversal potential VT (¼ 120 mV) for the calcium ions is relatively large and causes rapid depolarization once the T-channels are activated. On crossing the ring-threshold voltage, Vh ¼ ¡35 mV, the cell res and the membrane potential is reset to Vr > Vh , where Vr ¼ ¡50 mV is typical. Consequently, the ve voltage parameters of the IFB model satisfy the relation VL < Vh < Vr < Vh < VT .
(2.6)
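Equations 2.1 through 2.5, together with the fire-and-reset rule, are simple enough to integrate directly. The sketch below (plain Python; the function and variable names are illustrative, the parameter values are those of Table 1) drives the cell with a small constant current from just above $V_h$ with the T-current fully deinactivated, producing a calcium-driven burst followed by relaxation to the fixed point $V_{eq} = V_L + I/g_L \approx -63.6$ mV:

```python
# Forward-Euler sketch of the single-cell IFB equations (2.1)-(2.5).
# Units: mV, ms, uA/cm^2, mS/cm^2, uF/cm^2; parameter values from Table 1.
C, gL, gT = 2.0, 0.035, 0.07
VT, VL, Vtheta, Vr, Vh = 120.0, -65.0, -35.0, -50.0, -60.0
tau_minus, tau_plus = 20.0, 100.0          # tau_h^-, tau_h^+ (ms)

def simulate(I, V0, h0, T=2000.0, dt=0.01):
    """Integrate the IFB neuron under a constant current I.

    Returns (spike count, final V, final h)."""
    V, h, spikes = V0, h0, 0
    for _ in range(int(T / dt)):
        IL = gL * (V - VL)                                  # eq. (2.2)
        IT = gT * (1.0 if V > Vh else 0.0) * h * (V - VT)   # eqs. (2.3)-(2.4)
        V += dt * (I - IL - IT) / C                         # eq. (2.1)
        h += dt * (-h / tau_minus if V > Vh else (1.0 - h) / tau_plus)  # eq. (2.5)
        if V >= Vtheta:                                     # fire and reset
            spikes += 1
            V = Vr
    return spikes, V, h

# Subthreshold current: a burst rides the T-current, then the cell settles.
n_spikes, V_end, h_end = simulate(I=0.05, V0=-59.9, h0=1.0)
```

The exact burst length depends on the integration details, but in the parameter neighborhood of Table 1 the model produces on the order of 2 to 10 spikes per burst, as stated in section 3, and the trajectory then settles into the deinactivated rest state ($h \to 1$).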
The calcium variable $h$ ranges between 0 and 1. For relatively large values of $h$, a cell will burst due to the calcium channel coupling term and will produce fast trajectories in $V$ with rapid resets. For relatively low values of $h$, given a large enough input current, tonic firing occurs. Figure 1 illustrates these dynamic features of the IFB model. In this simulation, the neuron was driven by Poisson-distributed synaptic events that increased the membrane potential by $\varepsilon = 1.5$ mV with each occurrence. The driving rate was chosen too small for the cell to fire in tonic mode. Instead, the cell drifted randomly near the threshold $V_h = -60$ mV for Ca²⁺ deinactivation. Bursts occurred
962
A.R.R. Casti et al.
Figure 1: Simulation of a single IFB neuron driven by Poisson-distributed synaptic events with fixed mean arrival rate $\sigma_0 = I/(C\varepsilon)$, with $I = 0.08$ μA/cm², $\varepsilon = 1.5$ mV, and all other parameters as in Table 1. (Top) Time series for $V$ (solid line) and $h$ (dashed line). The horizontal line demarcates the threshold $V = V_h$. (Bottom) Phase-plane orbit. The asterisk locates the initial condition $(V_0, h_0) = (-61, 0.01)$. In the time span shown, there were two burst events, the first containing three spikes and the other four.
whenever the calcium channel deinactivated and enough clustered synaptic events, in collaboration with the T-channel current, conspired to drive the neuron past the firing threshold.

2.2 Population Dynamics. Our aim is to study the behavior of a population of excitatory IFB neurons. A general approach to this problem, in the limit of a continuous distribution of neurons, is presented in Knight et al. (1996), Knight (2000), Nykamp and Tranchina (2000), and Omurtag, Knight, et al. (2000). Based on the development in the cited references, the number of neurons in a state $v = (V, h)$ at time $t$ is described by a probability density, $\rho(V, h, t)$, whose dynamics respects conservation of probability. The evolution equation for $\rho$ thus takes the form of a conservation law,

$$\frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial v} \cdot J. \tag{2.7}$$
The probability flux,

$$J = J_S + J_\sigma \equiv (J_V, J_h), \tag{2.8}$$

is split into two parts. The first part is a streaming flux,

$$J_S = F(v)\rho, \tag{2.9}$$

due to the direction field of the single-neuron dynamical system, equations 2.1 through 2.5, where

$$F(v) = \left( -C^{-1}[I_L + I_T],\ \frac{1-h}{\tau_h^+}\, H(V_h - V) - \frac{h}{\tau_h^-}\, H(V - V_h) \right) \equiv (F_V, F_h). \tag{2.10}$$

The applied current enters our analysis as a stochastic arrival term in equation 2.7, with an arrival rate $\sigma(t)$, so that each arrival elevates the membrane voltage by an increment $\varepsilon$. Written in terms of an excitation flux, this is expressed as

$$J_\sigma = \hat{e}_V \, \sigma(t) \int_{V-\varepsilon}^{V} \rho(\tilde V, h, t)\, d\tilde V, \tag{2.11}$$

where $\hat{e}_V$ is a unit vector pointing along the voltage direction in the $(V, h)$ phase space. This states that the probability current in the voltage direction, across the voltage $V$, comes from all population members whose voltages range below $V$ by an amount not exceeding the jump voltage $\varepsilon$. The assumptions underlying equation 2.11 are detailed in Omurtag, Knight, and Sirovich (2000). With the flux definitions 2.9 and 2.11, the population density $\rho(V, h, t)$ evolves according to

$$\frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial v} \cdot [F(v)\rho] - \sigma(t)\,[\rho(V, h, t) - \rho(V - \varepsilon, h, t)]. \tag{2.12}$$
Although a realistic synaptic arrival initiates a continuous conductance change, this effect is well approximated by a jump of size $\varepsilon$ in the membrane potential. Thus, we see in equation 2.12 a loss term proportional to $\rho(V, h, t)$ and a gain term proportional to $\rho(V - \varepsilon, h, t)$ due to synaptic events. Hereafter, this is referred to as the finite-jump model. For purposes of exposition, we restrict attention to excitatory current inputs. We note that (like Smith et al., 2000) we are investigating the dynamics of the thalamic cell alone, and not the integrated dynamics of the retinal ganglion cell and LGN cell pair. This involves one simplifying departure
from the actual physiological situation. For the input current $I$, Smith et al. (2000) use a specified smooth function of time. Our population equation 2.12 goes a bit further by including the stochastic nature of synaptic arrivals, which are treated as uncorrelated with LGN cell activity. This should be in fair accord with the physiology for nonretinal input that is many-to-one but is only an earliest approximation for the one-to-one retinal input.

2.3 Population Firing Rate. A response variable of interest is the average firing rate of individual neurons in the population, $r(t)$. In general, two input sources drive any cell of a particular population: synaptic events arriving at a rate $\sigma_0(t)$ that arise from the external neural input and synaptic events resulting from feedback within the population. We will ignore feedback in the interest of simplicity and take the input to be purely of external origin (see Omurtag, Knight, and Sirovich, 2000, for the more general treatment). The population will be assumed homogeneous, so that each neuron is driven equally at the input rate $\sigma = \sigma_0(t)$. In the case of stochastic external input, $\sigma_0(t)$ is the ensemble mean arrival rate of external nerve impulses,

$$\sigma_0(t) = \lim_{\Delta t \downarrow 0} \frac{1}{\Delta t} \left\langle \int_t^{t+\Delta t} dt' \sum_{n=1}^{\infty} \delta(t' - t_n^0) \right\rangle, \tag{2.13}$$
where $\delta(t)$ is the Dirac delta function, $\langle \cdot \rangle$ denotes an ensemble average, and $\{t_n^0\}_{n=1}^{\infty}$ is a set of spike arrival times. The population firing rate, $r(t)$, is determined by the rate at which cells cross the threshold $V_\theta$. This may be expressed as a function of the voltage-direction flux at threshold integrated over all calcium channel activation states $h$,

$$r(t) = \int_0^1 dh\, J_V(V_\theta, h, t) = \int_0^1 dh\, F_V(V_\theta, h)\,\rho(V_\theta, h, t) + \sigma_0(t) \int_0^1 \int_{V_\theta - \varepsilon}^{V_\theta} dh\, d\tilde V\, \rho(\tilde V, h, t). \tag{2.14}$$
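The ensemble average in equation 2.13, and the firing rate it drives, can be made concrete with a small Monte Carlo sketch of the direct numerical simulation (DNS) used for comparison in section 3: a discrete set of IFB neurons, each receiving its own Poisson train of voltage jumps of size $\varepsilon$ at mean rate $\sigma_0$. All names and the particular values of $\sigma_0$ and $\varepsilon$ below are illustrative; the equivalent mean current $C\sigma_0\varepsilon = 1.5$ μA/cm² puts the ensemble in tonic mode:

```python
import numpy as np

# Monte Carlo sketch of the DNS: an ensemble of IFB neurons, each driven
# by its own Poisson train of voltage jumps (illustrative values).
rng = np.random.default_rng(0)
C, gL, gT = 2.0, 0.035, 0.07
VT, VL, Vtheta, Vr, Vh = 120.0, -65.0, -35.0, -50.0, -60.0
tau_minus, tau_plus = 20.0, 100.0

N, dt, T = 400, 0.05, 400.0            # neurons, step (ms), duration (ms)
sigma0, eps = 0.75, 1.0                # arrival rate (1/ms), jump size (mV)

V = np.full(N, Vr)
h = np.zeros(N)
spikes = 0
for _ in range(int(T / dt)):
    IT = gT * (V > Vh) * h * (V - VT)
    V += dt * (-(gL * (V - VL) + IT)) / C
    h += dt * np.where(V < Vh, (1.0 - h) / tau_plus, -h / tau_minus)
    V += eps * (rng.random(N) < sigma0 * dt)   # at most one arrival per bin
    fired = V >= Vtheta
    spikes += int(fired.sum())
    V[fired] = Vr

rate_hz = 1000.0 * spikes / (N * T)    # ensemble estimate of the mean of r(t)
```

At this drive the ensemble fires tonically, and the estimate lands near the single-neuron rate of equation 3.8 evaluated at the equivalent mean current; the fluctuations of the estimate shrink like $1/\sqrt{N}$, which is the DNS scaling discussed in section 3.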
2.4 Boundary Conditions. Boundary conditions on $\rho$ are chosen to conserve total probability over the phase-space domain $\mathcal{D}$,

$$\int_{\mathcal{D}} \rho(v, t)\, dv = 1. \tag{2.15}$$
From equation 2.7 and an application of the divergence theorem, it follows that the boundary flux integrates to zero,

$$\oint_{\partial \mathcal{D}} \hat{n} \cdot J \, dS = 0, \tag{2.16}$$
where $\hat{n}$ is a boundary normal vector. Our domain of interest is the box $\mathcal{D} = \{V_L \leq V \leq V_\theta,\ 0 \leq h \leq 1\}$, for which it is appropriate to choose the no-flux boundary conditions

$$\hat{n} \cdot J = 0 \quad \text{at } V = V_L,\ h = 0, 1, \tag{2.17}$$

so that the outward flux vanishes at each boundary face except at the threshold boundary $V = V_\theta$. To handle the voltage offset term, $\rho(V - \varepsilon, h, t)$, in the population equation 2.12, which requires evaluation of the density at points outside the box, it is natural to choose $\rho$ to vanish at all points outside $\mathcal{D}$, and we set

$$\rho(v, t) = 0, \quad V \notin \mathcal{D}. \tag{2.18}$$
In addition, there is a reset condition that reintroduces the flux at $V = V_\theta$ back into the domain at the reset $V = V_r$. This may be incorporated directly into the population equation 2.7 by means of a delta function,

$$\frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial v} \cdot J + J_V(V_\theta, h, t)\, \delta(V - V_r). \tag{2.19}$$
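The finite-jump equation 2.12, with the reset source of equation 2.19 and the no-flux conditions of equation 2.17, can be integrated with a conservative upwind scheme on a $(V, h)$ grid. The sketch below is illustrative, not the authors' code: grid sizes, rates, and all names are assumptions, and the jump $\varepsilon$ is rounded to a whole number of cells. Probability that streams or jumps through $V_\theta$ is rerouted to the reset cell, so total probability is conserved by construction:

```python
import numpy as np

# Upwind finite-volume sketch of eqs. (2.12), (2.17), (2.19); Table 1 values.
C, gL, gT = 2.0, 0.035, 0.07
VT, VL, Vtheta, Vr, Vh = 120.0, -65.0, -35.0, -50.0, -60.0
tau_minus, tau_plus = 20.0, 100.0

NV, Nh = 120, 40
V = np.linspace(VL, Vtheta, NV)
h = np.linspace(0.0, 1.0, Nh)
dV, dh = V[1] - V[0], h[1] - h[0]

eps, sigma0 = 1.5, 0.5                 # jump size (mV), arrival rate (1/ms)
k = max(1, int(round(eps / dV)))       # jump measured in grid cells
ir = int(round((Vr - VL) / dV))        # reset cell index

# Streaming field (2.10); the applied current enters only through the jumps.
FV = -(gL * (V[:, None] - VL)
       + gT * (V[:, None] > Vh) * h[None, :] * (V[:, None] - VT)) / C
Fh = np.where(V[:, None] < Vh,
              (1.0 - h[None, :]) / tau_plus, -h[None, :] / tau_minus)

rho = np.zeros((NV, Nh))
rho[ir, 0] = 1.0 / (dV * dh)           # all mass starts at (Vr, h = 0)

def step(rho, dt):
    """One explicit step; returns (new rho, firing rate of eq. (2.14))."""
    FVe = 0.5 * (FV[1:, :] + FV[:-1, :])
    fV = np.where(FVe > 0, FVe * rho[:-1, :], FVe * rho[1:, :])
    out = np.maximum(FV[-1, :], 0.0) * rho[-1, :]   # streaming flux at V_theta
    d = np.zeros_like(rho)
    d[0, :] -= fV[0, :] / dV                        # no-flux at V = V_L
    d[1:-1, :] += (fV[:-1, :] - fV[1:, :]) / dV
    d[-1, :] += (fV[-1, :] - out) / dV
    Fhe = 0.5 * (Fh[:, 1:] + Fh[:, :-1])
    fh = np.where(Fhe > 0, Fhe * rho[:, :-1], Fhe * rho[:, 1:])
    d[:, 0] -= fh[:, 0] / dh                        # no-flux at h = 0 and h = 1
    d[:, 1:-1] += (fh[:, :-1] - fh[:, 1:]) / dh
    d[:, -1] += fh[:, -1] / dh
    # Jumps: uniform loss, gain shifted k cells up in V; anything jumping past
    # V_theta, like the streaming threshold flux, reappears at the reset cell.
    d -= sigma0 * rho
    d[k:, :] += sigma0 * rho[:-k, :]
    d[ir, :] += sigma0 * rho[-k:, :].sum(axis=0) + out / dV
    r = out.sum() * dh + sigma0 * rho[-k:, :].sum() * dV * dh
    return rho + dt * d, r

dt = 0.01
for _ in range(12000):                 # 120 ms of evolution
    rho, rate = step(rho, dt)
mass = rho.sum() * dV * dh             # should remain 1 (eq. 2.15)
```

With these illustrative values the equivalent mean current is $C\sigma_0\varepsilon = 1.5$ μA/cm², so the density relaxes toward a tonic-mode equilibrium of the kind shown later in Figure 4, and the returned rate is the discrete version of equation 2.14.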
It is worth noting that the threshold flux is reset pointwise in $h$. This is consistent with the slow temporal variation of the calcium channel, and we assume that $h$ does not change appreciably during the time between a cell firing a spike and its subsequent fast relaxation to reset. Under these conditions, with the total flux $J$ suitably redefined to include the delta function source, probability is conserved and equation 2.16 is satisfied. Equation 2.19, with the boundary conditions 2.17 and 2.18, forms the complete mathematical specification of the model for a single population.

2.5 Diffusion Approximation. An approximation that lends itself to simpler analysis is the small-jump limit, $\varepsilon \to 0$. Upon expanding the density, $\rho$, in the excitation flux equation 2.11 through second-order terms, one obtains

$$\rho(\tilde V, h, t) = \rho(V, h, t) + \delta V\, \frac{\partial \rho}{\partial V}(V, h, t) + O(\delta V^2) \qquad (\tilde V = V + \delta V), \tag{2.20}$$

$$\hat{e}_V \cdot J_\sigma = \sigma \int_{-\varepsilon}^{0} d(\delta V) \left[ \rho(V, h, t) + \delta V\, \frac{\partial \rho}{\partial V}(V, h, t) + \cdots \right] \approx \sigma \left[ \varepsilon\, \rho(V, h, t) - \frac{\varepsilon^2}{2} \frac{\partial \rho}{\partial V}(V, h, t) \right]. \tag{2.21}$$
Substitution of equation 2.21 into 2.7 gives the diffusion approximation (the Fokker-Planck equation),

$$\frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial v} \cdot (F + \sigma \varepsilon\, \hat{e}_V)\rho + \frac{\sigma \varepsilon^2}{2} \frac{\partial^2 \rho}{\partial V^2} + J_V(V_\theta, h, t)\, \delta(V - V_r), \tag{2.22}$$
Table 1: Model Parameters.

$\tau_h^-$ = $2 \times 10^{-2}$ s
$\tau_h^+$ = $10^{-1}$ s
$g_L$ = $3.5 \times 10^{-2}$ mS/cm²
$g_T$ = $7 \times 10^{-2}$ mS/cm²
$V_T$ = 120 mV
$V_L$ = −65 mV
$V_\theta$ = −35 mV
$V_r$ = −50 mV
$V_h$ = −60 mV
$C$ = 2 μF/cm²
where the approximate threshold flux is

$$J_V(V_\theta, h, t) = \left[ F_V(V_\theta, h) + \sigma \varepsilon \right] \rho(V_\theta, h, t) - \frac{\sigma \varepsilon^2}{2} \frac{\partial \rho}{\partial V}(V_\theta, h, t). \tag{2.23}$$
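The quality of the truncation in equations 2.20 and 2.21 is easy to check numerically. The sketch below (illustrative names; a Gaussian chosen purely as a test density, with $\sigma(t)$ set to 1) compares the exact excitation flux $\int_{V-\varepsilon}^{V}\rho\,d\tilde V$ of equation 2.11 against the second-order form $\varepsilon\rho - (\varepsilon^2/2)\,\partial\rho/\partial V$:

```python
import math

# Check of the truncation (2.20)-(2.21) on a Gaussian test density.
def rho(V, s=5.0):
    """Gaussian test density with width s (mV)."""
    return math.exp(-V * V / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def drho(V, s=5.0):
    return -V / (s * s) * rho(V, s)

def exact_flux(V, eps, n=1000):
    """Excitation flux of eq. (2.11): integral of rho over [V - eps, V]."""
    w = eps / n
    return sum(rho(V - eps + (i + 0.5) * w) for i in range(n)) * w

def approx_flux(V, eps):
    """Second-order form of eq. (2.21)."""
    return eps * rho(V) - 0.5 * eps * eps * drho(V)

V0, eps = 2.0, 0.1
err2 = abs(exact_flux(V0, eps) - approx_flux(V0, eps))   # second-order error
err1 = abs(exact_flux(V0, eps) - eps * rho(V0))          # first-order error
```

For a jump that is small compared to the scale on which the density varies, the residual of the second-order form is $O(\varepsilon^3)$ and sits well below the first-order error, which is the regime in which the diffusion approximation is trustworthy.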
The diffusion approximation is also useful for comparison with the finite-jump model, and this will be done in section 3.2 (also see Sirovich et al., 2000). A further simplification of the diffusion approximation is valid in the $\varepsilon \to 0$ limit with $\sigma\varepsilon$ finite. This special limit reduces equation 2.22 to a pure advection equation for which exact solutions can be derived (see section 3.1). Physically, this approximation is appropriate when input spike rates are extremely large and the evoked postsynaptic potentials are small. This would be the case, for example, when DC input overwhelms stochastic fluctuations.

3 Results

In Table 1 we show the parameter values used in our simulations, following Smith et al. (2000), whose choices are based on experimental data from cat LGN cells. In particular, when the cell is in burst mode, the parameter subspace near these values produces 2 to 10 spikes per burst. We tested the population simulation with a variety of external inputs. In all cases, we compared the accuracy against a direct numerical simulation (DNS), in which equations 2.1 and 2.5 were numerically integrated for a discrete network of $10^4$ neurons, each driven by Poisson-distributed action potentials involving the same voltage elevation, $\varepsilon$, with the same mean arrival rate, $\sigma_0$, as in the population simulation. As explained in the appendix, a population simulation is sensitive to finite grid effects that do not affect the DNS, and for this reason we treat the DNS as the standard for comparison. As in Omurtag, Knight, and Sirovich (2000), the two approaches converge for a sufficiently fine mesh. For instance, as seen in Figure 6, the firing-rate response curve to a step input current generated by the DNS overshoots the converged mean firing rate of
the population simulation by about 20%. These simulations took roughly the same amount of computation time, but we note that as the number of neurons of the DNS increases, the population simulation becomes far more computationally efficient. Since the fluctuations in the DNS about the true solution scale inversely proportional to the square root of the number of neurons, one would need four times as many neurons to reduce the error by half. For an uncoupled population, the computation time would increase by a factor of about 4. The increase in computation time for the DNS is even greater when the neurons are coupled, whereas the population simulation demands no extra computation time.

3.1 An Exactly Solvable Case. To compare the results of the population simulation with an exactly solvable problem, we first considered the case of constant driving, $\sigma_0 \varepsilon = I/C$, in the diffusion approximation 2.22 of the
population equation. We further suppose that the diffusive term, $\frac{\sigma_0 \varepsilon^2}{2} \frac{\partial^2 \rho}{\partial V^2}$, is negligible, a justifiable simplification if the voltage increment $\varepsilon$ arising from a random spike input is small but $\sigma_0 \varepsilon$ finite. This case, which may also be viewed as a population model for which the external driving is noiseless, then reduces to the pure advection equation,

$$\frac{\partial \rho}{\partial t} = -\frac{\partial}{\partial V}\left[\left(F_V + \frac{I}{C}\right)\rho\right] - \frac{\partial (F_h \rho)}{\partial h} + \delta(V - V_r)\, J_V(V_\theta, h, t). \tag{3.1}$$
Here $F = (F_V, F_h)$ is defined by equation 2.10. This equation can be solved exactly by the method of characteristics. In the absence of diffusive smearing of the density field by stochastic effects, the population density $\rho$ traces the single-neuron orbit defined by equations 2.1 and 2.5. Upon dividing these equations, we obtain the characteristic trace equation,

$$\frac{dV}{dh} = \frac{F_V + I/C}{F_h}. \tag{3.2}$$
Integrating equation 3.2 in the Ca²⁺-inactivated region ($V > V_h$) gives

$$V = \left(\frac{h}{h_0}\right)^{\gamma_L} e^{\gamma_T (h - h_0)}\, V_0 + (\gamma_T h)^{\gamma_L}\, e^{\gamma_T h} \left[ \frac{\tau_h^-}{C}(I + g_L V_L)\,\bigl(\Gamma(-\gamma_L, \gamma_T h) - \Gamma(-\gamma_L, \gamma_T h_0)\bigr) + V_T \bigl(\Gamma(1-\gamma_L, \gamma_T h) - \Gamma(1-\gamma_L, \gamma_T h_0)\bigr) \right], \tag{3.3}$$

where $\gamma_T = g_T \tau_h^- / C$, $\gamma_L = g_L \tau_h^- / C$, $\Gamma(a, z) = \int_z^\infty t^{a-1} e^{-t}\, dt$ is the incomplete gamma function (Abramowitz & Stegun, 1972), and $h_0$ defines the starting point of a
Figure 2: Characteristic traces (see equations 3.3 and 3.4) with an input current $I = 0.05$ μA/cm², which leads to a burst of four spikes. The asterisk in the upper left corner at $(V_{eq}, h_{eq}) = (-63.6, 1)$ indicates the fixed point to which the neuron will eventually settle. The reset condition has been added manually.
particular characteristic curve, and $V_0 = V(h_0)$. In the Ca²⁺-deinactivated region, the dynamics of $V$ and $h$ are uncoupled, and we find

$$V = V_L + \frac{I}{g_L} + \left( V_0 - V_L - \frac{I}{g_L} \right) \left( \frac{h-1}{h_0-1} \right)^{g_L \tau_h^+ / C}. \tag{3.4}$$
Characteristic lines for the case of a calcium-driven burst event, using the initial point $(V_0, h_0) = (V_h, 1)$, are shown in Figure 2. Because the input is nonstochastic and subthreshold, a neuron driven with this small current ($I = 0.05$ μA/cm²) equilibrates at the fixed point $(V_{eq}, h_{eq}) = (-63.6, 1)$ (marked by the asterisk). Before reaching the equilibrium, a burst of four spikes, preceded by a low-threshold calcium spike, was fired before the calcium channels fully deinactivated. Stochastic input, explored in section 3.2, has the effect of increasing the average firing rate of a population of IFB cells due to additional depolarizing input from random, excitatory spike inputs.
However, the number of spikes per burst event generally remains the same because the IFB dynamics are dominated by the calcium current when $V > V_h$ and $h > 0$ (see Figure 5 for comparison). If the input current is large, each neuron fires in a classic integrate-and-fire (Omurtag, Knight, and Sirovich, 2000) tonic mode and the calcium channel equilibrates to the inactivated state $h = 0$. In this case, the equilibrium potential, $V_{eq}$, of the average cell is given by setting $dV/dt = 0$ in equation 2.1, with the result

$$V_{eq} = \frac{I}{g_L} + V_L. \tag{3.5}$$

The population is driven past threshold when $I > I_{crit} = g_L (V_\theta - V_L)$, in which case each neuron of the population fires a periodic train of spikes with a time-averaged firing rate that eventually equilibrates to a constant value $\bar r$,

$$\bar r = \frac{1}{T} \int_{t-T}^{t} r(s)\, ds, \tag{3.6}$$

with $r(t)$ given by

$$r(t) = \int_0^1 dh\, F_V(V_\theta, h)\, \rho(V_\theta, h, t). \tag{3.7}$$

This time-averaged firing rate $\bar r$ is independent of the reference time, $t$, provided $t$ is chosen large enough for each member of the population to have equilibrated to its limit cycle. The interval $T$ is chosen large enough to encompass many traversals of a given cell from its reset potential through the threshold. To test the accuracy of the population simulation, we compared the time-averaged population firing rate, $\bar r$, with the single-neuron firing rate, $f(I)$, which is obtained from equation 2.1 upon integrating through the interval $[V_r, V_\theta]$:

$$f(I) = \left[ \frac{C}{g_L} \ln \left( \frac{V_r - V_L - I/g_L}{V_\theta - V_L - I/g_L} \right) \right]^{-1}. \tag{3.8}$$
In Figure 3 we plot the exact result (see equation 3.8) and simulation results, and see that they are in excellent agreement.

3.2 Constant Stochastic Input. Irregularity in the arrival times of the external driving introduces stochasticity into the population evolution, which is modeled either as diffusion (see equation 2.22) or finite jumps in the membrane voltage (see equation 2.12). In either case, there is a finite chance that a cell will be driven through the threshold, $V_\theta$, even if the mean current
Figure 3: Comparison of the exact analytical firing rate (see equation 3.8) (solid line) for a neuron driven by a nonstochastic current with the population firing rate obtained from simulation on a 50 × 50 grid (asterisks). The population firing rate of the finite-jump model with stochastic driving ($\varepsilon = 0.5$ mV) at an equivalent Poisson rate is included for comparison (circles). Horizontal axis: applied current $I$ (μA/cm²). Vertical axis: firing rate $r(t)$ (Hz).
is not strong enough to push the average cell through threshold. Thus, we expect to see higher firing rates compared to the case of purely deterministic driving (see section 3.1). The population results are in accord with this expectation, as shown in Figure 3. A notable feature of stochastic input is a nonvanishing firing rate for driving currents below the threshold. In Figure 3, this effect appears in the equilibrium firing-rate curve as a bump peaked at $I = 0.175$ μA/cm² for the parameter values stated in Table 1. This bump is a consequence of low-threshold calcium spiking events. If the cell's resting potential lies near the calcium channel activation threshold, $V_{eq} \approx V_h$, which occurs if the input rate satisfies $\sigma_0 \approx g_L (V_h - V_L)/(C\varepsilon)$ ($\sigma_0 \approx 0.0875$ ms⁻¹), then random walks in voltage, in cooperation with activated calcium currents, occasionally drive neurons
Figure 4: Comparison of the time-independent equilibrium distributions for a population firing in tonic mode. The figure is plotted in the $h = 0$ plane. Solid line: finite-jump numerical solution. Dash-dot line: diffusion approximation. The numerically generated distribution, equation 2.22, and the analytical solution, equation 3.9, for the diffusion approximation are imperceptibly different. Parameters: $\sigma_0 = 0.5$ ms⁻¹, $\varepsilon = 1.5$ mV, 150 grid points in $V$ and $h$ (all other parameters given by Table 1).
through the threshold $V_h$. If the average resting membrane potential lies too far above $V_h$ but still well below the threshold, then the calcium currents are rarely deinactivated for a sufficient duration to trigger the low-threshold calcium spike. With a fixed Poisson arrival rate, the population always achieves a time-independent equilibrium whose characteristic features hinge on whether the population is firing in tonic or burst mode. The tonic mode for any individual LGN cell is typified by an uninterrupted sequence of independently generated spikes, all occurring in the calcium-inactivated state (McCormick & Feeser, 1990). The equilibrium profile for a tonic-spiking population thus lies in a plane cutting through $h = 0$ in the two-dimensional phase space (see Figure 4). By contrast, a cell in burst mode fires clusters of calcium-triggered
Figure 5: Density plot of the numerically generated equilibrium solution for a population of continuously bursting cells ($\log \rho$ is plotted in the $V$–$h$ plane with color scale indicated on the right). Parameters: 50 × 50 grid resolution, $\varepsilon = 1$ mV, $\sigma_0 = 0.025$ ms⁻¹, and all other parameters as in Table 1. The more jagged features of the density distribution are numerical artifacts owing to the modest resolution.
spikes followed by a refractory period on the order of 100 ms. The burst cycle repeats provided that any depolarizing input is small enough to allow the cell to rehyperpolarize below the calcium deinactivation threshold potential. Consequently, the density profile for a repetitively bursting population is spread throughout the phase space (see Figure 5).

3.2.1 Tonic Spiking. Since tonic spiking cells have inactivated calcium currents ($h = 0$), we may obtain an analytical expression for the equilibrium solution by solving the time-independent population equation in the absence of $h$-dependent dynamics. For ease of comparison with the numerical results, we focus on the more analytically tractable diffusion approximation,
equation 2.22, which has the equilibrium solution

$$\rho_{eq}(V) = \frac{2 C J_\theta}{I \varepsilon}\, e^{bV - \frac{a}{2}V^2} \int_V^{V_\theta} e^{-bs + \frac{a}{2}s^2}\, H(s - V_r)\, ds, \tag{3.9}$$

where $a = 2 g_L/(I\varepsilon)$, $b = (2/\varepsilon)(1 + g_L V_L / I)$, and $H(V - V_r)$ is the Heaviside function. The equilibrium firing rate, $J_\theta = J_V(V_\theta, 0, t)$, is determined self-consistently from the normalization condition, equation 2.15. The exact solution, equation 3.9, is virtually identical to the simulation result. It is also seen that the diffusion approximation, when compared to the finite-jump case, has the effect of smoothing the population distribution. The displacement of the peak toward a lower voltage occurs because the diffusion approximation, obtained by a truncated Taylor series of the density $\rho$, does not correctly capture the boundary layer near $V = V_\theta$ unless the input current is close to the threshold for tonic spiking. This issue, and others related to the equilibrium profile for a population of integrate-and-fire neurons, has been explored more extensively in Sirovich et al. (2000).
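Equation 3.9 can be checked directly: substituting it into the stationary flux $J = (F_V + \sigma_0\varepsilon)\rho - (\sigma_0\varepsilon^2/2)\,\partial_V \rho$ should give $J = J_\theta$ above the reset and $J = 0$ below it. A numerical sketch of this check (names and drive values illustrative; the prefactor is normalized so that $J_\theta / D = 1$, with $D = I\varepsilon/2C$):

```python
import numpy as np

# Check that (3.9) solves the stationary diffusion approximation in the
# h = 0 plane: the flux should equal J_theta above V_r and vanish below.
C, gL, VL, Vr, Vtheta = 2.0, 0.035, -65.0, -50.0, -35.0
I, eps = 1.5, 1.5                      # tonic-mode drive (illustrative)
a = 2.0 * gL / (I * eps)
b = (2.0 / eps) * (1.0 + gL * VL / I)
D = I * eps / (2.0 * C)                # sigma0*eps^2/2, with sigma0*eps = I/C

V = np.linspace(VL, Vtheta, 20001)
dV = V[1] - V[0]
g = b * V - 0.5 * a * V ** 2
w = np.exp(-g) * (V >= Vr)             # integrand e^{-g(s)} H(s - Vr)
cw = np.concatenate(([0.0], np.cumsum(0.5 * (w[1:] + w[:-1]) * dV)))
rho = np.exp(g) * (cw[-1] - cw)        # eq. (3.9) with prefactor J_theta/D = 1
Jtheta = D

mu = (I - gL * (V - VL)) / C           # drift F_V + sigma0*eps
J = mu * rho - D * np.gradient(rho, dV)
inner = (V > Vr + 1.0) & (V < Vtheta - 0.5)
err_inner = float(np.max(np.abs(J[inner] / Jtheta - 1.0)))
err_below = float(np.max(np.abs(J[V < Vr - 1.0]))) / Jtheta
```

The flux is constant between the reset and the threshold and vanishes below the reset, which is exactly the piecewise-constant flux implied by the reset source in equation 2.19.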
3.2.2 Burst Firing. For the population to exhibit continuous bursting under statistically steady input, the driving must be small enough so that the average neuron spends most of its time sufficiently hyperpolarized below (or near) the Ca²⁺ activation threshold at $V = V_h$. This situation is achieved when the input spike rate satisfies $\sigma_0 < g_L (V_h - V_L)/(C\varepsilon)$. The representative features of a population in burst mode are illustrated in Figure 5, which shows a numerical simulation of the population equation, 2.19, driven by random, finite voltage jumps. Most of the neuron density equilibrates near the fixed point $(V_{eq}, h_{eq}) = (-63.6, 1)$ of the single-neuron system under the driving condition of the same average current, but steady instead of in jumps. Thus, in Figure 5, one sees the majority of neurons residing in the upper-leftmost portion of the phase space, with deinactivated calcium currents ($h = 1$) but not quite enough input to push them forward through the activation threshold very often. However, because the input is noisy, some cells may randomly walk past $V_h$. This is reflected in the equilibrium profile by the faint stripes in the right half of the phase space (compare with the single-neuron orbit, driven by DC input, of Figure 2). These characteristic stripes indicate bursting cells with four spikes per burst. As Figure 5 shows, a significant number of cells, after going through a burst cycle, also temporarily get stuck in a calcium-inactivated state ($h = 0$) near the activation threshold $V_h$, again owing to stochastic drift.

3.3 Stepped Input Current. An experiment that probes the dynamical behavior of a neuron population involves the stimulation of cells by a step input from one contrast level to another (Mukherjee & Kaplan, 1995). Such experiments give insight into the approach to equilibrium and the degree of
Figure 6: Firing-rate comparison of the direct simulation ($\varepsilon = 1$ mV; solid jagged curve) with population dynamics simulations (dashed and dotted curves) at various grid resolutions in $h$ (50 × 50, 50 × 100, and 50 × 200) for a mean current step from $I = 0.4$ to $I = 1.2$ μA/cm². The step increase began at $t = 200$ ms and terminated at $t = 400$ ms.
nonlinear response under various changes in contrast. For LGN neurons, it is of particular interest to understand the connection between the stimulus, or the absence of one, and the extent to which visual input is faithfully tracked by relay cell activity and sent forth to the cortex. In this section, we examine the firing activity one might observe in such an experiment for a population of LGN cells. The input is stochastic, with a mean driving rate $\sigma_0$ that steps from one constant value to another. The population dynamics for this sort of input to integrate-and-fire cells operating in tonic mode has been explored in Knight et al. (2000).

3.3.1 Current Step Between Two Calcium-Inactivated States. Figure 6 compares the results of a direct numerical simulation with the IFB population model for a mean current step from $I = 0.4$ to $I = 1.2$ μA/cm² (the mean input rate steps from $\sigma_0 = 0.2$ ms⁻¹ to $\sigma_0 = 0.6$ ms⁻¹). In this case, the prestep
equilibrium potential of the average neuron is $V_{eq} \approx -53.6$ mV, which lies several millivolts above the threshold $V_h$ at which the calcium current is deinactivated. Once the current jumps to 1.2 μA/cm², the neurons fire in tonic mode since the associated equilibrium $V_{eq} \approx -30.7$ mV is well above the sodium spiking threshold, $V_\theta = -35$ mV. The response to a range of step inputs that do not promote burst firing is primarily linear. Upon activation of the step at $t = 200$ ms, and after a delay of about 30 ms while the membrane potential moves toward threshold, the firing rate $r(t)$ mimics the input with a step response, aside from a minor overshoot at the peak firing rate. The linearity of the input-output relationship of the tonic mode was verified by a spectral analysis; the transfer function was approximately constant, with only slight deviations at lower frequencies. Further simulations at various input levels revealed that although the input-output relation was not exactly linear, it was far more so than when the population fired in burst mode (as discussed below). The three curves generated by the population simulation correspond to varying grid resolution in the calcium coordinate $h$. The population model and the DNS achieve increasing agreement as the resolution of the population simulation is increased. The reason that the equilibrated firing rate of the population model lies slightly above the mean of the DNS is attributable to finite grid effects, which are further discussed in the appendix.
3.3.2 Current Step from Calcium-Deinactivated State to Calcium-Inactivated State. In Figure 7 is a comparison of the firing rates of the population model at various resolutions with the DNS for a current step from $I = 0.1$ μA/cm² to $I = 1.33$. These current values correspond to an equilibrium potential before the step of $V_{eq} \approx -62.1$ mV (in the absence of Ca²⁺ dynamics), so the calcium channels are initially deinactivated, and a poststep equilibrium potential of $V_{eq} = -27$ mV, which is well above the firing threshold. Compared to the previous case, one expects much higher firing rates at the onset of the current step because a large fraction of cells are poised to burst. This is indeed reflected in the sharper firing-rate peak at the current jump, relative to the equilibrium firing rate to which the population relaxes (compare with Figure 6). Consistent with physiological experiments and analysis of temporal modulation transfer functions (Mukherjee & Kaplan, 1995), the bursting LGN cells in the IFB population nonlinearly modify retinal input and pass to the visual cortex a significantly altered rendition of the stimulus. Numerical simulations at different input levels, all corresponding to the burst mode, verified that the transfer function was not constant in each case. This behavior of the population activity reflects that of the single-neuron IFB model, which Smith et al. (2000) have explored quantitatively for sinusoidal input; we refer to their work for the details.
Figure 7: Firing-rate comparison of the direct simulation ($\varepsilon = 1$ mV; solid jagged curve) with the IFB population model (dashed and dotted curves) at various grid resolutions in $V$ (50 × 50, 100 × 50, and 300 × 50) for a current step from $I = 0.1$ μA/cm² to $I = 1.33$. The onset of the step occurred at $t = 200$ ms.
It is interesting to note that the agreement between the DNS and the population simulation is less favorable for coarse grid resolutions (50 grid points in both $V$ and $h$) than in Figure 6. This is a consequence of finite grid influences that cause some of the population to drift spuriously through the calcium activation transition at $V_h$, an effect that is much more manifest whenever the equilibrium potential of a typical neuron in the population initially lies near $V_h$. This issue is elaborated on in the appendix. Figure 8 demonstrates the increase in accuracy when the prestep voltage equilibrium is well removed from $V_h$. Here, the input was a current step from $I = 0$ μA/cm² to $I = 1.33$, corresponding to a prestep equilibrium point $(V_{eq}, h_{eq}) = (V_L, 1)$ with $I = 0$. The neurons initially equilibrate in the upper-left corner of the phase space, far enough away from $V_h$ to preclude significant finite-grid-influenced drift through the Ca²⁺ activation threshold. As in Figure 6, the agreement between the direct simulation and the population result is excellent at a moderate resolution.
Figure 8: Comparison of firing rates for a step current from $I = 0$ μA/cm² to 1.33. Solid curve: direct simulation with $\varepsilon = 1$ mV. Dashed curve: population dynamics simulation on a 200 × 100 grid.
4 Discussion

To capture the dynamical range of LGN cells that may fire in a burst or tonic mode, a neuron model of two or more state variables is required, since at least two fundamental timescales comprise the intrinsic dynamics of the burst mode: the relatively long interval between bursts (around 100 ms) and the short interspike interval (about 4 ms) of sodium action potentials that ride the low-threshold calcium spike. Using the single-cell IFB model of Smith et al. (2000) as a springboard, this work presents the first simulation of a population equation with a two-dimensional phase space. Most previous studies using the population method focused on the single state-space variable integrate-and-fire model (Knight, 1972; Knight et al., 1996, 2000; Nykamp & Tranchina, 2000; Omurtag, Knight, and Sirovich, 2000; Sirovich et al., 2000). Here, the computationally efficient simulations of an LGN population under a variety of stimulus conditions were seen to be in excellent agreement with direct numerical simulations of an analogous discrete network (see Figures 3, 6, 7, and 8), as well as with special analytical solutions (see Figures 3 and 4).

Although the role of the intrinsic variability of LGN cells in visual processing is still unknown, a large body of experimental evidence indicates that the dual response mode (burst or tonic) has a significant effect on the faithfulness with which a retinal stimulus is transmitted to cortex. This fact may be related to attentional demands. In alert animals, the burst mode, being a more nonlinear response, could serve the purpose of signaling sudden changes in stimulus conditions (Guido & Weyand, 1995; Sherman, 1996). The tonic mode, a nearly linear response mode, presumably takes over when the cortex demands transmission of the details. The population model of IFB neurons mirrors these qualitative features of the dual response modes. For a population initially hyperpolarized below $V_h$, where calcium channels are deinactivated, the response to a step input was large and nonlinear (see Figures 7 and 8). When the population was initialized with inactivated calcium currents at membrane potentials above $V_h$, the firing-rate response tracked the stimulus much more faithfully, indicating a primarily linear response (which a spectral analysis verifies explicitly). Further, an initially hyperpolarized IFB population driven by a large current ($I > 1.05$ μA/cm² for the parameters of Table 1) will fire a burst of calcium-activated spikes and then settle into a tonic firing mode. This is consistent with the experiment of Guido and Weyand (1995), in which relay cells of an awake cat fired in burst mode at stimulus onset during the early fixation phase and then switched to a tonic firing pattern thereafter. The IFB model, being of low dimension, thus shows great promise as an efficient means of simulating realistic LGN activity in models of the visual pathway. Previous simulations of the early stages of visual processing (retina → LGN →
V1) typically did not incorporate LGN dynamics (however, see Tiesinga & JosÂe, 2000). Typically, the LGN input used is a convolved version of the retinal stimulus, which is then relayed to the cortex (McLaughlin, Shapley, Shelley, & Wielaard, 2000; Nykamp & Tranchina, 2000; Omurtag, Kaplan, Knight, & Sirovich, 2000; Somers, Nelson, & Sur, 1995). In such simulations, no account is made of the intrinsic variability of the LGN cells or the effects of feedback from the cortex or other areas. Because the convolution of retinal input with a lter is a linear operation, most models effectively simulate LGN cells in their tonic mode. Yet the burst response mode of relay cells is certainly an important feature of LGN cells whatever the arousal level of the animal in question. Although this may be of less concern in feedforward models used to study orientation tuning in V1, for instance, it is a necessary consideration of any model of cortical activity when stimulus conditions promote strong hyperpolarization for signicant durations, which may arise realistically from variable levels of alertness, or for simulations in which the stimulus is weak.
A Population Study of Integrate-and-Fire-or-Burst Neurons
A dynamically faithful model of relay cell activity, dictated by experiment, is likely needed to assess the role of the massive feedback connections on the cortico-thalamic pathway. There is some evidence that feedback from a layer 6 neuron in V1 to the LGN may play a role in orientation tuning by synchronizing the spiking of relay cells within its receptive field (Sillito, Jones, Gerstein, & West, 1994). This suggests the possibility that cortical cells provide very specific afferent connections to reinforce the activity of the LGN neurons that excited them in the first place. Anatomical support for this has been provided by the experiments of Murphy, Duckett, and Sillito (1999), who observed that the corticofugal axons, though sparse, exhibit localized clustering of their boutons into elongated anatomical "hot spots" that synapse upon a relatively large number of target cells in the LGN and reticular nucleus. They also demonstrated a high correlation between the major axis of the elongated array of boutons and the orientation preference of the cortical cells from which they originated. A plausible conjecture is that the cortico-thalamic feedback serves the purpose of enhancing the response to salient features such as edge orientations in the retinal input. If so, then a more dynamically realistic LGN model than those used to date is called for. In any event, the relative importance of feedback to the LGN, as opposed to intracortical connectivity and feedforward convergence, say, in the tuning of cortical cells to various modalities such as orientation and spatial frequency, is an important issue. Future work with the population method is underway to simulate a simplified version of the thalamocortical loop, with realistic dynamical models for the LGN, layers 4 and 6 of the primary visual cortex, and the inhibitory interneurons of the thalamic reticular nucleus.
The aim will be to study the functional nature of the circuitry that connects the various levels of the early visual pathway and investigate in particular the role that feedback plays in visual pattern analysis.

Appendix: Numerical Methods

A.1 Direct Simulation. The state variable v = (V, h) for each cell in the network is governed by the ordinary differential equation (ODE),

dv/dt = F(v) + ε ê_V Σ_k δ(t − t_k),  (A.1)

where ê_V is the unit vector in the V direction and F(v) is defined by equation 2.10. The solution that corresponds to the streaming motion alone, dv/dt = F(v) ≡ (F_V(v, h), F_h(v, h)), is formally denoted by the time-evolution operator e^{Q^{(1)}(t)}:

v(t) = e^{Q^{(1)}(t)} v(0).  (A.2)
The solution that corresponds to the synaptic input alone can similarly be written as

v(t) = e^{Q^{(2)}(t)} v(0).  (A.3)

Equation A.3 has a simple, explicit form for the case of fixed finite jumps:

e^{Q^{(2)}(t)} v(0) = v(0) + ε n(t).  (A.4)
Here, n(t) is the number of impulses that have arrived during time t. It is an integer chosen randomly from the nth Poisson distribution, P_n(t) = (σt)^n e^{−σt}/n!, with mean arrival rate σ. We are interested in solutions of the form

v(t) = e^{Q^{(1)}(t) + Q^{(2)}(t)} v(0).  (A.5)
According to the Baker-Campbell-Hausdorff lemma (Sakurai, 1994), a second-order accurate splitting of the exponential operator, equation A.5, is

v(t + Δt) = e^{Q^{(1)}(Δt/2)} e^{Q^{(2)}(Δt)} e^{Q^{(1)}(Δt/2)} v(t) + O(Δt³).  (A.6)
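The splitting of equation A.6 can be sketched in a few lines. This is our own minimal illustration, not the authors' code: the streaming halves are advanced with a second-order Runge-Kutta (midpoint) substep, and the jump operator is a Poisson-jump stand-in with hypothetical parameters (`tau`, `eps`, `sigma`).

```python
import numpy as np

def strang_step(v, dt, F, jump):
    """One step of the second-order splitting of equation A.6:
    half streaming step, full jump step, half streaming step."""
    def rk2(x, h):
        # second-order Runge-Kutta (midpoint) substep for dv/dt = F(v)
        return x + h * F(x + 0.5 * h * F(x))
    v = rk2(v, dt / 2)      # exp(Q1(dt/2))
    v = jump(v, dt)         # exp(Q2(dt)): stochastic input over dt
    return rk2(v, dt / 2)   # exp(Q1(dt/2))

# Stand-in dynamics (ours): a leaky integrator dV/dt = -V/tau driven by
# fixed jumps of size eps arriving as a Poisson process of rate sigma.
rng = np.random.default_rng(0)
tau, eps, sigma = 20.0, 1.0, 0.5
F = lambda v: -v / tau
jump = lambda v, dt: v + eps * rng.poisson(sigma * dt)

v = 0.0
for _ in range(1000):
    v = strang_step(v, 0.1, F, jump)
```

With the jump operator disabled, the scheme reduces to plain midpoint integration of the streaming flow, which makes its order of accuracy easy to check.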
We use a second-order Runge-Kutta method for the streaming operator e^{Q^{(1)}}. Thus, the above operator splitting method provides an efficient second-order accurate numerical scheme for integrating equation A.1. However, the streaming direction field of the IFB neuron contains discontinuous changes in the voltage variable V due to the Heaviside calcium-channel activation function, equation 2.4, and the pointwise (in h) reset condition. We now describe how we handle these discontinuities numerically with the same second-order accuracy.

Suppose the discontinuity occurs when v = v_d. After a time step Δt, if v_d is found to lie between v(t) and v(t + Δt), then the integration is performed from the current state up to the discontinuity. Let the state at time t be denoted (v, h). Define v̄ = ½(v_d + v), and let Δt̄ denote the time it takes to move from the present state (v, h) to the state (v̄, h̄), which lies between the present state and the point where the trajectory crosses the discontinuity, (v_d, h*). The time to reach the discontinuity is Δt*. By Taylor expanding the direction field at (v, h), we find

Δt̄ = (v̄ − v)/F_V(v, h) + O(Δt²).  (A.7)

We remark that F_V(v, h) = 0 almost never occurs in the phase plane for the cases of interest. The solution for the calcium channel ODE, equation 2.5, is then written

h̄ = e^{Q_h^{(2)}(Δt̄)} h = e^{Q_h^{(2)}((v_d − v)/(2 F_V(v, h)))} h + O(Δt²).  (A.8)
Next, after the direction field is Taylor expanded about (v̄, h̄), one can show that

v_d = v + F_V(v̄, h̄) Δt* + O(Δt³),  (A.9)

whence

Δt* = (v_d − v) / F_V( ½(v_d + v), e^{Q_h^{(2)}((v_d − v)/(2 F_V(v, h)))} h ) + O(Δt³).  (A.10)

Consequently, the point where the trajectory crosses the discontinuity is found to be

(v_d, h*) = (v_d, e^{Q_h^{(2)}(Δt*)} h + O(Δt³)).  (A.11)
A.2 Population Simulation. Equation 2.7 is linear in ρ. However, due to the boundary conditions in the V-direction (population exiting the phase space at threshold resurfaces in the middle of the grid) and the tendency of the population to pile up at h = 0 and h = 1, it is necessary to use relatively sophisticated methods to integrate the equations. In particular, if we simply discretize the grid and expand the derivative terms to second order, oscillations due to the discretization become amplified when the population piles up at the h boundaries and population flux is reintroduced into the grid by the reset boundary condition. The oscillations then cause the population density to take on negative values in regions of the phase space. For this reason, we employed a second-order total-variation-diminishing (TVD) scheme. The undesirable oscillations in finite-difference-based schemes used to evolve advection equations may be overcome by the TVD algorithm (Hirsch, 1992).

The scheme that we use for the advective term in the population equation 2.12 is a second-order upwind method, for which we describe here the one-dimensional version. The evolution of the conserved variable, ρ, is governed by

dρ_i/dt = −(δ⁻/Δx)[ f^{*(R)}_{i+1/2} + ½ ψ⁺_{i−1/2} (f_i − f^{*(R)}_{i−1/2}) + ½ ψ⁻_{i+3/2} (f_{i+1} − f^{*(R)}_{i+3/2}) ],  (A.12)

where an i subscript indicates the ith grid zone, f_i = (Fρ)_i is a numerical approximation of the streaming flux (where F is a component of the direction field), and the difference operator δ⁻[u_i] = u_i − u_{i−1}. The parameters ψ⁺ and ψ⁻ are flux limiters that are dynamically adjusted to control spurious oscillations arising from large streaming-flux gradients, which in our case occur at lines of discontinuity (V = V_r and V = V_h) and along lines at which the population tends to pile up (h = 0 and h = 1). The flux limiter used was
Roe's Superbee limiter, which is defined by

ψ⁺_{i−1/2} = ψ[ (f_{i+1} − f^{*(R)}_{i+1/2}) / (f_i − f^{*(R)}_{i−1/2}) ],  (A.13)

ψ⁻_{i+3/2} = ψ[ (f_i − f^{*(R)}_{i+1/2}) / (f_{i+1} − f^{*(R)}_{i+3/2}) ],  (A.14)

ψ[r] = max[0, min(2r, 1), min(r, 2)],  (A.15)

where the first-order Roe flux is defined as

f^{*(R)}_{i+1/2} = ½(f_i + f_{i+1}) − ½(ρ_{i+1} − ρ_i) a_{i+1/2},  (A.16)

a_{i+1/2} = ½(F_i + F_{i+1}).  (A.17)
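The behavior of a Superbee-limited scheme is easy to demonstrate on the one-dimensional advection equation ∂ρ/∂t + a ∂ρ/∂x = 0. The sketch below is our own simplified analogue (constant positive speed, periodic boundaries, forward-Euler time stepping), not the two-dimensional discretization used in the paper; it uses the limiter of equation A.15 in a standard limited-upwind flux.

```python
import numpy as np

def superbee(r):
    """Roe's Superbee flux limiter, psi(r) = max(0, min(2r,1), min(r,2))."""
    return np.maximum(0.0, np.maximum(np.minimum(2 * r, 1.0),
                                      np.minimum(r, 2.0)))

def tvd_advect(rho, a, dx, dt, steps):
    """Second-order limited upwind steps for d(rho)/dt + a d(rho)/dx = 0,
    periodic boundaries, constant speed a > 0 (illustrative sketch)."""
    nu = a * dt / dx               # Courant number; TVD requires 0 < nu <= 1
    assert 0 < nu <= 1
    for _ in range(steps):
        drho = np.roll(rho, -1) - rho        # rho_{i+1} - rho_i
        up = rho - np.roll(rho, 1)           # rho_i - rho_{i-1}
        with np.errstate(divide="ignore", invalid="ignore"):
            r = np.where(drho != 0, up / drho, 0.0)
        # limited flux at the i+1/2 interface: upwind plus limited correction
        flux = a * rho + 0.5 * a * (1 - nu) * superbee(r) * drho
        rho = rho - (dt / dx) * (flux - np.roll(flux, 1))
    return rho
```

Advecting a square pulse with this scheme conserves total mass exactly (conservative form) and never increases the total variation, which is the property that suppresses the spurious oscillations and negative densities described above.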
This method eliminates undesirable oscillations due to discretization and reduces negative densities to values on the order of the numerical round-off error. We found one undesirable feature with this method. If the direction field points in opposite directions between grid points n and n + 1, the populations on either side diverge, one tending to +∞ and the other to −∞. To cure this problem, we set the Roe flux, equation A.16, to zero at the grid boundary between grid points n and n + 1. With this modification, the method works extremely well for the advective terms in equations 2.12 and 2.22. The other terms of the population equation, corresponding to the stochastic input, were also discretized to second order. For the finite-jump model, equation 2.12, the source term σ(t)ρ(V − ε, h, t) was discretized by linear interpolation. The diffusion term in the diffusion approximation, equation 2.22, was discretized by a second-order central difference.

A.3 Discrepancies Between the Direct Simulation and the Population Simulation. We commented in section 3.3 that the accuracy of the population simulation with a step input current is sensitive to the prestep voltage equilibrium of the neuron. In Figure 7, for example, we noted that a fine resolution (300 grid points in V) was necessary to achieve reasonable agreement in the firing rates of the direct simulation and the TVD simulation of the population equation, 2.12. For other cases in which the prejump membrane potential equilibrium was well away from V_h, we generally found that the population simulation with roughly 100 grid points in each direction converged to a direct simulation of sufficiently many neurons. We now comment on why this is so.

In cases where the prejump state of the average neuron is far away from V_h, the main source of numerical error arose from the resolution in the recovery variable h. Since all the neurons equilibrate at h = 0 before the current step, the TVD algorithm imposes large dissipation at the bottom boundary of the phase space in order to avoid grid-scale oscillations in the density ρ. As the grid becomes finer in the h-direction, corresponding to diminished flux limiting and less numerical dissipation, the predicted firing rates of the population model converge to those of the direct simulation (see Figure 6). Population-model firing rates err slightly on the high side owing to the finite accuracy, as some neurons achieve their equilibrium at a value of h slightly larger than zero. These neurons feel a slightly larger depolarizing input, due to a nonvanishing T-channel current, and thus achieve the firing-rate threshold at the onset of the current step faster than their counterparts, which lie initially at h = 0.

When V_eq ≈ V_h before the current step, the TVD simulations are more sensitive to the resolution in the voltage variable. As seen in Figure 7, with a coarse resolution of 50 grid points in V, the firing rate of the population simulation is off by approximately 56% at peak firing. As the resolution is increased to 300 grid points in V, the maximal error drops to about 8%. These observations are attributable to the spurious drift of neurons through V_h owing to finite grid effects, something from which the direct simulation does not suffer. We also observed that the calculated firing rates were relatively insensitive to the resolution in h, which here was chosen to be 50 grid points. Comparison of the equilibrium profiles in Figure 9 gives a better understanding of the firing-rate discrepancies between the direct simulation and the population dynamics simulation shown in Figure 7.
In the direct simulation, stochastic voltage jumps across V_h cause some neurons to equilibrate at h = 0 and some at h = 1, as reflected in the top panel of Figure 9. Similar peaks in the density arise in the population simulation. However, as the lower panel of Figure 9 shows, many more neurons lie near h = 0 relative to the direct simulation. Consequently, more neurons are poised to run through a burst cycle at the onset of the step current in the direct simulation, which is why the peak firing rate associated with the bursting event is greater than that calculated in the population model. This error decreases as the V-direction mesh of the population code is made finer. Increases in resolution beyond 50 grid points in the h-direction did not alter the results significantly. To deal effectively with these finite grid issues, one option is to employ a variable phase-space mesh that has finer resolution near points of discontinuity along the voltage axis (at V_h and V_r) and near the grid lines at h = 0 and h = 1, where the population tends to pile up. We leave such refinements to future work.
[Figure 9: two surface plots of the density ρ over the (V, h) phase plane; top panel, direct simulation; bottom panel, population simulation on a 50 × 50 grid.]

Figure 9: Equilibrium population distributions corresponding to a mean driving current I = 0.1 mA/cm². (Top) Direct simulation with stochastic driving (fixed jumps of size ε = 1 mV). (Bottom) Population dynamics simulation (50 × 50 grid). Note that many more neurons equilibrate near h = 0 in the population dynamics simulation relative to that of the direct simulation.
Acknowledgments This work has been supported by NIH/NEI EY 11276, NIH/NIMH MH50166, NIH/NEI EY01867, NIH/NEI EY9314, DARPA MDA 972-01-1-0028, and ONR N00014-96-1-0492. E. K. is the Jules and Doris Stein Research to Prevent Blindness Professor at the Ophthalmology Department at Mount Sinai.
References

Abramowitz, M. A., & Stegun, I. A. (1972). Handbook of mathematical functions with formulas, graphs, and mathematical tables. Washington, DC: U.S. Government Printing Office.

Guido, W., Lu, S., & Sherman, S. M. (1992). Relative contributions of burst and tonic responses to the receptive field properties of lateral geniculate neurons in the cat. J. Neurophysiol., 68, 2199–2211.

Guido, W., & Weyand, T. G. (1995). Burst responses in lateral geniculate neurons of the awake behaving cat. J. Neurophysiol., 74, 1782–1786.

Hirsch, C. (1992). Numerical computation of internal and external flows. New York: Wiley.

Jahnsen, H., & Llinas, R. (1982). Electrophysiology of mammalian thalamic neurons in vitro. Nature, 297, 406–408.

Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.

Knight, B. W. (2000). Dynamics of encoding in neuron populations: Some general mathematical features. Neural Comp., 12, 473–518.

Knight, B. W., Manin, D., & Sirovich, L. (1996). Dynamical models of interacting neuron populations. In E. C. Gerf (Ed.), Symposium on Robotics and Cybernetics: Computational Engineering in System Applications. Lille, France: Cité Scientifique.

Knight, B. W., Omurtag, A., & Sirovich, L. (2000). The approach of a neuron population firing rate to a new equilibrium: An exact theoretical result. Neural Comp., 12, 1045–1055.

Livingstone, M. S., & Hubel, D. H. (1981). Effects of sleep and arousal on the processing of visual information in the cat. Nature, 291, 554–561.

McCormick, D. A., & Feeser, H. R. (1990). Functional implications of burst firing and single spike activity in lateral geniculate relay neurons. Neurosci., 39, 103–113.

McLaughlin, D., Shapley, R., Shelley, M., & Wielaard, D. J. (2000). A neuronal network model of macaque primary visual cortex (V1): Orientation selectivity and dynamics in the input layer 4Cα. PNAS, 97, 8087–8092.

Mukherjee, P., & Kaplan, E. (1995). Dynamics of neurons in the cat lateral geniculate nucleus: In vivo electrophysiology and computational modeling. J. Neurophysiol., 74, 1222–1243.

Mukherjee, P., & Kaplan, E. (1998). The maintained discharge of neurons in the cat lateral geniculate nucleus: Spectral analysis and computational modeling. Vis. Neurosci., 15, 529–539.

Murphy, P., Duckett, S., & Sillito, A. (1999). Feedback connections to the lateral geniculate nucleus and cortical response properties. Science, 286, 1552–1554.

Nykamp, D., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci., 8, 19–50.

Omurtag, A., Kaplan, E., Knight, B. W., & Sirovich, L. (2000). A population approach to cortical dynamics with an application to orientation tuning. Network: Comput. Neural Syst., 11, 247–260.
Omurtag, A., Knight, B. W., & Sirovich, L. (2000). On the simulation of large populations of neurons. J. Comp. Neurosci., 8, 51–63.

Sakurai, J. J. (1994). Modern quantum mechanics. Reading, MA: Addison-Wesley.

Sherman, S. M. (1996). Dual response modes in lateral geniculate neurons: Mechanisms and functions. Vis. Neurosci., 13, 205–213.

Sherman, S. M., & Guillery, R. W. (1996). Functional organization of thalamocortical relays. J. Neurophysiol., 76, 1367–1395.

Sillito, A. M., Jones, H. E., Gerstein, G. L., & West, D. C. (1994). Feature-linked synchronization of thalamic relay cell firing induced by feedback from the visual cortex. Nature, 369, 479–482.

Sirovich, L., Knight, B. W., & Omurtag, A. (2000). Dynamics of neuronal populations: The equilibrium solution. SIAM J. Appl. Math., 60, 2009–2028.

Smith, G. D., Cox, C. L., Sherman, S. W., & Rinzel, J. (2000). Fourier analysis of sinusoidally driven thalamocortical relay neurons and a minimal integrate-and-fire-or-burst model. J. Neurophysiol., 83, 588–610.

Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465.

Steriade, M., & Contreras, D. (1995). Relations between cortical and thalamic cellular events during transition from sleep patterns to paroxysmal activity. J. Neurosci., 15, 623–642.

Tiesinga, P. H. E., & José, J. V. (2000). Synchronous clusters in a noisy inhibitory neural network. J. Comp. Neurosci., 9, 49–65.

Troy, J. B., & Robson, J. G. (1992). Steady discharges of X and Y retinal ganglion cells of cat under photopic illuminance. Vis. Neurosci., 9, 535–553.

Tuckwell, H. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.

Received February 13, 2001; accepted August 10, 2001.
NOTE
Communicated by Ad Aertsen
Stable Propagation of Activity Pulses in Populations of Spiking Neurons Werner M. Kistler
[email protected] Swiss Federal Institute of Technology Lausanne, 1015 Lausanne EPFL, Switzerland, and Neuroscience Institute, Department of Anatomy, FGG, Erasmus University Rotterdam, 3000DR Rotterdam, The Netherlands
Wulfram Gerstner
wulfram.gerstner@epfl.ch
Swiss Federal Institute of Technology Lausanne, 1015 Lausanne EPFL, Switzerland
We investigate the propagation of pulses of spike activity in a neuronal network with feedforward couplings. The neurons are of the spike-response type with a firing probability that depends linearly on the membrane potential. After firing, neurons enter a phase of refractoriness. Spike packets are described in terms of the moments of the firing-time distribution so as to allow for an analytical treatment of the evolution of the spike packet as it propagates from one layer to the next. Analytical results and simulations show that depending on the synaptic coupling strength, a stable propagation of the packet with constant waveform is possible. Crucial for this observation is neither the existence of a firing threshold nor a sigmoidal gain function (both are absent in our model) but the refractory behavior of the neurons.
Neural Computation 14, 987–997 (2002)  © 2002 Massachusetts Institute of Technology

1 Introduction

Recently, the propagation of sharp pulses of spike activity through various types of neuronal networks has attracted a lot of attention. There are basically two complementary scenarios in which a temporally precise transmission of spikes has been investigated in model studies: spatially extended networks with distance-dependent couplings and layered feedforward structures of pools of neurons. Spatially extended networks have properties similar to those of excitable media and exhibit, for example, solitary waves of spike activity (Kistler, Seitz, & van Hemmen, 1998; Ermentrout, 1998; Bressloff, 1999; Kistler, 2000). Layered feedforward networks, also known as synfire chains (Abeles, 1991), can be seen as a discretized version of the former, where the smooth propagation of a wave of activity is replaced by a discrete transmission of spikes from one layer to the next. Similar to solitary
waves in spatially extended networks, the transmission function for spikes can produce an attractive fixed point for the shape of the firing-time distribution in each layer. Such a "spike packet" can propagate in a stable way from one layer to the next (Abeles, 1991; Aertsen, Diesmann, & Gewaltig, 1996; Maršálek, Koch, & Maunsell, 1997; Gewaltig, 2000; Diesmann, Gewaltig, & Aertsen, 1999).

Similarly to Abeles (1991) and Diesmann et al. (1999), we consider in this article a chain of M pools of identical neurons with feedforward coupling. Each neuron is described by the spike response model, a generalization of the integrate-and-fire model. The spike train of neuron i is formalized as a sum of δ functions, S_i(t) = Σ_f δ(t − t_i^f), where the firing times t_i^f of neuron i are labeled by an upper index f. The membrane potential u_i of a given neuron is the linear response to pre- and postsynaptic action potentials,

u_i(t, t̂_i) = Σ_{j, j≠i} v_{ij} ∫_0^∞ dt′ ε(t′) S_j(t − t′) + η(t − t̂_i).  (1.1)

Here, the response kernel ε describes the form of an elementary postsynaptic potential, v_{ij} is the synaptic coupling strength, and η is a (negative) afterpotential that accounts for the reset of the membrane potential after the last spike at t̂_i = max{t_i^f | t_i^f < t, f = 1, 2, …} and for refractoriness (Gerstner & van Hemmen, 1992; Gerstner, Ritz, & van Hemmen, 1993; Kistler, Gerstner, & van Hemmen, 1997). In the absence of synaptic input, u_i = 0 corresponds to the resting potential of the neuron. The influence of the last-but-one and earlier spikes is neglected, so that spike triggering can be described by an input-dependent renewal process (Cox, 1962).

Noise is implemented in the model by a stochastic spike-triggering mechanism. New spikes are defined through a stochastic process that depends on the value of the membrane potential. The probability that a spike will occur in the infinitesimal interval [t, t + dt) is

prob{spike in [t, t + dt)} = f[u_i(t, t̂_i)] dt.  (1.2)
The function f is called the escape rate (or hazard function) (Plesser & Gerstner, 2000). For simplicity, we assume a semilinear dependence of the firing probability on the membrane potential,

f(u) = [u]₊,  (1.3)

with [u]₊ = u if u > 0 and [u]₊ = 0 elsewhere. f(0) = 0 implies that the neuron is not spontaneously active. This completes the definition of our single-neuron model.

If we assume that neuron i has fired its last action potential at time t̂_i, we can calculate the probability s_i(t, t̂_i) that it will "survive" without firing
until time t > t̂_i,

s_i(t, t̂_i) = exp{ −∫_{t̂_i}^t f[u_i(t′, t̂_i)] dt′ }  (1.4)

(cf. Cox, 1962; Gerstner, 2000). The probability density for the next firing time is thus

p_i(t, t̂_i) = −(∂/∂t) s_i(t, t̂_i) = f[u_i(t, t̂_i)] exp{ −∫_{t̂_i}^t f[u_i(t′, t̂_i)] dt′ }.  (1.5)
We consider M pools containing N neurons each that are connected in a purely feedforward manner; neurons from pool n project only to pool n + 1, and there are no synapses between neurons from the same pool. We assume all-to-all connectivity between two successive pools with uniform synaptic weights v_{ij} = v/N. The membrane potential of a neuron i from pool n + 1 is thus

u_i(t, t̂_i) = (v/N) Σ_{j∈Γ(n)} ∫_0^∞ ε(t′) S_j(t − t′) dt′ + η(t − t̂_i)
            = v ∫_0^∞ ε(t′) A_n(t − t′) dt′ + η(t − t̂_i),  (1.6)

with i ∈ Γ(n + 1), Γ(n) the index set of all neurons that belong to pool n, and A_n(t) = N⁻¹ Σ_{j∈Γ(n)} S_j(t) the population activity of pool n. Integration of A_n over a short interval of time thus gives the portion of neurons from pool n that fire an action potential during this interval. The coupling strength v between two successive pools describes the amplitude of the resulting postsynaptic potential if all neurons in the presynaptic pool were to fire synchronously. A single action potential thus produces only weak postsynaptic potentials that, according to equation 1.3, have only a low chance of triggering the neuron.

The spike trains S_i and the population activity A_n are random variables. Each pool is supposed to contain a large number of neurons (N ≫ 1), so that we can replace the population activity A_n in equation 1.6 by its expectation value Ā_n, which is given by a normalization condition (Gerstner, 2000),

∫_{−∞}^t s_n(t, t̂) Ā_n(t̂) dt̂ = 1 − s_n(t).  (1.7)

Here, s_n(t) = s_n(t, −∞) accounts for those neurons that have been quiescent in the past (i.e., have not fired up to time t). The strong law of large numbers ensures that the population activity A_n converges in probability to Ā_n (in the weak topology) as the number of neurons in the pool goes to infinity,

prob{ lim_{N→∞} ∫_{−∞}^∞ A_n(t) w(t) dt = ∫_{−∞}^∞ Ā_n(t) w(t) dt } = 1,  (1.8)

for any test function w ∈ C^∞(ℝ) (cf. Lamperti, 1996).
2 Pulse Propagation

Simulation studies (Diesmann et al., 1999) and analytic calculations (Gewaltig, 2000) suggest that a pronounced refractory behavior is required in order to obtain a stable propagation of a spike packet from one layer to the next. If neurons were allowed to fire more than once within one spike packet, the number of spikes per packet and thus the width of the packet would grow in each step. Therefore, we use a strong and long-lasting afterpotential η so that each neuron can fire only once during each pulse. The survivor function thus equals unity for the duration t_AP of the afterpotential: s_n(t, t̂) = 1 for 0 < t − t̂ < t_AP, with t_AP large as compared to the typical pulse width.

Let us denote by T_n the moment when a pulse packet arrives at pool n. We assume that for t < T_n, all neurons in layer n have been inactive: A_n(t) = 0 for t < T_n. Differentiation of equation 1.7 with respect to t (and dropping bars in order to keep notation simple) leads to

A_n(t) = −(∂/∂t) s_n(t) = f[u_n(t)] exp{ −∫_{−∞}^t f[u_n(t′)] dt′ },  (2.1)

with

u_n(t) = v ∫_0^∞ ε(t′) A_{n−1}(t − t′) dt′.  (2.2)
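Equations 2.1 and 2.2 can also be iterated numerically without any gamma approximation. The sketch below is our own discretization on a uniform grid, using the normalized α kernel of equation 2.4 with τ = 1 ms; the initial pulse shape and grid parameters are illustrative choices.

```python
import numpy as np

def next_layer_activity(A_prev, t, v, tau=1.0):
    """Map A_{n-1}(t) to A_n(t) on a uniform grid t: u_n = v * (eps * A_{n-1})
    with eps the alpha kernel of equation 2.4 (eq. 2.2), then
    A_n = f[u_n] exp(-int f[u_n]) with f(u) = [u]_+ (eq. 2.1)."""
    dt = t[1] - t[0]
    eps = (t / tau**2) * np.exp(-t / tau)              # eq. 2.4, unit area
    u = v * np.convolve(A_prev, eps)[: len(t)] * dt    # eq. 2.2
    f = np.maximum(u, 0.0)
    return f * np.exp(-np.cumsum(f) * dt)              # eq. 2.1

# Iterating layers: because the kernel has unit area, the packet amplitude
# obeys a_n = 1 - exp(-v a_{n-1}) regardless of the packet's shape.
t = np.arange(0.0, 100.0, 0.02)
A = np.exp(-((t - 5.0) ** 2) / 0.5) / np.sqrt(0.5 * np.pi)  # unit-area pulse
for _ in range(8):
    A = next_layer_activity(A, t, v=2.0)
```

With v = 2, the area of the packet converges toward the nontrivial fixed point of a = 1 − e^{−2a} (about 0.797) within a few layers.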
Equation 2.1 provides an explicit expression for the firing-time distribution A_n(t) in layer n as a function of the time course of the membrane potential. The membrane potential u_n(t) in turn depends on the time course of the activity A_{n−1}(t) in the previous layer, as shown in equation 2.2. Note that both equations are independent of the network size N; their derivation, however, relies on the strong law of large numbers, so that N ≫ 1 is implicitly assumed.

Both equations 2.1 and 2.2 can easily be integrated numerically; an analytic treatment, however, is difficult even if a particularly simple form of the response kernel ε is chosen. Following Diesmann et al. (1999), we therefore concentrate on the first few moments of the firing-time distribution in order to characterize the transmission properties. More precisely, we approximate the firing-time distribution A_{n−1}(t) by a gamma distribution and calculate, in step i, the zeroth, first, and second moment of the resulting membrane potential in the following layer n. In step ii, we use these results to approximate the time course of the membrane potential by a gamma distribution and calculate the moments of the corresponding firing-time distribution in layer n. We thus obtain an analytical expression for the amplitude and the variance of the spike packet in layer n as a function of the amplitude and variance of the spike packet in the previous layer.

In step i, we assume that the activity A_{n−1}(t) in layer n − 1 is given by a gamma distribution with parameters a_{n−1} and λ_{n−1}, that is,

A_{n−1}(t) = a_{n−1} γ_{a_{n−1}, λ_{n−1}}(t).  (2.3)
Here, a_{n−1} is the portion of neurons of layer n − 1 that contribute to the spike packet, γ_{a,λ}(t) = t^{a−1} e^{−t/λ} H(t)/[Γ(a) λ^a] is the density function of the gamma distribution, Γ the complete gamma function, and H the Heaviside step function with H(t) = 1 for t > 0 and H(t) = 0 else. The mean μ and the variance σ² of a gamma distribution with parameters a and λ are μ = aλ and σ² = aλ², respectively.

The membrane potential u_n(t) in the next layer results from a convolution of A_{n−1} with the response kernel ε. This is the only point where we have to refer explicitly to the shape of the ε kernel. For simplicity, we use a normalized α function,

ε(t) = (t/τ²) e^{−t/τ} H(t) ≡ γ_{2,τ}(t),  (2.4)
with time constant τ. The precise form of ε is not important; similar results hold for a different choice of ε. In the present context, spikes are mostly triggered during the rising phase of the (excitatory) postsynaptic potential. We therefore set τ = 1 ms, appropriate for fast AMPA-mediated potentials, rather than to the passive membrane time constant, which is about one order of magnitude larger.

We want to approximate the time course of the membrane potential by a gamma distribution γ_{ã_n, λ̃_n}. The parameters¹ ã_n and λ̃_n are chosen so that the first few moments of the distribution are identical to those of the membrane potential, that is,

u_n(t) ≈ ã_n γ_{ã_n, λ̃_n}(t),  (2.5)

with

∫_0^∞ t^n u_n(t) dt = ∫_0^∞ t^n ã_n γ_{ã_n, λ̃_n}(t) dt,  n ∈ {0, 1, 2}.  (2.6)

¹ We use a tilde to identify parameters that describe the time course of the membrane potential. Parameters without a tilde refer to the firing-time distribution.
As far as the first two moments are concerned, a convolution of two distributions reduces to a mere summation of their means and variances. Therefore, the convolution of A_{n−1} with ε basically translates the center of mass by 2τ and increases the variance by 2τ². Altogether, the amplitude, center of mass, and variance of the time course of the membrane potential in layer n are

ã_n = v a_{n−1},   μ̃_n = μ_{n−1} + 2τ,   σ̃_n² = σ_{n−1}² + 2τ²,  (2.7)

respectively. The parameters ã_n and λ̃_n of the gamma distribution are directly related to the mean and variance: ã_n = μ̃_n²/σ̃_n², λ̃_n = σ̃_n²/μ̃_n.

In step ii, we calculate the firing-time distribution that results from a membrane potential with time course given by a gamma distribution as in equation 2.5. We use the same strategy as in step i; that is, we calculate the first few moments of the firing-time distribution and approximate it by the corresponding gamma distribution,

A_n(t) ≈ a_n γ_{a_n, λ_n}(t).  (2.8)
The zeroth moment of $A_n(t)$ (the portion of neurons in layer $n$ that participates in the activity pulse) can be cast in a particularly simple form; the expressions for higher-order moments, however, contain integrals that have to be evaluated numerically. For amplitude, center of mass, and variance of $A_n(t)$, we find

$$a_n = 1 - e^{-\tilde a_n}, \qquad \mu_n = m_n^{(1)}, \qquad \sigma_n^2 = m_n^{(2)} - \bigl(m_n^{(1)}\bigr)^2, \quad (2.9)$$

with

$$m_n^{(k)} = \bigl(1 - e^{-\tilde a_n}\bigr)^{-1} \int_0^\infty u_n(t)\, \exp\Bigl[-\int_{-\infty}^{t} u_n(t')\, dt'\Bigr]\, t^k\, dt = \frac{\tilde a_n\, \tilde\lambda_n^k}{\bigl(1 - e^{-\tilde a_n}\bigr)\, \Gamma(\tilde\alpha_n)} \int_0^\infty \exp\bigl[-t - \tilde a_n\, \Gamma(\tilde\alpha_n, 0, t) / \Gamma(\tilde\alpha_n)\bigr]\, t^{k-1+\tilde\alpha_n}\, dt \quad (2.10)$$

being the $k$th moment of the firing-time distribution (see equation 2.1) that results from a gamma-shaped time course of the membrane potential. Here, $\Gamma(z, t_1, t_2) = \int_{t_1}^{t_2} t^{z-1} e^{-t}\, dt$ is the generalized incomplete gamma function. The last equality in equation 2.10 has been obtained by substituting $\tilde a_n\, \gamma_{\tilde\alpha_n, \tilde\lambda_n}(t)$ for $u_n(t)$.
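The zeroth-moment identity in equation 2.9 can be checked numerically: integrating the firing-time density $A(t) = u(t) \exp[-\int u]$ for a gamma-shaped $u(t)$ gives $1 - e^{-\tilde a_n}$, whatever the gamma parameters. A minimal sketch (plain Python; the function names and parameter values are ours, chosen for illustration):

```python
import math

def gamma_density(t, alpha, lam):
    """Density of a gamma distribution with shape alpha and scale lam."""
    if t <= 0.0:
        return 0.0
    return t ** (alpha - 1.0) * math.exp(-t / lam) / (lam ** alpha * math.gamma(alpha))

def zeroth_moment(a_tilde, alpha, lam, t_max=50.0, dt=1e-3):
    """Integrate the firing-time density A(t) = u(t) * exp(-int_0^t u)
    for the gamma-shaped potential u(t) = a_tilde * gamma_density(t)."""
    cumulative = 0.0   # running value of int_0^t u(t') dt'
    mass = 0.0         # running value of int_0^t A(t') dt'
    t = dt / 2.0
    while t < t_max:
        u = a_tilde * gamma_density(t, alpha, lam)
        mass += u * math.exp(-cumulative) * dt
        cumulative += u * dt
        t += dt
    return mass

# Equation 2.9 predicts a zeroth moment of 1 - exp(-a_tilde),
# independent of the gamma parameters alpha and lam.
a_tilde = 1.5
print(zeroth_moment(a_tilde, alpha=4.0, lam=0.5))   # agrees with the line below
print(1.0 - math.exp(-a_tilde))
```

The agreement is exact in the continuum limit, since $A(t)$ is the derivative of $1 - \exp[-\int_0^t u(t')\, dt']$.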
Propagation of Activity Pulses
A combination of equations 2.7 and 2.9 yields explicit expressions for the parameters $(a_n, \mu_n, \sigma_n)$ of the firing-time distribution in layer $n$ as a function of the parameters in the previous layer. The mapping $(a_{n-1}, \mu_{n-1}, \sigma_{n-1}) \to (a_n, \mu_n, \sigma_n)$ is closely related to the neural transmission function for pulse-packet input, as discussed by Diesmann et al. (1999). Particularly interesting is the iteration that describes the amplitude of the spike packet,

$$a_n = 1 - e^{-v\, a_{n-1}}, \quad (2.11)$$

which is independent of the shape of the spike packet. If $v \le 1$, the mapping $a_{n-1} \to a_n$ has a single (globally attractive) fixed point at $a = 0$. In this case, no stable propagation of spike packets is possible, since any packet will finally die out.² For $v > 1$, a second fixed point at $a_\infty \in (0, 1)$ emerges through a pitchfork bifurcation. The new fixed point is stable, and its basin of attraction contains the open interval $(0, 1)$. The fact that the all-off state at $a = 0$ is unstable for $v > 1$ is related to the fact that there is no real firing threshold in our model.

Figure 1 shows examples of the propagation of a spike packet for various synaptic coupling strengths and initial conditions. Theoretical predictions based on equations 2.7 and 2.9 are compared to simulations of a network with $N = 1000$ neurons per layer. In each subfigure, a series of bar charts shows the firing-time distribution of neurons from layers $n = 0$ to $n = 5$. The flow field illustrates the evolution of the amplitude and the width of the spike packet as it propagates from one layer to the next. In Figure 1A, a small coupling strength has been chosen ($v = 1$), so that iteration 2.11 has only a single fixed point at $a = 0$. Therefore, any spike packet will die out, whatever the initial firing-time distribution in layer $n = 0$. Figure 1B is another example, for $v = 2$. Here, iteration 2.11 has a stable fixed point at $a \approx 0.80$, and both simulations and theory show that this fixed point corresponds to a stable propagation of spike packets from one layer to the next. Finally, in Figure 1C ($v = 4$), we demonstrate that the iteration $(a_{n-1}, \mu_{n-1}, \sigma_{n-1}) \to (a_n, \mu_n, \sigma_n)$ converges to a mere translation of the spike packet with an approximately fixed waveform. This waveform is globally attractive, so that even a weak and broadly tuned initial firing distribution will become sharper and form a narrow spike packet.
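Iteration 2.11 is easy to explore numerically. The sketch below (plain Python; the function names are ours, not the article's) iterates the amplitude map and recovers the fixed points quoted in the Figure 1 caption:

```python
import math

def iterate_amplitude(v, a0, n_layers):
    """Iterate the amplitude map a_n = 1 - exp(-v * a_{n-1}), equation 2.11."""
    a = a0
    trace = [a]
    for _ in range(n_layers):
        a = 1.0 - math.exp(-v * a)
        trace.append(a)
    return trace

def stable_fixed_point(v, a0=0.5, iterations=1000):
    """Approximate the attractive fixed point reached from amplitude a0."""
    return iterate_amplitude(v, a0, iterations)[-1]

# v <= 1: the only fixed point is a = 0, so any packet dies out
# (slowly, since the decay is only polynomial in n at v = 1).
print(stable_fixed_point(1.0))            # -> a small value decaying toward 0
# v > 1: a nonzero fixed point a_inf in (0, 1) appears.
print(round(stable_fixed_point(2.0), 2))  # -> 0.8, as in Figure 1B
print(round(stable_fixed_point(4.0), 2))  # -> 0.98, as in Figure 1C
```

Note that the map is independent of the packet shape, so this one-dimensional iteration already decides whether propagation is stable.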
² The decay of the activity is exponential in $n$ if $v < 1$; for $v = 1$, the decay is polynomial in $n$.

3 Discussion

Any information-processing scheme that relies on the precise timing of action potentials obviously requires a means to transmit spikes without
destroying their temporal structure. In this article, we have shown analytically that despite the noise in the spike-generating mechanism, packets of (almost) synchronous spikes can propagate in a feedforward structure from one layer to the next so that their width is preserved, provided that the synaptic coupling strength is sufficiently large.

Our approach is closely related to the concept of synfire chains (Abeles, 1991). While Abeles stresses the importance of a nonlinear transfer function,³ our results are not based on a nonlinear transfer function but are a direct consequence of the refractory behavior of the neurons. Noise and broad postsynaptic potentials tend to smear out initially sharp spike packets. If, however, the synaptic coupling is strong enough, then postsynaptic neurons will start firing during the rising phase of their membrane potential. If, in addition, these neurons show pronounced refractory behavior, then firing will cease even before the postsynaptic potentials have reached their maximum. With respect to precise timing, refractoriness thus counteracts the effects of noise and synaptic transmission.

As a consequence of the linear transfer function, no bistability between asynchronously firing neurons and a propagating pulse could be observed in our model. Depending on the synaptic coupling strength, there is always only one stable fixed point: either the all-off state or a propagating pulse.

Figure 1: Facing page. Propagation of spike packets through a feedforward network. (A) Evolution of the firing-time distribution of a spike packet as it propagates from one layer to the next ($n = 0, 1, \ldots, 4$). The neurons in layer $n = 0$ are driven by an external input that creates a sharp initial spike packet given by a gamma distribution with $\alpha_0 = 10$ and $\lambda_0 = 0.1$. Initial amplitude is $a_0 = 1$. The bars (bin width 0.2) represent the results of a simulation with $N = 1000$ neurons per layer; the solid line is the firing-time distribution as predicted by the theory; cf. equations 2.7 and 2.9. The "flow field" to the right characterizes the transmission function for spike packets in terms of their amplitude $a_n$ and width $\sigma_n = \sqrt{\alpha_n}\, \lambda_n$. Open symbols connected by a dashed line represent the simulations shown to the left; filled symbols connected by solid lines represent the corresponding theoretical trajectories. Neurons between layers are only weakly coupled ($v = 1$), so that the packet will fade out. Time is given in units of the membrane time constant $\tau$. (B) Same as in A but with increased coupling strength $v = 2$. There is an attractive fixed point of the flow field at $a = 0.80$ and $\sigma = 2.9$ that corresponds to the stable waveform of the spike packet. (C) Similar plots as in A and B but with a strong coupling strength ($v = 4$). Initial stimulation is weak ($a_0 = 0.2$) and broad ($\sigma_0 = 2$). As the packet propagates through a few layers, it quickly reaches a stable waveform with amplitude $a = 0.98$ and $\sigma = 1.5$.

³ The transfer function of Abeles can be retrieved if we replace our equation 1.3 by $f(u) \propto \int_\vartheta^\infty \rho(u' - u)\, du'$, where the membrane potential density $\rho(u' - u)$ is approximated by a gaussian with mean $u$ and variance $\sigma^2$; cf. Abeles (1991, sections 4.5, 7.1–7.3).
This seems to be a severe limitation for the computational usefulness of the system, because in the latter case even a single action potential can ultimately lead to a full-size pulse. Note, however, that this statement holds true only in the limit $N \to \infty$. Due to the intrinsic probabilistic properties of our model, finite-size effects become important as soon as the activity $A_n$ is no longer large compared to $N^{-1}$. In a finite network, a few initial action potentials will lead to a full-size pulse only with a certain probability smaller than one, depending on the size of the network and the distance from the bifurcation. Recent simulation studies (Diesmann et al., 1999) have confirmed that a slightly more general model with a nonlinear transfer function can indeed exhibit bistability, where neurons are either firing asynchronously at a low rate or participating in the transmission of a sharp spike packet.
References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Aertsen, A., Diesmann, M., & Gewaltig, M.-O. (1996). Propagation of synchronous spiking activity in feedforward neural networks. J. Physiol. Paris, 90, 243–247.
Bressloff, P. C. (1999). Synaptically generated wave propagation in excitable neural media. Phys. Rev. Lett., 82, 2979–2982.
Cox, D. R. (1962). Renewal theory. London: Methuen.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Ermentrout, B. (1998). The analysis of synaptically generated traveling waves. J. Comput. Neurosci., 5, 191–208.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput., 12, 43–89.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybern., 69, 503–515.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gewaltig, M.-O. (2000). Evolution of synchronous spike volleys in cortical networks: Network simulations and continuous probabilistic models. Doctoral dissertation, Shaker Verlag, Aachen, Germany.
Kistler, W. M. (2000). Stability properties of solitary waves and periodic wave trains in a two-dimensional network of spiking neurons. Phys. Rev. E, 62(6), 8834–8837.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Kistler, W. M., Seitz, R., & van Hemmen, J. L. (1998). Modeling collective excitations in cortical tissue. Physica D, 114, 273–295.
Lamperti, J. (1996). Probability: A survey of the mathematical theory. New York: Wiley.
Maršálek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA, 94, 735–740.
Plesser, H. E., & Gerstner, W. (2000). Noise in integrate-and-fire neurons: From stochastic input to escape rates. Neural Comput., 12, 367–384.

Received October 31, 2000; accepted July 30, 2001.
LETTER
Communicated by Peter Dayan
Population Coding and Decoding in a Neural Field: A Computational Study

Si Wu
[email protected]
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan, and Department of Computer Science, Sheffield University, U.K.

Shun-ichi Amari
[email protected]
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan

Hiroyuki Nakahara
[email protected]
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan, and Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan

This study uses a neural field model to investigate computational aspects of population coding and decoding when the stimulus is a single variable. A general prototype model for the encoding process is proposed, in which neural responses are correlated, with strength specified by a gaussian function of the difference in their preferred stimuli. Based on the model, we study the effect of correlation on the Fisher information, compare the performances of three decoding methods that differ in the amount of encoding information being used, and investigate the implementation of the three methods by using a recurrent network. This study not only rediscovers the main results in the existing literature in a unified way but also reveals important new features, especially when the neural correlation is strong. As the neural correlation of firing becomes larger, the Fisher information decreases drastically. We confirm that as the width of correlation increases, the Fisher information saturates and no longer increases in proportion to the number of neurons. However, we prove that as the width increases further, wider than $\sqrt{2}$ times the effective width of the tuning function, the Fisher information increases again, and it increases without limit in proportion to the number of neurons. Furthermore, we clarify the asymptotic efficiency of the maximum likelihood inference (MLI) type of decoding methods for correlated neural signals. We show that when the correlation covers a nonlocal range of the population (excepting the uniform correlation and when the noise is extremely small), the MLI type of method, whose decoding error satisfies a Cauchy-type distribution, is not asymptotically efficient. This implies that the variance is no longer adequate to measure decoding accuracy.

Neural Computation 14, 999–1026 (2002)
© 2002 Massachusetts Institute of Technology
1 Introduction
Population coding is a method to represent stimuli by using the joint activities of a number of neurons. Experimental studies have revealed that this coding paradigm is widely used in the sensory and motor areas of the brain. For example, in the visual area MT, neurons are tuned to the direction of motion (Maunsell & Van Essen, 1983). In response to an object moving in a particular direction, many neurons in MT fire, with a noise-corrupted and bell-shaped activity pattern across the population. The moving direction of the object is retrieved from the population activity and is thus immune to the fluctuations present in any single neuron's signal.

From the theoretical point of view, population coding, which concerns how information is represented in the brain, is a prerequisite for more complex issues of brain function. It is also one of the few mathematically well-formulated problems in neuroscience, in the sense that it grasps the essential features of neural coding and yet is simple enough for theoretical analysis. There has been extensive work to understand population coding theoretically (Abbott & Dayan, 1999; Brunel & Nadal, 1998; Georgopoulos, Kalaska, Caminiti, & Massey, 1982; Nakahara, Wu, & Amari, 2001; Paradiso, 1988; Pouget, Zhang, Deneve, & Latham, 1998; Salinas & Abbott, 1994; Seung & Sompolinsky, 1993; Wu, Nakahara, Murata, & Amari, 2000; Wu, Nakahara, & Amari, 2001; Yoon & Sompolinsky, 1999), addressing, for example, how much information is included in population coding, what the optimal decoding accuracy is given an encoding model, how to construct a simple and yet accurate enough decoding strategy, how to implement a decoding method in a biological network, and what the effect of correlation is on decoding accuracy. These issues have been studied by using various models for the encoding process. In this article, based on a unified encoding model in a neural field, we systematically study all of the above computational aspects.
This article is more than a review of the literature; it elucidates the published results in a more transparent way. Furthermore, we explore important new features concerning how the behavior of the Fisher information changes as the width of correlation increases. The efficiencies of various decoding methods are also evaluated.

To study population coding, a prototype model for the encoding process needs to be constructed. We consider the general case in which neural activities are correlated, as seen in experimental data (Fetz, Toyama, & Smith, 1991; Gawne & Richmond, 1993; Lee, Port, Kruse, & Georgopoulos, 1998; Zohary, Shadlen, & Newsome, 1994), under the assumption of gaussian additive noise. The uncorrelated case is handled as a special one in which the correlation length is zero. The correlation model we consider is a continuous neural field (Amari, 1977; Giese, 1999) in which neurons are pairwise correlated, with a strength given by a gaussian function of the difference in their preferred stimuli. Depending on the width of the gaussian function, the correlation
form varies to include noncorrelation, local correlation, short-range correlation, wide-range correlation, and uniform correlation. Compared with other prototype models in the literature (Abbott & Dayan, 1999; Johnson, 1980; Snippe & Koenderink, 1992; Yoon & Sompolinsky, 1999), this one has the advantage of simplifying the calculations and provides a clear picture of the results. As we show, due to the properties of the gaussian function and the continuous extension of the model, many calculations can be done more easily in the Fourier domain. It is also possible to apply the proposed method of neural fields to the more general case of firing-rate-dependent variances, including multiplicative and Poisson-type noises, as treated in Wilke and Eurich (in press).

Based on the proposed encoding model, we first calculate the Fisher information. The inverse of the Fisher information, the Cramér-Rao bound, defines the optimal accuracy for an unbiased estimator to achieve. When no correlation exists, that is, when the correlation width is zero, the Fisher information increases in proportion to the number of neurons, or the neural density in the field. As the width of correlation increases, the Fisher information decreases rapidly. Abbott and Dayan (1999) and Yoon and Sompolinsky (1999) found that if the neural correlation covers a nonlocal range of the population, the Fisher information does not increase but saturates even when the number of neurons increases. The same behavior is observed in a simpler way in this article. We also show that as the range of correlation becomes wider, that is, when the width of correlation becomes larger than $\sqrt{2}$ times the effective length of the tuning function, the Fisher information increases again in proportion to the number of neurons. This is an interesting new finding. Such a phenomenon is observed more generally in the case of multiplicative noise (Wilke & Eurich, in press), which we have also confirmed by our method, although it is not described in this article.
We then study three decoding methods and compare their performances. All of them are formulated as the maximum likelihood inference (MLI) type (including the conventional center of mass method), whereas they differ in their knowledge of the true encoding scheme. It turns out that a decoding method that keeps the knowledge of the tuning function but neglects the neural correlation is a good compromise between computational complexity and decoding accuracy, supporting the findings in Wu, Nakahara, et al. (2000) and Wu et al. (2001).

In previous work, we pointed out that the MLI type of decoding method may not be asymptotically efficient for some strong correlation structures (Wu, Nakahara, et al., 2000; Wu et al., 2001). In this article, we prove that this is indeed the case when the correlation covers a nonlocal range of the population (excepting the uniform correlation and when the noise is extremely small). Here, the estimated or decoded position of the stimulus is no longer subject to the gaussian distribution, but to a Cauchy-type distribution in which the mean and variance diverge. In other words, the standard paradigm of statistical estimation does not hold. This is also a
new finding in population coding, and we discuss the consequences of this property.

We also investigate the network implementation of the three decoding methods, following the idea of Pouget et al. (Deneve, Latham, & Pouget, 1999; Pouget & Zhang, 1997; Pouget et al., 1998). A recurrent neural field is constructed in such a way that its steady state has a shape similar to the tuning function and is noise free. The peak position of the steady state gives the estimator of the stimulus (Amari, 1977; Giese, 1999). When there is no external input to the network, the system is neutrally stable on a line attractor. A small input will cause the state to drift to the position corresponding to the estimator in the three methods. Throughout the article, the effect of correlation on decoding accuracy is discussed as a by-product of the calculations.

In this study, we consider only the case in which the stimulus is a single variable. Population coding also works in more complex cases, as studied by Eurich, Wilke, and Schwegler (2000), Treue, Hol, and Rauber (2000), Zemel, Dayan, and Pouget (1998), Zhang and Sejnowski (1999), and Zohary (1992); these require studying a high-dimensional neural field and are not in the scope of the work presented here.

The article is organized as follows. In section 2, we introduce a discrete encoding model, which is extended to a continuous version in section 3. In section 4, the Fisher information for the encoding model is calculated. In section 5, we compare three decoding methods. Their network implementations are discussed in section 6. Conclusions and a discussion are given in section 7.

2 Encoding Model
We begin with a discrete encoding model. Consider an ensemble of $N$ neurons coding a variable $x$, which represents the position of the stimulus. Let us denote by $c_i$ the preferred stimulus position of the $i$th neuron and by $r_i$ the response of that neuron, so that $\mathbf{r} = \{r_i\}$, for $i = 1, \ldots, N$, denotes the population activity. The neural responses are correlated, and the $i$th neuron's activity is given by

$$r_i = f_i(x) + \sigma \epsilon_i, \qquad i = 1, \ldots, N, \quad (2.1)$$

where $f_i(x)$ is the tuning function of the $i$th neuron, representing the mean value of the response when stimulus $x$ is applied, and $\sigma \epsilon_i$ is the noise. In this study, we consider only the gaussian tuning function, that is,

$$f_i(x) = \frac{1}{\sqrt{2\pi}\, a}\, e^{-(c_i - x)^2 / 2a^2}, \quad (2.2)$$

where the parameter $a$ is the tuning width.
The parameter $\sigma$ represents the noise intensity, and $\epsilon_i$ is a gaussian random variable with mean 0 and variance 1. We decompose $\epsilon_i$ as

$$\epsilon_i = \epsilon_i' + \epsilon_i'', \quad (2.3)$$

where the $\epsilon_i'$ are independent of all the others, with zero mean and variance $1 - \beta$, while $\epsilon_i''$ and $\epsilon_j''$ are correlated. We assume the gaussian correlation,

$$\langle \epsilon_i'' \epsilon_j'' \rangle = \beta\, e^{-(c_i - c_j)^2 / 2b^2}, \quad (2.4)$$

where $\langle \cdot \rangle$ denotes expectation over many trials. Then the noise satisfies

$$\langle \epsilon_i \rangle = 0, \quad (2.5)$$

$$\langle \epsilon_i \epsilon_j \rangle = A_{ij}, \quad (2.6)$$

where the covariance matrix is given by

$$A_{ij} = (1 - \beta)\, \delta_{ij} + \beta\, e^{-(c_i - c_j)^2 / 2b^2}. \quad (2.7)$$
The parameter $\beta$ satisfies $0 \le \beta \le 1$, and the width $b$ is called the effective correlation length. The model captures the fact that the correlation strength between neurons decreases with the dissimilarity in their preferred stimuli, $|c_i - c_j|$. The encoding process of population coding is fully specified by the conditional probability density of $\mathbf{r}$ when stimulus $x$ is given:

$$Q(\mathbf{r}|x) = \frac{1}{\sqrt{(2\pi\sigma^2)^N \det(A)}}\, \exp\Bigl[-\frac{1}{2\sigma^2} \sum_{ij} A_{ij}^{-1}\, (r_i - f_i(x))\, (r_j - f_j(x))\Bigr]. \quad (2.8)$$
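The encoding model of equations 2.1 to 2.8 is straightforward to simulate. The sketch below (plain Python; all function names and parameter values are ours, chosen for illustration) builds the covariance matrix $A_{ij}$ of equation 2.7, factorizes it with a small Cholesky routine, and draws correlated population responses $r_i = f_i(x) + \sigma \epsilon_i$:

```python
import math
import random

def tuning(c, x, a=1.0):
    """Gaussian tuning function, equation 2.2."""
    return math.exp(-(c - x) ** 2 / (2 * a * a)) / (math.sqrt(2 * math.pi) * a)

def covariance(centers, beta=0.5, b=1.0):
    """A_ij = (1 - beta) delta_ij + beta exp(-(c_i - c_j)^2 / 2 b^2), eq. 2.7."""
    n = len(centers)
    return [[(1.0 - beta) * (i == j) +
             beta * math.exp(-(centers[i] - centers[j]) ** 2 / (2 * b * b))
             for j in range(n)] for i in range(n)]

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def sample_responses(x, centers, sigma=0.1, beta=0.5, b=1.0, rng=random):
    """Draw one population response r_i = f_i(x) + sigma * eps_i, eq. 2.1,
    with correlated noise eps = L z for independent standard gaussians z."""
    L = cholesky(covariance(centers, beta, b))
    z = [rng.gauss(0.0, 1.0) for _ in centers]
    eps = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(centers))]
    return [tuning(c, x) + sigma * e for c, e in zip(centers, eps)]

centers = [i * 0.2 - 5.0 for i in range(51)]   # preferred stimuli on a grid
r = sample_responses(x=0.0, centers=centers)
```

With $0 < \beta < 1$ the matrix $A$ is a positive multiple of the identity plus a gaussian kernel matrix, so it is positive definite and the factorization is well defined.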
3 From Discrete to Continuous
Because we are interested only in the case in which the number $N$ of neurons is large (more accurately, in which the neuron density is large), it is useful to extend the discrete model to the continuous case. Mathematically, a continuous neural field model offers many benefits (Amari, 1977; Giese, 1999). For example, as shown later, many operations can be done much more easily in the Fourier domain due to the continuous extension.
Let us consider a one-dimensional neural field in which neurons are located with uniform density $\rho$. The activity of the neuron at position $c$ is denoted by $r(c)$. The neural response function $r(c)$ is given by

$$r(c) = f(c - x) + \sigma \epsilon(c), \quad (3.1)$$

when stimulus $x$ is applied, where the quantities $r(c)$, $f(c - x)$, and $\epsilon(c)$ are the counterparts of $r_i$, $f_i$, and $\epsilon_i$ in the discrete version. The tuning function $f(c - x)$ has the same form as $f_i(x)$, except that $c_i$ is replaced by $c$. The noise term $\epsilon(c)$ satisfies

$$\langle \epsilon(c) \rangle = 0, \quad (3.2)$$

$$\langle \epsilon(c)\, \epsilon(c') \rangle = h(c, c') / \rho^2, \quad (3.3)$$

where $h(c, c')$ is the covariance function scaled by the squared neuron density $\rho^2$. We assume that the covariance function has the same form as in the discrete case,

$$h(c, c') = D_1 (1 - \beta)\, \delta(c - c') + D_2\, \beta\, e^{-(c - c')^2 / 2b^2}, \quad (3.4)$$

where $\delta(c - c')$ is the delta function. In order to determine the coefficients $D_1$ and $D_2$, we use the correspondence principle: the covariance matrix $A_{ij}$ and the correlation function $h(c, c')$ correspond to each other so as to give the same quadratic form for an arbitrary vector $\mathbf{k} = (k_i)$ and its continuous version $k(c)$,

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} k(c)\, h(c, c')\, k(c')\, dc\, dc' = \sum_{ij} k_i A_{ij} k_j. \quad (3.5)$$

Substituting equations 2.7 and 3.4 into 3.5, we get (see appendix A)

$$h(c, c') = \rho (1 - \beta)\, \delta(c - c') + \rho^2 \beta\, e^{-(c - c')^2 / 2b^2}. \quad (3.6)$$

Therefore, the continuous form of the encoding process is

$$Q(\mathbf{r}|x) = \frac{1}{Z} \exp\Bigl\{-\frac{\rho^2}{2\sigma^2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [r(c) - f(c - x)]\, h^{-1}(c, c')\, [r(c') - f(c' - x)]\, dc\, dc'\Bigr\}, \quad (3.7)$$

where $\mathbf{r} = \{r(c)\}$ and $Z$ is the normalization factor. The function $h^{-1}(c, c')$ is the inverse kernel of $h(c, c')$, satisfying

$$\int_{-\infty}^{\infty} h^{-1}(c, c')\, h(c', c'')\, dc' = \delta(c - c''). \quad (3.8)$$
The above correlation model contains a number of important forms, depending on the correlation length:

No correlation: When the correlation length $b = 0$, neurons are uncorrelated, with the covariance function $h(c, c') = \rho (1 - \beta)\, \delta(c - c')$, or the correlation matrix $A_{ij} = (1 - \beta)\, \delta_{ij}$.

Local correlation: When the correlation length is of order $1/\rho$, that is, $b = m/\rho$ for a fixed $m$, neurons are correlated only within $m$ neighboring neurons. When the density $\rho$ is large, neurons are correlated only extremely locally.

Short-range correlation: When the correlation length is much longer than $1/\rho$ but shorter than $\sqrt{2}$ times the width of the tuning function, that is, $1/\rho \ll b < \sqrt{2}\, a$, neurons are correlated over a short range.¹

Wide-range correlation: When $b \ge \sqrt{2}\, a$, neurons are correlated over a wide range.

Uniform correlation: When the correlation length $b \to \infty$, neurons are uniformly correlated with strength $\beta$, that is, $h(c, c') = \rho (1 - \beta)\, \delta(c - c') + \rho^2 \beta$, or $A_{ij} = (1 - \beta)\, \delta_{ij} + \beta$.

4 The Fisher Information
The Fisher information is a useful measure in the study of population coding. Knowing the Fisher information, we have an idea of the minimum error one may achieve. The Fisher information for the encoding model $Q(\mathbf{r}|x)$ is defined as

$$I_F(x) = -\int Q(\mathbf{r}|x)\, \frac{d^2 \ln Q(\mathbf{r}|x)}{dx^2}\, d\mathbf{r}. \quad (4.1)$$

From equation 3.7, we get

$$I_F(x) = \frac{\rho^2}{\sigma^2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f'(c - x)\, h^{-1}(c, c')\, f'(c' - x)\, dc\, dc', \quad (4.2)$$

where $f'(c - x) = df(c - x)/dx$. Note that $I_F(x)$ does not depend on $x$, because of the homogeneity of the field. The Fourier transform of a function $g(t)$ is defined as

$$F[g(t)] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-i\omega t}\, g(t)\, dt. \quad (4.3)$$

¹ We consider only the case in which $a \gg 1/\rho$, as suggested by experiments; that is, neurons are broadly tuned to the stimulus.
By using the Fourier transformation, equation 4.2 becomes

$$I_F(x) = \frac{\rho^2}{2\pi\sigma^2} \int_{-\infty}^{\infty} \frac{\omega^2\, F(\omega)^2}{H(\omega)}\, d\omega, \quad (4.4)$$

where $F(\omega) = F[f(c - x)]$ and $H(\omega) = F[h(c - c')]$. Here, we use the relations $F[f'(c - x)] = i\omega F(\omega)$ and $F[h^{-1}(c, c')] = 1/H(\omega)$. From equations 2.2 and 3.6,

$$F(\omega) = e^{-a^2 \omega^2 / 2}, \quad (4.5)$$

$$H(\omega) = \rho (1 - \beta) + \rho^2 \sqrt{2\pi}\, \beta b\, e^{-b^2 \omega^2 / 2}, \quad (4.6)$$

where we put $x = 0$ without loss of generality. Therefore,

$$I_F(x) = \frac{\rho^2}{2\pi\sigma^2} \int_{-\infty}^{\infty} \frac{\omega^2\, e^{-a^2 \omega^2}}{\rho (1 - \beta) + \rho^2 \sqrt{2\pi}\, \beta b\, e^{-b^2 \omega^2 / 2}}\, d\omega. \quad (4.7)$$
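Equation 4.7 can be checked numerically. The sketch below (plain Python; the function names and the particular parameter values are ours) evaluates the integral with a simple midpoint rule and reproduces the behavior described in the text: the closed-form value $\rho/[4\sqrt{\pi}\, a^3 \sigma^2 (1 - \beta)]$ at $b = 0$, a sharp drop for small correlation widths, and a renewed increase once $b > \sqrt{2}\, a$:

```python
import math

def fisher_information(b, rho=100.0, a=1.0, sigma=1.0, beta=0.5,
                       w_max=30.0, dw=1e-3):
    """Numerically evaluate equation 4.7 by a midpoint rule over omega."""
    total = 0.0
    w = dw / 2.0
    while w < w_max:
        numer = w * w * math.exp(-a * a * w * w)
        denom = (rho * (1.0 - beta) +
                 rho * rho * math.sqrt(2.0 * math.pi) * beta * b *
                 math.exp(-b * b * w * w / 2.0))
        total += numer / denom * dw
        w += dw
    # the integrand is even in omega, so double the half-line integral
    return rho * rho / (2.0 * math.pi * sigma * sigma) * 2.0 * total

def fisher_no_correlation(rho=100.0, a=1.0, sigma=1.0, beta=0.5):
    """Closed form I_F = rho / [4 sqrt(pi) a^3 sigma^2 (1 - beta)] for b = 0."""
    return rho / (4.0 * math.sqrt(math.pi) * a ** 3 * sigma ** 2 * (1.0 - beta))

print(fisher_information(0.0))   # matches the closed form (about 28.2 for rho = 100)
print(fisher_information(1.0))   # short-range correlation: drastically smaller
print(fisher_information(2.5))   # wide-range correlation: increases again
```

The same routine also exhibits the saturation discussed below: for short-range correlation the result is nearly independent of $\rho$ once $\rho$ is large.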
Let us analyze the behavior of the Fisher information in the various regimes:

No correlation: The Fourier transform of the covariance function is $H(\omega) = \rho (1 - \beta)$, and the Fisher information $I_F(x) = \rho / [4\sqrt{\pi}\, a^3 \sigma^2 (1 - \beta)]$ increases in proportion to the neuron density $\rho$. Since each neuron carries independent information about $x$ and the total Fisher information is their sum, $I_F$ increases in proportion to $\rho$.

Local correlation: $H(\omega) = \rho\, (1 - \beta + \sqrt{2\pi}\, m \beta\, e^{-m^2 \omega^2 / 2\rho^2})$ is of order $\rho$. In this case, the total Fisher information $I_F$ is not a simple sum over the component neurons, but the correlations disappear so quickly that the total information is still proportional to $\rho$.

Short-range correlation: For large $\rho$, $H(\omega)$ is of order $\rho^2$, so that we have

$$H(\omega) \approx \rho^2 \sqrt{2\pi}\, \beta b\, e^{-b^2 \omega^2 / 2}. \quad (4.8)$$

Hence, $I_F$ can be written as

$$I_F \approx \frac{1}{(2\pi)^{3/2}\, \beta b\, \sigma^2} \int \omega^2\, e^{(-a^2 + b^2/2)\, \omega^2}\, d\omega. \quad (4.9)$$

Since $b < \sqrt{2}\, a$, the integral converges to a constant. Hence, $I_F$ remains finite even when $\rho$ goes to infinity. This result agrees with Abbott and Dayan (1999) and Yoon and Sompolinsky (1999) under the additive noise distribution.
Wide-range correlation: When $b \ge \sqrt{2}\, a$, the integral in equation 4.9 diverges, and we cannot use this approximation. If we evaluate the integral more accurately by taking the term of order $1/\rho$ into account, we have

$$I_F = \frac{1}{2\pi\sigma^2} \int \frac{\omega^2\, e^{-a^2 \omega^2}}{(1/\rho)(1 - \beta) + \sqrt{2\pi}\, \beta b\, e^{-b^2 \omega^2 / 2}}\, d\omega = \frac{\rho}{2\pi\sigma^2} \int \frac{\omega^2}{(1 - \beta)\, e^{a^2 \omega^2} + \sqrt{2\pi}\, \rho \beta b\, e^{(a^2 - b^2/2)\, \omega^2}}\, d\omega. \quad (4.10)$$

Hence, $I_F$ increases with $\rho$ for large $\rho$ (see Figure 2d).
Uniform correlation: $H(\omega) = \rho (1 - \beta)$, and $I_F(x) = \rho / [4\sqrt{\pi}\, a^3 \sigma^2 (1 - \beta)]$ is proportional to $\rho$; it has the same value as in the uncorrelated case.² In this case, the noise is decomposed as (Wu, Nakahara, et al., 2000; Wu et al., 2001)

$$\epsilon(c) = \epsilon'(c) + \epsilon'', \quad (4.11)$$

where $\epsilon'(c)$ is independent noise, $\langle \epsilon'(c)\, \epsilon'(c') \rangle = (1 - \beta)\, \delta(c - c') / \rho$, and $\epsilon''$ is common to all neurons, $\langle (\epsilon'')^2 \rangle = \beta$. The term $\epsilon''$ shifts all $r(c)$ by a common offset $\sigma \epsilon''$, which does not have any effect on decoding $x$.

Special case with $\beta = 1$: In order to make the situation clear, we consider this special case. Here, when $b < \sqrt{2}\, a$, the Fisher information is a constant not depending on $\rho$, similar to the case of short-range correlation. However, when $b \ge \sqrt{2}\, a$, the Fisher information diverges to $\infty$, as is seen from equation 4.7. This implies that $x$ can be decoded accurately. In this case, the noise $\epsilon(c)$ is so strongly correlated among neurons in an interval of length $b$ that $\epsilon(c)$ can be regarded as constant over this range. Since the range of width $b$ covers the effective range $a$ of the tuning function, the noise merely shifts the tuning function to $r(c) = f(c - x) + \sigma \epsilon$, so that we can decode $x$ from $r(c)$ as if there were no noise.

Figure 1 shows how the Fisher information behaves as the correlation width changes (with fixed neural density). It first decreases drastically when the correlation width is small and then increases again when the width is large. Figure 2 shows the asymptotic behavior of the Fisher information in the different correlation regimes as the density $\rho$ increases.

When a multiplicative noise model is used, the Fisher information always increases with the neural density $\rho$ (Wilke & Eurich, in press). This is important, because multiplicative noise is believed to be more biologically plausible. Our method can be extended to such a case.

² This conclusion is different from that in Abbott and Dayan (1999), where the uncorrelated case is defined as $A_{ij} = \delta_{ij}$ (by setting $\beta = 0$) instead of the $A_{ij} = (1 - \beta)\, \delta_{ij}$ used here (by setting $b = 0$).

Figure 1: The Fisher information as a function of the correlation width $b$, for neuron densities $\rho = 50$, $100$, and $200$. The parameters are $a = 1$, $\sigma = 1$, and $\beta = 0.5$.

5 Population Decoding
The Fisher information tells us only the optimal decoding accuracy for an unbiased estimator to achieve. When a practical decoding method is concerned, its performance needs to be evaluated individually, depending on the decoding model. We compare three decoding methods in this study. All are formulated as the MLI type, that is, as the maximizer of a likelihood function, whereas they differ in the probability models used for decoding.

5.1 Three Decoding Methods. An MLI-type estimator $\hat x$ is obtained through maximization of the presumed log likelihood $\ln P(\mathbf{r}|x)$, that is, by solving

$$\nabla \ln P(\mathbf{r}|\hat x) = 0, \quad (5.1)$$
Figure 2: The asymptotic behavior of the Fisher information in the different correlation regimes as the density $\rho$ increases. The parameters are $a = 1$, $\sigma = 1$, and $\beta = 0.5$. (a) No correlation ($b = 0$) and uniform correlation ($b = 100$, approximated as infinity); the two curves coincide. (b) Limited-range correlations, $b = 1/\rho$ and $b = 2/\rho$. (c) Short-range correlations, $b = 0.8$ and $b = 1$. (d) Wide-range correlations, $b = 2$ and $b = 2.5$.
where $\nabla k(x)$ denotes $dk(x)/dx$. $P(\mathbf{r}|x)$ is called the decoding model, which can be different from the real encoding model $Q(\mathbf{r}|x)$. This is because the decoding system usually does not know the exact details of the encoding system. Moreover, a simple and robust decoding model is computationally desirable. We consider three decoding models, defined as follows:

1. The conventional MLI, referred to as FMLI (MLI based on the faithful model), uses all of the encoding information; that is, the decoding model is the true encoding model,

$$P_F(\mathbf{r}|x) = Q(\mathbf{r}|x). \quad (5.2)$$

2. The UMLI method (MLI based on an unfaithful model; Wu, Nakahara, et al., 2000) uses the information on the shape of the tuning function
Si Wu, Shun-ichi Amari, and Hiroyuki Nakahara
but neglects the neural correlation, so that the probability density

  P_U(r|x) = (1/Z_U) exp{ −(ρ/2σ²) ∫_{−∞}^{∞} [r(c) − f(c − x)]² dc }   (5.3)

is used for decoding.

3. The center-of-mass (COM) method does not use any information about the encoding process; instead, it assumes an incorrect but simple tuning function. It also disregards correlations, by using

  P_C(r|x) = (1/Z_C) exp{ −(ρ/2σ²) ∫_{−∞}^{∞} [r(c) − f̃(c − x)]² dc },   (5.4)
where f̃(c − x) = −(x − c)² + const is used as the presumed tuning function.^3 It is easy to check that the third method is equivalent to the conventional COM decoding strategy (Baldi & Heiligenberg, 1988; Georgopoulos et al., 1982; Wu, Nakahara et al., 2000),^4 with the solution given by

  x̂ = ∫_{−∞}^{∞} c r(c) dc / ∫_{−∞}^{∞} r(c) dc.   (5.5)

We should point out that the use of an unfaithful model (e.g., UMLI or COM) has an important meaning. When experimental scientists reconstruct a stimulus from recorded data, they in fact use an unfaithful model, since the real encoding process is never known. Furthermore, the real neural correlation is often complex and may change constantly over time, so it is hard, if possible at all, for the brain to store and use all this information. MLI based on an unfaithful model (neglecting part of the information) is a key to escaping this curse of information.

5.2 Performance of Decoding and Asymptotic Efficiency. Since the three methods are of the same type, their decoding errors can be calculated in similar ways. We show only the derivation of the decoding error for UMLI; for FMLI, see appendix B.

3 In the strict sense, f̃(c − x) cannot be a tuning function, since its value diverges as (x − c) goes to infinity. This does not matter in practice, however, since (x − c) is always restricted to a finite region when COM is used. See Snippe (1996), Wu, Nakahara, et al. (2000), and Wu et al. (2001).

4 This explains why COM performs comparably well to MLI when neurons are uncorrelated and a cosine tuning function is used (Salinas & Abbott, 1994): the cosine function, for example, cos[(x − c)/T], has an approximately quadratic form when (x − c) is restricted to a small region.
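Equation 5.5 makes COM a one-shot estimator. As a concrete illustration (a minimal sketch; the grid, tuning width, and function names are my own, not the paper's), the discretized center of mass recovers the peak of a noiseless gaussian activity profile exactly when the sampling grid is symmetric about the peak:

```python
import numpy as np

def com_estimate(c, r):
    # discretized center of mass, eq. 5.5: x_hat = sum(c * r) / sum(r)
    r = np.clip(r, 0.0, None)  # guard against negative noise samples
    return np.sum(c * r) / np.sum(r)

# noiseless gaussian population activity peaked at x_true
a, x_true = 1.0, 0.7
c = np.linspace(x_true - 3, x_true + 3, 101)   # grid symmetric about the peak
r = np.exp(-(c - x_true) ** 2 / (2 * a ** 2))
print(abs(com_estimate(c, r) - x_true) < 1e-10)  # → True
```

With noisy responses the same one-shot formula applies unchanged, which is why COM carries essentially no computational cost compared to the iterative MLI variants.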
Population Coding and Decoding in a Neural Field
For convenience, we introduce two notations: E_Q[k(r, x)] and V_Q[k(r, x)] denote, respectively, the mean and variance of k(r, x) with respect to the distribution Q(r|x). Suppose x̂ is close enough to x. We expand ∇ ln P_U(r|x̂) at x,

  ∇ ln P_U(r|x̂) ≈ ∇ ln P_U(r|x) + ∇∇ ln P_U(r|x)(x̂ − x).   (5.6)

Since the estimator x̂ satisfies ∇ ln P_U(r|x̂) = 0,

  (1/ρ) ∇∇ ln P_U(r|x)(x̂ − x) ≈ −(1/ρ) ∇ ln P_U(r|x).   (5.7)
We put

  R = −(1/ρ) ∇ ln P_U(r|x) = (1/σ²) ∫_{−∞}^{∞} [r(c) − f(c − x)] f′(c − x) dc = (1/σ²) ∫_{−∞}^{∞} ε(c) f′(c − x) dc,   (5.8)

  S = −(1/ρ) ∇∇ ln P_U(r|x) = −(1/σ²) ∫_{−∞}^{∞} [r(c) − f(c − x)] f″(c − x) dc + (1/σ²) ∫_{−∞}^{∞} [f′(c − x)]² dc = −(1/σ²) ∫_{−∞}^{∞} ε(c) f″(c − x) dc + D,   (5.9)

where

  D = 1 / (4√π a³ σ²).   (5.10)

Then the estimating equation is

  S(x̂ − x) = −R,   (5.11)

or

  x̂ − x = −R/S.   (5.12)

Here, both R and S are random variables depending on ε(c). It is easy to show that

  E_Q[R] = 0,   (5.13)

  E_Q[S] = 1 / (4√π a³ σ²).   (5.14)
Their variances are given, using Fourier transforms, as

  V_Q[R] = (1/2πρ²σ²) ∫_{−∞}^{∞} v² F(v)² H(v) dv,   (5.15)

  V_Q[S] = (1/2πρ²σ²) ∫_{−∞}^{∞} v⁴ F(v)² H(v) dv.   (5.16)

These show that R is a zero-mean gaussian random variable. The random variable S is composed of two terms, the first one random and the second, D, a fixed constant (see the right-hand side of equation 5.9).

Remark. The above procedure is the standard way to analyze the asymptotic error of estimation. In the standard statistical model with repeated independent observations, that is, the independent and identically distributed (i.i.d.) case, R is gaussian, while the constant term D dominates over the random one because of the law of large numbers. In particular, when the faithful model is used (FMLI), D is the Fisher information, and so is V_Q[R]. Therefore, the asymptotic error is given by 1/N times the inverse of the Fisher information, where N is the number of observations. This shows that the Cramér-Rao bound is asymptotically attained. Such an estimator is said to be asymptotically efficient, or Fisher efficient. When an unfaithful model is applied (still in the i.i.d. case), V_Q[R] and D differ, and hence Fisher efficiency is not attained in general (Akahira & Takeuchi, 1981; Murata, Yoshizawa, & Amari, 1994). However, the asymptotic gaussianity of the estimator and the 1/N convergence of the error are guaranteed. In this article, we say that such an estimator is quasi-asymptotically efficient, or quasi-Fisher efficient.

Apart from the above two standard cases, population coding includes a third one (because of correlation), in which the estimator is not subject to the gaussian distribution, although the asymptotic error may be small (Wu et al., 2001). This unusual case occurs when the random term in S is not negligible compared to the constant D. In such a situation, the estimator is represented by a ratio of two gaussian random variables, so it is subject to a Cauchy-type distribution. The 1/N convergence of the error does not hold either. Such an estimator is said to be non-Fisherian, since the Cramér-Rao paradigm based on the Fisher information does not hold. This is a new fact in the population coding literature.

Let us return to calculating the UMLI decoding error. From equations 5.10 and 5.16, we see that the constant term in S dominates over the random one
in two cases: (1) H(v) is of order ρ; in this case, the random term is of O(1/√ρ), and D is of O(1). (2) H(v) is of order ρ², but the noise variance σ² is sufficiently small; in this case, the random term is O(1/σ), and the constant term D is O(1/σ²). The first case corresponds to the uncorrelated, local-range, and uniformly correlated cases. The second case holds for the short- and wide-range correlations with small noise.^5 In these cases, we may neglect the random term in S, so that asymptotically

  S ≈ D,   (5.17)

  x̂ − x ≈ −R/D,   (5.18)

which is normally distributed with zero mean and variance

  E_Q[(x̂ − x)²] = (8a⁶σ²/ρ²) ∫_{−∞}^{∞} v² F(v)² H(v) dv.   (5.19)

The above result holds asymptotically when ρ → ∞ and H(v) is of order ρ, or when σ → 0.

Consider now the cases of short- and wide-range correlations when the noise is strong. Since H(v) is of order ρ², the random and constant terms in the variable S are of the same order, and it is rather difficult to analyze the behavior of x̂. Equation 5.12 shows that x̂ − x is a ratio of two gaussian random variables, so that its distribution is of the Cauchy type, which means that UMLI is non-Fisherian. We note that since the neural field is finite in practice, the variance of the decoding error does not diverge even in the Cauchy case; the variance is, however, no longer an adequate measure of decoding accuracy.

A similar analysis is applicable to FMLI (see appendix B) and COM. It turns out that they have the same asymptotic behaviors. Table 1 summarizes the asymptotic behaviors of the three decoding methods and of the Fisher information in the different correlation cases (the special case of weak noise, in which the MLI type of method is always approximately asymptotically or quasi-asymptotically efficient, is not included).
5 Note that the weak noise assumption is used in Deneve et al. (1999) and Pouget et al. (1998) to get their results.
Table 1: The Asymptotic Behaviors of the Fisher Information and the Three MLI-Type Decoding Methods.

                             I_F          FMLI     UMLI     COM
  No correlation             ∝ ρ          FE       QFE      QFE
  Local correlation          ∝ ρ          FE       QFE      QFE
  Short-range correlation    Saturating   Non-F    Non-F    Non-F
  Wide-range correlation     ∝ ρ          Non-F    Non-F    Non-F
  Uniform correlation        ∝ ρ          FE       QFE      QFE

Notes: The special case of weak noise is excluded. FE = Fisher efficient; QFE = quasi-Fisher efficient; Non-F = non-Fisherian.
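The non-Fisherian entries in Table 1 reflect the fact that x̂ − x becomes a ratio of two gaussian random variables (equation 5.12). A quick numerical illustration (entirely schematic; the scales of the random parts are made up) shows how the error distribution acquires Cauchy-like tails once the random part of the denominator becomes comparable to the constant D:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
R = rng.normal(0.0, 1.0, n)   # zero-mean gaussian numerator, as in eq. 5.12

D = 1.0
# Fisherian-like regime: the random part of S is negligible next to D
err_gauss = -R / (D + 0.05 * rng.normal(0.0, 1.0, n))
# non-Fisherian regime: the random part of S is comparable to D
err_cauchy = -R / (D + 1.0 * rng.normal(0.0, 1.0, n))

q = 0.999  # heavy tails show up in the extreme quantiles, not in the bulk
print(np.quantile(np.abs(err_gauss), q) < 5.0)    # → True
print(np.quantile(np.abs(err_cauchy), q) > 20.0)  # → True
```

The bulk of the two error distributions looks similar; the difference is concentrated in the extreme quantiles, which is why the variance stops being a meaningful error measure in the non-Fisherian regime.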
Table 2: Comparing the Decoding Errors of FMLI, UMLI, and COM When H(v) Is of Order ρ.

             FMLI                           UMLI                           COM
  b → ∞      4√π a³σ²(1 − b)/ρ              4√π a³σ²(1 − b)/ρ              18a³σ²(1 − b)/ρ
  b = 0      4√π a³σ²(1 − b)/ρ              4√π a³σ²(1 − b)/ρ              18a³σ²(1 − b)/ρ
  b = m/ρ    4√π a³σ²[1 + (√(2π)m − 1)b]/ρ  4√π a³σ²[1 + (√(2π)m − 1)b]/ρ  18a³σ²[1 + (√(2π)m − 1)b]/ρ
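The closed-form entries of Table 2 can be cross-checked against equation 5.19. Assuming (as the normalization D = 1/(4√π a³σ²) in equation 5.10 suggests) a gaussian tuning curve f(c) = exp(−c²/2a²)/(√(2π)a), its Fourier transform is F(v) = exp(−a²v²/2); with no correlation, H(v) = ρ (equation A.5 with b = 0). A numerical quadrature of equation 5.19 then reproduces the b = 0 entry, 4√π a³σ²/ρ (a sketch under those stated assumptions; the parameter values are arbitrary):

```python
import numpy as np

a, sigma, rho = 1.0, 0.2, 100.0

v = np.linspace(-40.0, 40.0, 200_001)
dv = v[1] - v[0]
F = np.exp(-a ** 2 * v ** 2 / 2)   # Fourier transform of the gaussian tuning curve
H = rho * np.ones_like(v)          # H(v) = rho when b = 0 (no correlation)

# equation 5.19, evaluated by a Riemann sum (the integrand is ~0 at the edges)
err_eq519 = 8 * a ** 6 * sigma ** 2 / rho ** 2 * np.sum(v ** 2 * F ** 2 * H) * dv
err_table2 = 4 * np.sqrt(np.pi) * a ** 3 * sigma ** 2 / rho   # Table 2, b = 0 entry

print(abs(err_eq519 - err_table2) / err_table2 < 1e-6)   # → True
```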
When FMLI and COM are asymptotically efficient, their decoding errors are calculated to be

  E_Q[(x̂ − x)²_FMLI] ∼ 2πσ² / (ρ² ∫_{−∞}^{∞} v² F(v)² / H(v) dv) = 1/I_F,   (5.20)

  E_Q[(x̂ − x)²_COM] ∼ (σ²/ρ²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} c h(c, c′) c′ dc dc′,   (5.21)
respectively.

5.2.1 Performance Comparison. We now compare the performances of the three decoding methods. Table 2 lists their decoding errors when H(v) is of order ρ.^6 We see that UMLI and FMLI have comparable performance; both are much better than COM.

Figure 3 compares the decoding errors of FMLI, UMLI, and COM in the case of nonlocal-range correlation and weak noise (note that FMLI, UMLI, and COM are asymptotically or quasi-asymptotically efficient in this case

6 Two conditions are used to obtain the results in Table 2. (1) To calculate the decoding error of COM, we need to restrict the preferred stimulus c to a finite range [−L, L]; otherwise, the error diverges. This restriction amounts to sampling only those neurons that are sufficiently active. The approximation benefits only COM and does not much affect the results of FMLI and UMLI (see Wu, Nakahara et al., 2000; Wu et al., 2001). In this article, we choose L = 3a. (2) To get the results for the case b = m/ρ, we use the condition a ≫ 1/ρ.
[Figure 3 about here.]

Figure 3: Comparing the decoding errors of FMLI, UMLI, and COM in the case of strong correlation and weak noise. The unit of decoding error in the figure is σ². Parameters are a = 1 and b = 0.5. (a) Decoding errors as a function of the neuron density ρ, with correlation length b = 1. (b) Decoding errors as a function of the correlation length b, with neural density ρ = 50.
because of weak noise). It shows that UMLI has a larger error than FMLI and a smaller error than COM. When the correlation covers a short range of the population, the decoding errors of the three methods saturate as the neural density increases (see Figure 3a), similar to the behavior of the Fisher information. For a fixed neural density ρ, the decoding errors of the three methods first increase with the correlation length and then decrease (see Figure 3b). This is understandable, since the extremes at the two ends correspond to neurons being either uncorrelated or uniformly correlated.

The computational complexity of the three methods can be roughly compared as follows. Consider maximizing the log likelihood of UMLI or FMLI by the standard gradient-descent method. The amounts of computation for obtaining the derivative of the log likelihood are proportional to N for UMLI and to N² for FMLI, so UMLI is significantly simpler than FMLI when N is large. For COM, thanks to the quadratic form of the tuning function, the estimation can be done in one shot by equation 5.5. Therefore, UMLI is a good compromise between decoding accuracy and computational complexity.
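The O(N)-per-step cost of UMLI can be made concrete with a toy gradient ascent on the unfaithful log likelihood of equation 5.3 (an illustrative sketch with made-up grid, step size, and gaussian tuning curve; it is not the paper's simulation protocol):

```python
import numpy as np

def umli_decode(c, r, a=1.0, x0=0.0, lr=0.5, steps=300):
    # gradient ascent on ln P_U(r|x) of eq. 5.3; each step costs O(N)
    x = x0
    for _ in range(steps):
        f = np.exp(-(c - x) ** 2 / (2 * a ** 2))   # presumed tuning curve
        dfdx = f * (c - x) / a ** 2                # derivative of f(c - x) w.r.t. x
        x += lr * np.sum((r - f) * dfdx) / len(c)  # ascend the log likelihood
    return x

c = np.linspace(-3, 3, 101)
x_true = 0.5
r = np.exp(-(c - x_true) ** 2 / 2)   # noiseless responses for a quick sanity check
print(abs(umli_decode(c, r) - x_true) < 1e-3)   # → True
```

A faithful-model (FMLI) version would additionally multiply the residuals by the inverse correlation, turning each gradient evaluation into an O(N²) matrix-vector product.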
6 Network Implementation
So far, we have been concerned only with the accuracy and simplicity of decoding methods. To be realistic, it is essential that the strategy also be biologically achievable. We investigate implementing the three methods using a recurrent network, following the ideas in Deneve et al. (1999), Pouget and Zhang (1997), and Pouget et al. (1998).

UMLI is studied first. Consider a fully connected, one-dimensional, homogeneous neural field, in which c denotes the position coordinate. Let U_c denote the (average) internal state of the neurons at c and W_{c,c′} the recurrent connection weight from the neurons at c to those at c′. We propose that the dynamics of neural excitation is governed by

  dU_c/dt = −U_c + ∫ W_{c,c′} O_{c′} dc′ + I_c,   (6.1)

where

  O_c = U_c² / (1 + μ ∫ U_{c′}² dc′).   (6.2)

Here O_c is the activity of the neurons at c, and I_c is the external input arriving at c. The recurrent interaction is assumed to be

  W_{c,c′} = e^{−(c−c′)²/2a²}.   (6.3)
We first examine the network dynamics when there is no external input, I_c = 0. It is not difficult to check that the network has a one-parameter family of nontrivial steady states,

  Õ_c = A e^{−(c−z)²/2a²},   (6.4)

with the corresponding internal state,

  Ũ_c = B e^{−(c−z)²/4a²},   (6.5)

with z as a free parameter; the coefficients A and B are easily determined. The parameter z denotes the peak position of the population activity {Õ_c}. Note that the stable state has a shape similar to the tuning function (see equation 2.2) in this case. Equation 6.4 or 6.5, parameterized by z, defines a line attractor for the network, on which the system is neutrally stable. This is due to the translation invariance of the network interactions (Amari, 1977; Deneve et al., 1999; Giese, 1999; Seung, 1996; Zhang, 1996).
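The bump states of equations 6.4 and 6.5 can be observed directly in a discretized simulation of equations 6.1 through 6.3 (a rough sketch with sums in place of integrals and hand-picked parameters; with μ = 0.5 and this scaling the field supports a nontrivial bump, but these discretization choices are mine, not the paper's):

```python
import numpy as np

a, mu, dt = 1.0, 0.5, 0.05
c = np.linspace(-3, 3, 101)
W = np.exp(-(c[:, None] - c[None, :]) ** 2 / (2 * a ** 2))   # eq. 6.3

rng = np.random.default_rng(1)
U = np.exp(-c ** 2 / (2 * a ** 2)) + 0.1 * rng.normal(size=c.size)  # noisy start

for _ in range(4000):                       # relax with no external input, I_c = 0
    O = U ** 2 / (1 + mu * np.sum(U ** 2))  # divisive normalization, eq. 6.2
    U = U + dt * (-U + W @ O)               # field dynamics, eq. 6.1

# the relaxed state should match the gaussian profile of eq. 6.5
z = c[np.argmax(U)]
profile = U.max() * np.exp(-(c - z) ** 2 / (4 * a ** 2))
print(np.max(np.abs(U - profile)) / U.max() < 0.1)   # → True
```

The initial noise is smoothed away during relaxation, while the peak position z, the neutrally stable coordinate along the attractor, retains the information about the stimulus.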
The decoded estimator x̂ is given by the peak position of the final population activity O_c, that is, by the value of z. To achieve this, one considers a transient external input, I_c ∼ r_c δ(t). This sets the initial state of the network equal to the noisy observation, O_c(0) = r_c (Pouget et al., 1998; Deneve et al., 1999). After relaxation, the system reaches the desired state, in which the noise is smoothed out and the peak position gives x̂. However, if I_c disappears, the attractor of the field is only neutrally stable, so the peak position may fluctuate. Hence, we consider a small input I_c = εr_c that persists after the initial state is set.^7 We further assume that the input is sufficiently small (ε small enough) that the change it brings to the form of the stable state (see equation 6.4) is negligible.

Instead of being trapped in complicated mathematical calculation, we adopt an approximate but simple way of understanding the above network dynamics, backed up by simulation results. Since the steady state of the network is assumed always to be on the line attractor and its position is determined by the input, independent of the initial value, we can, without losing any information, view the network as being on the line attractor from the beginning, at a random position, and evolving along it until it reaches a stable position. This modified dynamical picture simplifies the relationship between O_c and U_c to

  O_c = D U_c²,   (6.6)

where equations 6.4 and 6.5 are used, and D = A/B². For the dynamics specified by equations 6.1 and 6.6, there exists a Lyapunov function (Cohen & Grossberg, 1983),

  L = −(1/2) ∫_{−∞}^{∞} ∫_{−∞}^{∞} W_{c,c′} O_c O_{c′} dc dc′ + ∫_{−∞}^{∞} ( ∫_0^{U_c} z g′(z) dz ) dc − ε ∫_{−∞}^{∞} r_c O_c dc,   (6.7)
with the function g(z) = Dz². We consider a two-step perturbation procedure to minimize the Lyapunov function.

7 This mechanism was proposed in Amari (1977) and Pouget and Zhang (1997).

[Figure 4 about here.]

Figure 4: Performance of the recurrent network. The parameters are a = 1, μ = 0.5, σ² = 0.01, and b = 0.5. (a) Typical states of the network before and after relaxation (tuning function, initial state, and steady state); the steady state is scaled to match the tuning function. (b) Comparing the decoding errors of the network estimation and UMLI. The results are obtained after averaging over 100 trials.

Minimizing the first two terms of equation 6.7, which are
of order 1, determines that the stable state lies on the line attractor. Minimizing the third term, of order ε, determines the peak position of the stable state. A justification of this two-step perturbation procedure is given in appendix C. The two-step optimization is equivalent to

  min_z  −∫_{−∞}^{∞} r_c O_c dc,   subject to O_c = A e^{−(c−z)²/2a²}.   (6.8)

Recall that the solution of UMLI is given by

  max_x  ln P_U(r|x) = (ρ/σ²) ∫_{−∞}^{∞} r_c f(c − x) dc + terms not depending on x.   (6.9)

Comparing equation 6.8 with equations 6.9 and 2.2, we see that the final state of the network gives the same estimator as UMLI.

The above result is confirmed by a simulation experiment (see Figure 4), carried out with 101 neurons uniformly distributed in the region [−3, 3] and a true stimulus value of zero. Figure 4a shows the
typical behaviors of the recurrent network before and after relaxation. The steady state of the network becomes smooth after the noise in the initial state is cleaned out. To see the agreement between the two methods, we measure a quantity t, defined as t = ⟨(x̂ − ẑ)²⟩ / √(V_Q[x̂] V_Q[ẑ]), where x̂ and ẑ denote the estimates of UMLI and the network, respectively, V_Q[x̂] and V_Q[ẑ] are their variances, and ⟨·⟩ denotes averaging over many trials. This quantity measures the statistical difference between the two estimators: the smaller the value of t, the more closely the two methods agree. In the extreme case x̂ = ẑ on every trial, t is zero. If the two estimators are completely independent of each other and have zero means, as in this example, t is 2. Figure 4b shows that t is quite small in all correlation cases (the largest value is 0.032), which means that the network estimate can be regarded as the same as that of UMLI.

In a similar way, we can show that FMLI can be implemented by the same recurrent neural field (since both methods use the same tuning function to fit the data) but with a different external input, I_c = ε ∫ h^{−1}_{c,c′} r_{c′} dc′ (see appendix D). This result differs from that in Deneve et al. (1999), where FMLI and UMLI are implemented using different recurrent interactions. To implement COM, the external input is the same as in UMLI (since both methods discard the correlation), but a different form of network interaction is needed to ensure that the line attractor has the same shape as the corresponding tuning function.

It is interesting to compare the complexity of the three methods in terms of network implementation. Obviously, FMLI is more complicated than UMLI, since it uses the neural correlation. COM and UMLI are at about the same level.
7 Conclusion and Discussion
We have proposed a new unified encoding field model for population coding, in which neural responses are correlated through a gaussian correlation function. This model serves as a good prototype for theoretical study: it simplifies the calculations and provides a clear picture of results that are often obscure when other models are used. Based on the proposed model, we calculate the Fisher information and elucidate its asymptotic behavior for various correlation lengths. We confirm that when the correlation covers a short range of the population, the Fisher information saturates and no longer increases in proportion to the number of neurons. Moreover, we prove that when the correlation covers a wide range of the population, the Fisher information again increases without limit, in proportion to the number of neurons. This finding, together with others, calls for experimental investigation of the range of neural correlation in the brain.
Three decoding methods are compared in this study. All are formulated as the MLI type, including the conventional COM method, but they differ in how much knowledge of the encoding process they use. It turns out that UMLI, which uses the shape of the tuning function but neglects the neural correlation, stands out for its good balance between computational complexity and decoding accuracy. Furthermore, we investigate the network implementation of the three methods and show that UMLI and FMLI can be achieved by the same recurrent network with different external inputs.

We also clarify the asymptotic efficiency (Fisher efficiency) of the MLI type of decoding methods for correlated signals. It is proved that when the neural correlation covers a nonlocal range of the population (excluding the cases of uniform correlation and weak noise), the MLI type of method is non-Fisherian. The Cramér-Rao bound is not achievable in this case, and hence one should be careful when carrying out analyses based on the Fisher information. It is also important to be aware of the quasi-asymptotic efficiency of MLI based on unfaithful models. Only when quasi-asymptotic efficiency is ensured can one calculate the decoding errors of UMLI and COM by equations 5.19 and 5.21, respectively; otherwise, the decoding errors are subject to a Cauchy-type distribution and are difficult to quantify.

An interesting case not studied in this article is multiplicative correlation, in which the correlation strength depends on the firing rates (Abbott & Dayan, 1999; Nakahara & Amari, in press; Wilke & Eurich, in press). To cope with this situation, we can, for example, extend the correlation matrix A_ij in equation 2.7 to a new one, A′_ij = f_i^a(x) A_ij f_j^a(x). In the case a = 1 and b → ∞, this reduces to A′_ij = f_i(x)[(1 − b)δ_ij + b] f_j(x), which is the case studied by Abbott and Dayan (1999) and Wu et al. (2000).

The continuous version of A′_ij in a neural field is h′(c, c′, x) = f^a(x − c) h(c, c′) f^a(x − c′), where h(c, c′) is given by equation 3.6, and its inverse is (h′)^{−1}(c, c′, x) = f^{−a}(x − c) h^{−1}(c, c′) f^{−a}(x − c′). We can calculate the Fisher information and the performance of the MLI type of decoding methods much as we have done in this article; the results will be reported in a future publication.

Finally, we should point out that a nonlocal-range correlation does not imply the failure of MLI-type methods. Besides the counterexample of uniform correlation, another is multiplicative correlation, for example, A_ij = [δ_ij + c(1 − δ_ij)] f_i(x) f_j(x). It has been proved that the MLI-type method is asymptotically efficient in this case, although the correlation covers the whole range of the population (Wu, Chen, & Amari, 2000). There is an intuitive way to understand the reason (and similarly for uniform correlation). The idea is to decompose the fluctuations of the neural responses into two parts, r_i − f_i(x) = σ f_i(x)(ξ + ε_i), where the common factor ξ and the {ε_i}, for i = 1, ..., N, are independent random variables having zero mean and variances c and 1 − c, respectively. Neurons are correlated through the common factor ξ,
which can be calculated as ξ = Σ_i [r_i − f_i(x)] / [σ Σ_i f_i(x)]. This information can be used when MLI is performed.

Appendix A: The Covariance Function of the Continuous Encoding Model
Without loss of generality, we consider the preferred stimuli c_i to be uniformly distributed in a range [−L/2, L/2], that is,

  c_i = −L/2 + (L/N) i,   for i = 1, ..., N.   (A.1)
By choosing a particular form of {k_i}, for example, k_i = 1 for i = 1, ..., N, equation 3.3 becomes

  ∫_{−L/2}^{L/2} ∫_{−L/2}^{L/2} h(c, c′) dc dc′ = Σ_ij A_ij,   (A.2)

which has an intuitive meaning: the total correlation is preserved under the continuous extension. From equation 3.2,

  ∫_{−L/2}^{L/2} ∫_{−L/2}^{L/2} h(c, c′) dc dc′ = D₁(1 − b)L + D₂ b √(2π) bL.   (A.3)

From equation 2.6,

  Σ_ij A_ij = N(1 − b) + N² b √(2π) b / L,   (A.4)

in the large-N limit. Combining equations A.3 and A.4, we get

  h(c, c′) = ρ(1 − b) δ(c − c′) + ρ² b e^{−(c−c′)²/2b²},   (A.5)

where ρ = N/L is the neuron density.
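Equation A.4 is easy to verify numerically. In the sketch below (my own discretization; note that the text uses b for both the correlation strength and the correlation length, so the length is renamed blen here), the double sum over a matrix A_ij = (1 − b)δ_ij + b exp(−(c_i − c_j)²/2 blen²) matches the large-N expression up to boundary effects of order blen/L:

```python
import numpy as np

N, L, b, blen = 2000, 100.0, 0.5, 1.0
c = -L / 2 + L * np.arange(1, N + 1) / N        # preferred stimuli, eq. A.1
A = (1 - b) * np.eye(N) \
    + b * np.exp(-(c[:, None] - c[None, :]) ** 2 / (2 * blen ** 2))

lhs = A.sum()                                   # sum_ij A_ij, computed directly
rhs = N * (1 - b) + N ** 2 * b * np.sqrt(2 * np.pi) * blen / L   # eq. A.4
print(abs(lhs - rhs) / rhs < 0.02)   # → True
```

The small residual discrepancy comes from the gaussian tails that fall outside [−L/2, L/2]; it shrinks as blen/L decreases.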
Appendix B: The Performance of FMLI
The performance of FMLI can be analyzed in the same way as that of UMLI. Expanding ∇ ln Q(r|x̂) at x and using the condition ∇ ln Q(r|x̂) = 0, we obtain the estimating equation

  x̂ − x = −R_F/S_F,   (B.1)
where the variables R_F and S_F are defined as

  R_F = −(1/ρ²) ∇ ln Q(r|x) = (1/σ²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} ε(c) h^{−1}(c, c′) f′(c′ − x) dc dc′,   (B.2)

  S_F = −(1/ρ²) ∇∇ ln Q(r|x) = −(1/σ²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} ε(c) h^{−1}(c, c′) f″(c′ − x) dc dc′ + D_F,   (B.3)

where

  D_F = (1/σ²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f′(c − x) h^{−1}(c, c′) f′(c′ − x) dc dc′.   (B.4)
It is easy to check that the mean values of R_F and S_F are zero and D_F, respectively. Their variances and the value of D_F are given, using Fourier transforms, as

  V_Q[R_F] = (1/2πρ²σ²) ∫_{−∞}^{∞} v² F(v)² / H(v) dv,   (B.5)

  V_Q[S_F] = (1/2πρ²σ²) ∫_{−∞}^{∞} v⁴ F(v)² / H(v) dv,   (B.6)

  D_F = (1/2πσ²) ∫_{−∞}^{∞} v² F(v)² / H(v) dv.   (B.7)

These show that R_F is a zero-mean gaussian variable. The random variable S_F is composed of two terms, the first one random and the second, D_F, a constant. The constant term dominates over the random one in two cases: (1) H(v) is of order ρ; in this case, the random term is O(1/ρ^{3/2}), and the constant one is O(1/ρ). (2) H(v) is of order ρ², but the noise variance σ² is sufficiently small; in this case, the random term is O(1/σ), and the constant term is O(1/σ²). The first case corresponds to the uncorrelated, local-range, and uniformly correlated cases. The second case holds for the short- and wide-range correlations with small noise. In these cases, we may neglect the random term in S_F, so
that we have asymptotically

  S_F ≈ D_F,   x̂ − x ≈ −R_F/D_F,   (B.8)

which is normally distributed with zero mean and variance

  E_Q[(x̂ − x)²_FMLI] ∼ 2πσ² / (ρ² ∫_{−∞}^{∞} v² F(v)² / H(v) dv).   (B.9)

In the cases of short- and wide-range correlations with strong noise, the random and constant terms in S_F are of the same order. Equation B.1 shows that the distribution of x̂ − x is then of the Cauchy type: FMLI is not asymptotically efficient in this case.

Appendix C: Minimizing the Lyapunov Function
When ε = 0, the solution minimizing a Lyapunov function of the form 6.7 (only the first two terms being involved) is

  U_c = Ũ_c = B e^{−(c−z)²/4a²},   O_c = Õ_c = A e^{−(c−z)²/2a²},   (C.1)

for any value of z. When a small input εr_c (ε → 0) is added, the solution becomes uniquely determined and has a form slightly deviating from equation C.1, which can be approximated (to first order in ε) as

  U_c ≈ Ũ_c + εE_c,   O_c ≈ Õ_c + ε(2DŨ_c E_c),   (C.2)
where εE_c and ε(2DŨ_c E_c) denote the deviations from the line attractor. Substituting equation C.2 into 6.7, we get

  L = −(1/2) ∫_{−∞}^{∞} ∫_{−∞}^{∞} W_{c,c′} Õ_c Õ_{c′} dc dc′ + ∫_{−∞}^{∞} ( ∫_0^{Ũ_c} z g′(z) dz ) dc − ε ∫_{−∞}^{∞} r_c Õ_c dc + terms of order ε² and higher.   (C.3)

Note that the deviation of the waveform of the solution from the line attractor contributes to the Lyapunov function only at order ε² or
higher. The contribution at first order in ε, however, is determined by the overlap between r_c and Õ_c (the third term in equation C.3). Minimizing this term determines the position of the stable state on the line attractor and gives the estimate of the stimulus. The above procedure can be carried out formally by the two-step perturbation procedure described in the text.

Appendix D: The Network Implementation of FMLI
For FMLI, we consider an implementation using the same recurrent neural field as for UMLI but with a different external input,

  I_c = ε ∫ h^{−1}_{c,c′} r_{c′} dc′.   (D.1)

Following the same line of reasoning as for UMLI, we obtain a Lyapunov function,

  L = −(1/2) ∫_{−∞}^{∞} ∫_{−∞}^{∞} W_{c,c′} O_c O_{c′} dc dc′ + ∫_{−∞}^{∞} ( ∫_0^{U_c} z g′(z) dz ) dc − ε ∫_{−∞}^{∞} ∫_{−∞}^{∞} O_c h^{−1}_{c,c′} r_{c′} dc dc′,   (D.2)
which can be approximately minimized by the two-step perturbation procedure, with the solution determined by

  min_z  −∫_{−∞}^{∞} ∫_{−∞}^{∞} O_c h^{−1}_{c,c′} r_{c′} dc dc′,   subject to O_c = A e^{−(c−z)²/2a²}.   (D.3)

Recall that the solution of FMLI is given by

  max_x  ln P_F(r|x) = (ρ²/σ²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(c − x) h^{−1}_{c,c′} r_{c′} dc dc′ + terms not depending on x.   (D.4)

Comparing equations D.3 and D.4, we see that the network estimate is the same as that of FMLI.

Acknowledgments
We thank the two anonymous reviewers for their valuable comments. S. W. acknowledges helpful discussions with Peter Dayan. H. N. is supported by Grants-in-Aid 11780589 and 13210154 from the Ministry of Education, Japan.
References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101.
Akahira, M., & Takeuchi, K. (1981). Asymptotic efficiency of statistical estimation: Concepts and high order asymptotic efficiency. Berlin: Springer-Verlag.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.
Cohen, M., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. SMC, 13, 815–826.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2, 740–745.
Eurich, C. W., Wilke, S. D., & Schwegler, H. (2000). Neural representation of multi-dimensional stimuli. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 115–121). Cambridge, MA: MIT Press.
Fetz, E., Yoyama, K., & Smith, W. (1991). Synaptic interactions between cortical neurons. In A. Peters & E. G. Jones (Eds.), Cerebral cortex, 9. New York: Plenum Press.
Gawne, T. J., & Richmond, B. J. (1993). How independent are messages carried by adjacent inferior temporal cortical neurons? J. Neuroscience, 13, 2758–2771.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2, 1527–1537.
Giese, M. A. (1999). Dynamic neural field theory for motion perception. Norwell, MA: Kluwer Academic.
Johnson, K. O. (1980). Sensory discrimination: Neural processes preceding discrimination decision. J. Neurophysiology, 43, 1793–1815.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J. Neuroscience, 18, 1161–1170.
Maunsell, J. H. R., & Van Essen, D. C. (1983). Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J. Neurophysiology, 49, 1127–1147.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion: Determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Nakahara, H., & Amari, S. (in press). Attention modulation of neural tuning through peak and base rate in correlated firing. Neural Networks.
Nakahara, H., Wu, S., & Amari, S. (2001). Attention modulation of neural tuning through peak and base rate. Neural Computation, 13, 2031–2047.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pouget, A., & Zhang, K. (1997). Statistically efficient estimation using cortical lateral connections. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 97–103). Cambridge, MA: MIT Press.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. J. Comp. Neurosci., 1, 89–107.
Seung, H. S. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93, 13339–13344.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90, 10749–10753.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–529.
Snippe, H. P., & Koenderink, J. J. (1992). Information in channel-coded systems: Correlated receivers. Biological Cybernetics, 67, 183–190.
Treue, S., Hol, K., & Rauber, H.-J. (2000). Seeing multiple directions of motion: Physiology and psychophysics. Nature Neuroscience, 3, 270–276.
Wilke, S. D., & Eurich, C. W. (in press). Representational accuracy of stochastic neural populations. Neural Computation.
Wu, S., Chen, D., & Amari, S. (2000). Unfaithful population decoding. In Proceedings of the International Joint Conference on Neural Networks.
Wu, S., Nakahara, H., & Amari, S. (2001). Population coding with correlation and an unfaithful model. Neural Computation, 13, 775–797.
Wu, S., Nakahara, H., Murata, N., & Amari, S. (2000). Population decoding based on an unfaithful model. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 192–198). Cambridge, MA: MIT Press.
Yoon, H., & Sompolinsky, H. (1999). The effect of correlations on the Fisher information of population codes. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 167–173). Cambridge, MA: MIT Press.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpolation of population codes. Neural Computation, 10, 403–430.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neuroscience, 16, 2112–2126.
Zhang, K., & Sejnowski, T. J. (1999). Neural tuning: To sharpen or broaden? Neural Computation, 11, 75–84.
Zohary, E. (1992). Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biological Cybernetics, 66, 265–272.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neural discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received February 2, 2001; accepted September 18, 2001.
LETTER
Communicated by Bard Ermentrout
The Influence of Limit Cycle Topology on the Phase Resetting Curve
Sorinel A. Oprisan
[email protected] Carmen C. Canavier
[email protected] Department of Psychology,Universityof New Orleans, New Orleans, LA 70148,U.S.A.
Understanding the phenomenology of phase resetting is an essential step toward developing a formalism for the analysis of circuits composed of bursting neurons that receive multiple, and sometimes overlapping, inputs. If we are to use phase-resetting methods to analyze these circuits, we can either generate phase-resetting curves (PRCs) for all possible inputs and combinations of inputs, or we can develop an understanding of how to construct PRCs for arbitrary perturbations of a given neuron. The latter strategy is the goal of this study. We present a geometrical derivation of phase resetting of neural limit cycle oscillators in response to short current pulses. A geometrical phase is defined as the distance traveled along the limit cycle in the appropriate phase space. The perturbations in current are treated as displacements in the direction corresponding to membrane voltage. We show that for type I oscillators, the direction of a perturbation in current is nearly tangent to the limit cycle; hence, the projection of the displacement in voltage onto the limit cycle is sufficient to give the geometrical phase resetting. In order to obtain the phase resetting in terms of elapsed time or temporal phase, a mapping between geometrical and temporal phase is obtained empirically and used to make the conversion. This mapping is shown to be an invariant of the dynamics. Perturbations in current applied to type II oscillators produce significant normal displacements from the limit cycle, so the difference in angular velocity at displaced points compared to the angular velocity on the limit cycle must be taken into account. Empirical attempts to correct for differences in angular velocity (amplitude versus phase effects in terms of a circular coordinate system) during relaxation back to the limit cycle achieved some success in the construction of phase-resetting curves for type II model oscillators.
The ultimate goal of this work is the extension of these techniques to biological circuits comprising type II neural oscillators, which appear frequently in identified central pattern-generating circuits.
Neural Computation 14, 1027–1057 (2002) © 2002 Massachusetts Institute of Technology
1 Introduction
In order to gain an understanding of the central pattern generators (CPG) involved in rhythmic motor activity such as locomotion (Arshavsky, Beloozerova, Orlovsky, Panchin, Yu, & Pavlova, 1985; Beer, Chiel, & Gallagher, 1999; Canavier et al., 1997; Chiel, Beer, & Gallagher, 1999; Collins & Stewart, 1993; Collins & Richmond, 1994; Golubitsky, Stewart, Buono, & Collins, 1998) and respiration (Abramovich-Sivan & Akselrod, 1998a), it is necessary to understand the phase-resetting behavior of the neural oscillators that comprise such circuits. We have examined the structure of phase resetting in neural oscillators in order to construct phase-resetting curves (PRCs) in response to single and multiple perturbations of arbitrary amplitude and duration. Neural oscillators that produce periodic rhythms such as burst firing can be modeled effectively as limit cycle oscillators (Baxter et al., 2000), and such burst-firing neurons are frequently components of CPGs (Chiel et al., 1999; Beer et al., 1999; Collins & Richmond, 1994; Kopell & Ermentrout, 1988). In this study, we ignore action potential generation and instead focus on the underlying oscillations whose plateaus comprise the bursts and whose troughs comprise the interburst hyperpolarizations. Since physiological inputs to real neurons in a CPG circuit frequently have a duration that is a significant fraction of the cycle period (Canavier, Baxter, Clarke, & Byrne, 1999), this work is a theoretical first step that will be extended to include inputs of significant duration. A CPG is a network of neurons in the central nervous system that is capable of producing rhythmic output (Chiel et al., 1999; Beer et al., 1999; Golubitsky et al., 1998; Kopell & Ermentrout, 1988) in the absence of input from higher centers and sensory feedback. Networks of nonlinear oscillators have attracted a great deal of research due to their theoretical and practical importance (Canavier et al., 1999; Ermentrout, 1985, 1986, 1994; Izhikevich, 2000).
When acting as a component of a network, nonlinear oscillators can be characterized by their PRC (Abramovich-Sivan & Akselrod, 1998a, 1998b; Dror, Canavier, Butera, Clark, & Byrne, 1999; Ermentrout, 1996; Pinsker, 1977). The PRC can be obtained by measuring the change in the period of the limit cycle when a pulselike perturbation is applied at different points in the cycle (phases). The power of PRC methods has been clearly demonstrated by Winfree (1980) and Murray (1993), as well as by Canavier et al. (1997, 1999). Most theoretical work on coupled oscillators has used normal form, or canonical reduction, methods (Ermentrout, 1994, 1996; Hoppensteadt & Izhikevich, 1997). However, such methods involve restrictive assumptions that can limit the application of these methods to circuits composed of physiological neurons. In this article, we explore the theoretical aspects of PRC construction that is based on a geometrical understanding of the phase-space dynamics as originally developed by Winfree (Glass & Mackey, 1988; Hoppensteadt & Izhikevich, 1997; Murray, 1993; Winfree, 1980) and does not require restrictive assumptions or any information that cannot be easily obtained experimentally from physiological neurons. A convenient way to study periodic behavior is to characterize a limit cycle oscillator by its phase. The term phase has been used in the literature in several different ways. Here we will use Winfree's definition (1980): the elapsed time measured from an arbitrary reference divided by the intrinsic period. We denote this term the temporal phase to distinguish it from the geometrical (or distance-based) phase defined below. In contrast to Winfree (1980), Murray (1993) and Hoppensteadt and Izhikevich (1997) defined the phase as the angle that parameterizes the unit circle. For a constant angular velocity on a circular limit cycle, the two definitions are equivalent. Given that limit cycles associated with nonlinear oscillators are usually not circular and can have highly variable angular velocity, we propose another definition that we call the geometrical phase. The geometrical phase can be defined as a distance measured from an arbitrary origin along the limit cycle divided by the intrinsic length of the limit cycle. Both definitions give quantities that are positive and normalized to unity. We stress that the temporally based definition of the phase (Winfree, 1980) has the same meaning as the distance-based one only for so-called phase oscillators, that is, the nonlinear oscillators running with a constant phase speed, or angular velocity. As an illustrative example, we will use the Morris-Lecar oscillator, originally formulated to model action potentials in barnacle muscle (Morris & Lecar, 1981; Rinzel & Lee, 1986; Ermentrout, 1996). We will use it in a different context, as a model for the envelope of a burst-firing waveform described above.
A limit cycle oscillation requires a minimum of two variables (Hoppensteadt & Izhikevich, 1997; Guckenheimer & Holmes, 1983; Murray, 1993; Winfree, 1980); in the Morris-Lecar model, these variables are membrane potential (V) and a negative feedback variable (w) that corresponds to a gating variable for an ionic channel. The limit cycle oscillator can exhibit a range of dynamic activity that depends on the characteristic timescales of the two variables (Rinzel & Lee, 1986; Rinzel & Ermentrout, 1998). If they are comparable in magnitude, the qualitative dynamics of a phase oscillator arise, and the two variables tend to vary smoothly in unison. If w varies much more slowly than V, a relaxation oscillator can be produced in which alternating rapid increases and decreases in V (jumps) are followed by slow relaxation in w. The qualitative characterization of a limit cycle oscillator as a phase oscillator, a relaxation oscillator, or an intermediate determines the relationship between time elapsed (temporal phase) and distance traversed along the limit cycle (geometrical phase). A mapping between geometrical phase and temporal phase was found to provide insight into the shape of the phase-resetting curve. The Morris-Lecar model can exhibit two types of neural excitability. Neural excitability is the capability of a neuron to fire repetitively in response to an externally applied current. For type I excitability, the onset of repetitive firing occurs at arbitrarily low frequency, whereas for type II excitability, the onset of repetitive firing occurs at a nonzero frequency (Hodgkin, 1948). The type of excitability is determined by the bifurcation structure: in type I, oscillations arise via a saddle node bifurcation, whereas in type II, they arise via a Hopf bifurcation (Rinzel & Ermentrout, 1998). The phase response characteristics of a neuron depend on the type of excitability it exhibits (Ermentrout, 1996): in general, type I neurons exhibit either only delays or only advances in response to perturbation of a given sign, whereas type II can exhibit both advances and delays to a given perturbation, depending on its timing. The phase response of the type I neuron is due to its essentially monotonic increase in membrane potential, except for a very rapid hyperpolarization phase of the action potential. Similar results apply to integrate-and-fire neurons for the same reason (Mirollo & Strogatz, 1990). Our analysis assumed that for a neural oscillator, perturbations generally occur in the form of an increase or decrease in transmembrane current. The first derivative of membrane potential is proportional to the transmembrane current, whereas the transmembrane current does not appear in the derivatives of the other state variables, which for neural oscillators are generally gating variables. In oscillators in which the frequency of the temporal waveform is simply scaled by an increase in transmembrane current level, such as type I or integrate-and-fire models, the direction (advance or delay) of a brief perturbation can be determined using only the sign of the current perturbation and the sign of the first derivative of membrane potential (the instantaneous transmembrane current) at the time of the perturbation. By convention, a depolarizing (hyperpolarizing) current applied externally has a positive (negative) sign and produces a positive (negative) change or increase (decrease) in the membrane potential.
Hence, if the signs are the same, there is an advance, whereas if they are different, there is a delay. For an ionic current, such as a synaptic current, the sign convention is opposite. Furthermore, the magnitude of the instantaneous geometrical phase shift due to a brief perturbation is proportional to the magnitude of the time integral of the perturbing current. Once the geometrical phase shift due to a perturbation has been calculated, the mapping from geometrical to temporal phase can be used to construct the desired PRC for oscillators. When transmembrane current levels have more complex effects than simply scaling the frequency, additional calculations are required to compensate for effects on the amplitude of the temporal waveform.

Figure 1: Geometric versus temporal phase. Two-dimensional phase-space plots of the limit cycles for (A) a hypothetical phase oscillator, (C) Morris-Lecar (type I oscillator), and (E) FitzHugh-Nagumo (type II oscillator) relaxation oscillators. (B, D, F) The corresponding plots of the normalized geometrical distance (s) versus temporal phase (φ). For a phase oscillator, the figurative point travels with constant angular velocity around the limit cycle, and therefore the normalized geometrical distance (s) increases linearly (B). For the relaxation oscillators (D, F), the figurative point travels with variable speed along the limit cycle. (A) The four quadrants of a limit cycle; (F) the conversion from geometrical to temporal phase.

2 A Geometric Analysis of Phase Resetting

2.1 A Mapping from Temporal to Geometrical Phase. The phenomenology of phase resetting can be best understood by visualizing a limit cycle, such as those shown in Figures 1A, 1C, and 1E, represented for convenience in a two-dimensional plane. Consider a system of N differential equations:
\frac{du_k}{dt} = f_k(u), \qquad (2.1)

where u_1 = V. We used the infinitesimal Euclidean distance,

ds = \sqrt{\sum_{k=1}^{N} \left( f_k(u) \right)^2}\, dt, \qquad (2.2)

to define the geometrical phase along the limit cycle (Oprisan & Canavier, 2000a). Therefore, the length of a closed orbit is

L = \int_0^{P_i} \sqrt{\sum_{k=1}^{N} \left( f_k(u) \right)^2}\, dt,

where P_i is the intrinsic period of the motion. The normalized geometrical (distance-based) phase along a limit cycle is given by

s = \frac{1}{L} \int_0^{t} \sqrt{\sum_{k=1}^{N} \left( f_k(u) \right)^2}\, dt. \qquad (2.3)
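As a concrete illustration of equation 2.3, the mapping from temporal to geometrical phase can be computed numerically by accumulating arc length along a simulated trajectory. The sketch below is a hedged example: it uses a toy planar oscillator with an attracting circular unit limit cycle traversed at constant angular velocity (a "phase oscillator" in the sense above), not the Morris-Lecar model, so that the expected result is known: s(φ) should come out linear, s ≈ φ, and the orbit length L should be close to 2π.

```python
import math

def f(x, y):
    """Toy oscillator with an attracting unit-circle limit cycle
    traversed at constant angular velocity (period 2*pi)."""
    r2 = x * x + y * y
    return (x - y - x * r2, x + y - y * r2)

def rk4_step(x, y, dt):
    """One fourth-order Runge-Kutta step for the planar system above."""
    k1 = f(x, y)
    k2 = f(x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1])
    k3 = f(x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1])
    k4 = f(x + dt * k3[0], y + dt * k3[1])
    return (x + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
            y + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)

def phase_mapping(period, dt=1e-3):
    """Integrate once around the limit cycle, accumulating the arc
    length ds = ||f|| dt (equation 2.2); return the total length L and
    the normalized mapping s(phi) sampled at each step (equation 2.3)."""
    x, y = 1.0, 0.0                 # a point on the limit cycle
    s_raw, samples, t = 0.0, [], 0.0
    while t < period:
        fx, fy = f(x, y)
        s_raw += math.hypot(fx, fy) * dt
        samples.append((t / period, s_raw))  # (temporal phase, raw arc length)
        x, y = rk4_step(x, y, dt)
        t += dt
    L = s_raw
    return L, [(phi, s / L) for phi, s in samples]
```

For a relaxation oscillator, the same routine would produce the strongly nonlinear s(φ) curves of Figures 1D and 1F; here the mapping is the identity because the speed along the cycle is constant.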
The geometrical phase s is a monotonically increasing function of both the elapsed time t and the temporal phase φ = t/P_i, where P_i is the intrinsic period of the limit cycle motion (see equations 2.3 and 3.2). The function s(φ) gives an invertible mapping between temporal and geometrical phase. Some examples of typical mappings are shown in Figures 1B, 1D, and 1F. For a phase oscillator, the geometrical and temporal phase are identical (see Figure 1B), whereas Figures 1D and 1F show that the mapping for relaxation oscillators can be highly variable and idiosyncratic. A remarkable property of the geometrical to temporal phase mapping is its invariance. This mapping can be obtained using equation 2.3 if the evolution equations are known or by performing a time series delay embedding
to reconstruct the attractor (Takens, 1981; Abarbanel, Brown, Sidorowich, & Tsimring, 1993) first and then integrating along the numerically computed trajectory. We obtained a geometrical to temporal phase mapping for the Morris-Lecar model at the fixed parameter settings given in the caption of Figure 2. We used direct integration of the differential equations, as well as attractor reconstruction using delay embedding with two different delays, and in every case the agreement was very good (see Figure 2) (see appendix A for a detailed proof of invariance under certain assumptions).

2.2 Phase Resetting During a Perturbation. A geometrical perspective on phase resetting is given in Figure 3. A perturbation (the dark solid line in
Figure 2: Dynamical invariance of the reconstructed geometrical to temporal phase mapping. The geometrical to temporal phase mapping was obtained by integrating the differential equations of the ML model oscillator in the type I excitability case. Using only the dimensionless membrane potential record, the phase-space attractor was reconstructed using two different time lags (τ = 100, dotted line, and τ = 200, dot-dashed line), and for each reconstructed attractor, the geometrical to temporal phase mapping was computed.
Figure 3: Topology of phase resetting. The phase-space plot of a two-dimensional stable limit cycle. A perturbation applied at time t_0 for a duration of t_1 perturbs the trajectory (dark line). After the perturbation, the trajectory requires a time t_2 to return to the limit cycle.
Figure 3) applied at time t_0 has a duration of t_1. At the end of the perturbation, the trajectory may be at some distance from the limit cycle, and it relaxes back to the limit cycle (dashed line) during the time period t_2. The temporal phase resetting is the normalized difference in elapsed time along the perturbed and unperturbed paths over the time period t_1 + t_2. While t_1 is easily determined, t_2 is not. Fortunately, for the simple case of type I excitability, it is possible to neglect the resetting that occurs during t_2, but we will first propose expressions that give the geometrical phase resetting for both t_1 and t_2 (see Oprisan & Canavier, 2001) and then recover the temporal phase resetting by using the inverse mapping from geometrical to temporal phase. The main idea is to obtain the projection of the perturbed trajectory onto the unperturbed trajectory. Physiological perturbations generally arise in the form of a perturbation in current, which affects only V̇, or f_1 (see equation 3.1), and no other derivative (see also Nishii, 1999). The evolution equation of the membrane potential has the general form \dot{V} = -\frac{\sum I_{ionic}}{C_m} + \frac{I_{ext}}{C_m}, where \sum I_{ionic} stands for the sum of all the ionic currents, C_m is the membrane capacitance, and I_{ext} is an externally applied current. The projection of the vector representing the perturbation (i(t)/C_m, 0, 0, \ldots), where i(t) is the perturbing current as a function of time, onto the vector containing the time derivatives f_k(u) of the state variables evaluated along the limit cycle would produce the normalized infinitesimal distance traveled along the limit cycle:

ds_1 = \frac{1}{L} \frac{i}{C_m} \frac{f_1(t)}{\| f(t) \|}\, dt. \qquad (2.4)
Thus, the geometrical phase resetting during the perturbation of duration t_1 would be

\Delta s_1 = \frac{1}{L} \int_0^{t_1} \frac{i}{C_m} \frac{f_1(t)}{\| f(t) \|}\, dt. \qquad (2.5)

If we assume that the perturbation has a very short duration, it is not necessary to integrate f_1(t)/\|f(t)\| over the course of the perturbation; rather, we assume this quantity remains constant for the duration of the perturbation (Glass & Mackey, 1988; Guckenheimer & Holmes, 1983; Hoppensteadt & Izhikevich, 1997; Murray, 1993; Pavlidis, 1973). For simplicity of presentation, we assume that i is also constant for the duration t_1 of the perturbation:

\Delta s_1 = \frac{1}{L} \frac{t_1 i}{C_m} \frac{f_1(t_0)}{\| f(t_0) \|}. \qquad (2.6)
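Equation 2.6 reduces the effect of a brief pulse to a single evaluation on the limit cycle. A minimal sketch of the two-dimensional case follows; the symbols t1, i, Cm, and L follow the text, while the numerical values in the usage below are arbitrary illustrations, not parameters from this article.

```python
import math

def delta_s1(f1, f2, i, t1, Cm, L):
    """Geometrical phase shift of a brief current pulse (equation 2.6,
    two-dimensional case): the voltage displacement i*t1/Cm is projected
    onto the unit tangent f/||f|| and normalized by the orbit length L."""
    return (1.0 / L) * (t1 * i / Cm) * f1 / math.hypot(f1, f2)
```

Where the flow is purely normal to the voltage direction (f1 = 0), a brief pulse produces no tangential resetting, and in general the sign of Δs1 is sgn(f1 · i), as stated in the text.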
Figure 4 illustrates the relationships among the vectors described above for a two-dimensional case, for a phase advance in Figure 4A and a delay in Figure 4B. In each case, the vector s_0 represents the unperturbed limit cycle, the vector s* represents the perturbed limit cycle, and Δs is the geometrical resetting before normalization by L, which then gives Δs_1. The derivatives with respect to voltage are given by f_1 evaluated on the limit cycle at the point of the perturbation and by f_1* during the perturbation. The derivative with respect to the second variable, f_2, is the same for the perturbed and unperturbed cases. The vectors are defined as follows: \vec{s}_0 = [t_1 f_1(t_0), t_1 f_2(t_0)] and \vec{s}^* = [t_1 f_1(t_0) + \Delta V, t_1 f_2(t_0)]. The vector \vec{s}_0 would be traversed by the trajectory in the absence of a perturbation, and the vector \vec{s}^* would be traversed in its presence. The direction of the perturbation is given by the difference f_1^* t_1 - f_1 t_1, which is equal to i t_1 / C_m, or ΔV. One can obtain the normal displacement Δh from the tangential displacement using similar triangles: \Delta h = \frac{f_2(t_0)}{f_1(t_0)} \Delta s_1. This relationship becomes useful in the next section. The temporal phase resetting is recovered using

\Delta\varphi_1 = s^{-1}(\Delta s_1). \qquad (2.7)
The recovery can be accomplished graphically as shown in the lower portion of Figure 1F. The calculated Δs is added to the geometrical phase at the point at which the perturbation is received. The change in temporal phase is given by the horizontal displacement required to reach the new value of s.

2.3 Phase Resetting During Relaxation Back to the Limit Cycle After a Perturbation. As shown in Figure 3, it is possible for the trajectory to be at some normal displacement from the unperturbed limit cycle after the perturbation has ended, so that additional phase resetting is incurred during t_2, the relaxation back to the limit cycle. As we will show in the
Figure 4: Idealized trajectory of a perturbation. The linearized tangent displacement along the unperturbed limit cycle s_0 and its perturbed counterpart s*. The perturbed trajectory s* is induced by a perturbation in transmembrane current: (A) the depolarizing case and (B) the hyperpolarizing case. The tangent geometrical phase shift Δs and the transverse displacement (f_2/f_1)Δs can be easily written in terms of the unperturbed f and perturbed f* vector fields.
Table 1: Phase Resetting Due to Normal Displacement from the Limit Cycle.

Quadrant   f1   ΔV   Δs1   f2/f1   Δh   Displacement   Δs2
I          −    +    −     +       −    Inside         −
I          −    −    +     +       +    Outside        +
II         +    +    +     −       −    Inside         −
II         +    −    −     −       +    Outside        +
III        +    +    +     +       +    Outside        +
III        +    −    −     +       −    Inside         −
IV         −    +    −     −       +    Outside        +
IV         −    −    +     −       −    Inside         −
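The sign logic of Table 1 follows mechanically from the two rules stated in the text: sgn(Δs1) = sgn(f1 ΔV) and sgn(Δs2) = sgn((f2/f1) Δs1), with a positive normal displacement Δh placing the displaced trajectory outside the limit cycle. A short sketch, using the quadrant sign conventions of Figure 1A (the function and variable names here are illustrative, not from the article):

```python
def sign(x):
    return "+" if x > 0 else "-"

# Quadrant sign conventions for (f1, f2) as defined in the text:
# I: f1<0, f2<0; II: f1>0, f2<0; III: f1>0, f2>0; IV: f1<0, f2>0.
QUADRANTS = {"I": (-1, -1), "II": (+1, -1), "III": (+1, +1), "IV": (-1, +1)}

def table1_row(quadrant, dV):
    """Signs of Delta-s1 and Delta-h (the latter equals Delta-s2 up to
    the positive constant K), plus the inside/outside classification,
    for a voltage perturbation of sign dV in the given quadrant."""
    f1, f2 = QUADRANTS[quadrant]
    ds1 = f1 * dV            # sgn(Delta-s1) = sgn(f1 * dV)
    dh = (f2 * f1) * ds1     # sgn(f2/f1) = sgn(f2 * f1)
    side = "Outside" if dh > 0 else "Inside"
    return sign(ds1), sign(dh), side
```

For example, a depolarizing pulse (ΔV > 0) in quadrant III gives Δs1 = +, Δh = +, Outside, while the same pulse in quadrant I gives Δs1 = −, Δh = −, Inside, reproducing the corresponding rows of Table 1.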
next section, for some types of oscillators, this additional phase resetting (Δs_2 and Δφ_2) can be neglected. For the case in which there is significant normal displacement that cannot be ignored, we have begun to calculate the appropriate corrections by initially assuming that the linear velocity of a trajectory just inside or just outside the limit cycle is approximately the same as that of a point on the limit cycle. However, in general, the angular velocity will be greater inside the limit cycle and smaller outside the limit cycle. Hence, trajectories that are horizontally displaced inside the limit cycle will be advanced, and those outside will be delayed. As a first approximation, we made the assumption that Δs_2 is proportional to the normal distance Δh between the perturbed and unperturbed trajectories: Δs_2 = K Δh. The normal distance can be calculated as

\Delta h = \frac{\sqrt{\sum_{k=2}^{N} f_k(u)^2}}{f_1} \Delta s_1. \qquad (2.8)
For the two-dimensional case, the expression

\Delta h = \frac{f_2}{f_1} \Delta s_1, \qquad (2.9)

always gives the correct sign for the resetting. A limit cycle can be divided into four quadrants (see Figure 1A): in I, f_1 < 0 and f_2 < 0; in II, f_1 > 0 and f_2 < 0; in III, f_1 > 0 and f_2 > 0; and in IV, f_1 < 0 and f_2 > 0. The sign of Δs_1 is given by sgn(f_1 ΔV), and the sign of Δs_2 is given by sgn((f_2/f_1) Δs_1), and in every case, a normal displacement outside the limit cycle gives a positive Δs_2, which corresponds to an advance, and a displacement inside the limit cycle gives a negative Δs_2, or a delay (see Table 1). The remaining problem is to assign a value to the proportionality constant K. The constant of proportionality K between the normal displacement and its corresponding additional tangential displacement was found by calculating the time required for relaxation to the limit cycle, assuming the temporal
phase resetting was proportional to the relaxation time, converting this temporal resetting to geometrical resetting, and finding the appropriate K. First, we determined the maximum normal distance (Δh_max) between the unperturbed limit cycle and a perturbed limit cycle with a current perturbation of i. The normalized time Δφ_max (with respect to the intrinsic period of the unperturbed limit cycle) necessary for the figurative point to relax from the maximum displacement sufficiently close to the unperturbed limit cycle was determined. Δφ_max was converted to Δs_max using the temporal to geometrical mapping. The constant K is the ratio Δs_max/Δh_max. Since the time required to relax back to the limit cycle (t_2 in Figure 3) is theoretically infinite, we used instead the time required to relax to within a neighborhood of the limit cycle defined by the limits of our computational accuracy (Izhikevich, 2000). A plot of the natural logarithm of normal displacement versus time (not shown) was linear until this limit was reached; then it became flat. The time to reach the end of the linear region was normalized to give Δφ_max. Once again, the temporal phase resetting is recovered as

\Delta\varphi_2 = s^{-1}(\Delta s_2), \qquad (2.10)
and the total phase resetting is given by Δφ = Δφ_1 + Δφ_2.

3 Results Using the Morris-Lecar Model as an Example
The Morris-Lecar model has the advantages of simplicity (it has only two state variables) and versatility (it can exhibit either type I or type II excitability, depending on the values of its parameters). We used the following general dimensionless form of the nonlinear oscillator,

\dot{V} = f(V, w) + I_{ext} = f_1(V, w; I_{ext}),
\dot{w} = g(V, w) = f_2(V, w; I_{ext}), \qquad (3.1)
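For reference, a Morris-Lecar right-hand side in the form of equation 3.1 can be sketched as follows. The parameter values below are the commonly used type I settings of Rinzel and Ermentrout (1998), not the appendix B values of this article (which are not reproduced in this excerpt), so they should be treated as an assumption.

```python
import math

# Type I Morris-Lecar parameters (Rinzel & Ermentrout, 1998) -- an
# assumed stand-in for the appendix B settings, which are not shown here.
C_M = 20.0                              # membrane capacitance (uF/cm^2)
G_CA, G_K, G_L = 4.0, 8.0, 2.0          # maximal conductances (mS/cm^2)
V_CA, V_K, V_L = 120.0, -84.0, -60.0    # reversal potentials (mV)
V1, V2, V3, V4 = -1.2, 18.0, 12.0, 17.4
PHI_W = 0.0667                          # rate scale of the slow variable w

def morris_lecar(V, w, I_ext):
    """Right-hand side (f1, f2) of equation 3.1: f1 = dV/dt, f2 = dw/dt."""
    m_inf = 0.5 * (1.0 + math.tanh((V - V1) / V2))
    w_inf = 0.5 * (1.0 + math.tanh((V - V3) / V4))
    tau_w = 1.0 / math.cosh((V - V3) / (2.0 * V4))
    dV = (-G_CA * m_inf * (V - V_CA) - G_K * w * (V - V_K)
          - G_L * (V - V_L) + I_ext) / C_M
    dw = PHI_W * (w_inf - w) / tau_w
    return dV, dw
```

With these settings the model fires repetitively for I_ext somewhat above the saddle-node value (around 40 uA/cm²), for example I_ext = 60.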
where V stands for the fast variable, w for the slow variable, and I_ext for the control parameter (externally applied current). Two-dimensional reduced models like Morris-Lecar (Morris & Lecar, 1981; Ermentrout, 1994) and FitzHugh-Nagumo (FitzHugh, 1969) have similar dimensionless forms. The ionic currents can be written in terms of membrane potential V and slow variable(s) (see appendix B). The expression for geometrical phase resetting due to tangential displacement during a perturbation (see equation 2.6) becomes

\Delta s_1 = \frac{1}{L} \frac{t_1 i}{C_m} \frac{f_1(t_0)}{\sqrt{f_1(t_0)^2 + f_2(t_0)^2}}, \qquad (3.2)

where f_1 = \dot{V} and f_2 = \dot{w}.
In order to evaluate the success of the methods described in the preceding section for constructing the PRC, we numerically generated actual PRCs for the Morris-Lecar model using an explicit fourth-order Runge-Kutta method to solve the nonlinear equations B.1. The simulations were written in C and run under UNIX on a Sun Enterprise 450 Ultra Server. First, the differential equations were numerically integrated until the solution converged to the limit cycle, and the period P_i of the steady solution was computed. Then a steplike current pulse was applied at different temporal phases (φ) during the cycle with respect to an arbitrary fixed reference point on the limit cycle (we used the maximum of the membrane potential as a reference point, but different points can be used; Canavier et al., 1997, 1999). The perturbed period P of the membrane potential is measured, and the quantity F(φ) = P/P_i − 1 is computed for each value of φ. F(φ) was used for consistency with our previous articles and is equal to the negative of the total phase resetting, −Δφ. These actual PRCs were used for comparison with the theoretical PRCs. For precise continuity near the reference point, the second perturbed period after the pulse is applied should also be tabulated and the effect on both periods summed to produce F(φ) (Canavier et al., 1997, 1999), but the second perturbed period was not examined in this article.

3.1 Phase Resetting Induced in Type I and Type II Oscillators by a Single Input. The actual PRC for a type I oscillator (parameters given in
appendix B) receiving an excitatory input is shown in Figure 5A (dotted line) and compared with the PRC constructed according to the methods described in the preceding section (dashed line). Only the contribution from Δs_1 was included in the calculation; any contribution from Δs_2 was neglected. It can be seen from the common plot of actual and predicted PRCs that our geometrical method gives a very good estimation of the PRC for the type I oscillator. As we expected based on analytic results (see equation 3.2), the amplitude of the PRC is proportional to the pulse duration t_1 (see Figure 5B) and also to the pulse amplitude, as shown in Figure 5C. The shape of the geometrical phase shift curve can be anticipated by examining equation 3.2 together with the waveform of the membrane potential (see Figure 6C). Since the membrane potential is always increasing except in a very short time interval, we would expect a depolarizing external stimulus to produce a negative phase shift (an advance), as is indeed the case in Figure 5A. However, there is a very small temporal window in which the membrane potential is decreasing, and a perturbation applied at this time would produce a phase shift of the opposite sign. As a practical matter, such an anomalous phase shift would be difficult to observe in a physiological type I oscillator, but in a model, we have the precision to detect it. Hence Figure 7 illustrates how this region of anomalous phase shift narrows as the bifurcation that generates type I excitability is approached. A numerical computation of the ratio of the time interval during which the membrane potential is decreasing to the period of the signal is shown in Figure 7. The time
Figure 6: Effect of bias current on limit cycle topology and angular velocity (frequency). Two-dimensional limit cycles for the (A) Morris-Lecar type I and (B) type II excitability cases. The curves labeled 1 and 2 were obtained at different values of I_ext. For type I excitability, the topology of the attractor is not very sensitive to changes in the control parameter I_ext (A), whereas the period of the limit cycle motion is sensitive to the same change in I_ext (C). On the other hand, the topology of the attractor obtained via a type II excitability mechanism (Hopf bifurcation) is dramatically influenced by a change in the control parameter I_ext (B), but the period is not (D).
Figure 5: Predicted (analytic) versus actual (numerically computed) type I phase resetting. (A) The numeric (continuous curve) and analytically mapped (dotted curve) PRCs for the type I excitability case. The perturbation duration was t_1 = P_i/100, using equation 3.2 and the graphic mapping procedure. The steady dimensionless current was I_ext = 0.0695, and the pulse amplitude was 10% of I_ext. A summary of numerical simulations confirms that the PRC amplitude scales with the product of the current amplitude (B) and its duration (C), as theoretically predicted by equation 3.2.
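The numerical PRC procedure described in the text (integrate to the limit cycle, apply a pulse at temporal phase φ, and compute F(φ) = P/P_i − 1 from the perturbed period) can be sketched end to end on a toy oscillator. The example below is an assumption-laden stand-in, not the Morris-Lecar model: it uses a planar oscillator whose unit-circle limit cycle is traversed at exactly unit angular velocity, because for that system an instantaneous voltage-like displacement δ at angle θ0 has the closed-form resetting θ0 → atan2(sin θ0, cos θ0 + δ), which the measured F(φ) must reproduce.

```python
import math

def f(x, y):
    """Toy planar oscillator: attracting unit-circle limit cycle with
    angular velocity exactly 1 everywhere, so the period is 2*pi."""
    r2 = x * x + y * y
    return (x - y - x * r2, x + y - y * r2)

def rk4(x, y, dt):
    k1 = f(x, y)
    k2 = f(x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1])
    k3 = f(x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1])
    k4 = f(x + dt * k3[0], y + dt * k3[1])
    return (x + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
            y + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)

def time_to_crossing(x, y, dt=1e-3, t_max=20.0):
    """Time until the trajectory crosses the reference section
    {y = 0, x > 0} in the upward direction (linearly interpolated)."""
    t = 0.0
    while t < t_max:
        xn, yn = rk4(x, y, dt)
        if y < 0.0 <= yn and xn > 0.0:
            return t + dt * (-y) / (yn - y)
        x, y, t = xn, yn, t + dt
    raise RuntimeError("no section crossing found")

def prc(phi, delta, dt=1e-3):
    """Measured F(phi) = P/Pi - 1 for an instantaneous displacement
    delta in the voltage-like (x) direction applied at temporal phase
    phi, together with the closed-form prediction for this oscillator."""
    Pi = time_to_crossing(1.0, 0.0, dt)        # unperturbed period
    n = int(round(phi * Pi / dt))
    x, y = 1.0, 0.0
    for _ in range(n):                          # advance to phase phi
        x, y = rk4(x, y, dt)
    t0 = n * dt
    theta0 = math.atan2(y, x)
    x += delta                                  # the perturbation
    P = t0 + time_to_crossing(x, y, dt)         # perturbed period
    F = P / Pi - 1.0
    theta1 = math.atan2(math.sin(theta0), math.cos(theta0) + delta)
    F_pred = -(theta1 - theta0) / (2.0 * math.pi)
    return F, F_pred
```

The agreement between the measured and predicted values plays the same role here as the comparison of actual and analytically mapped PRCs in Figure 5A: a positive F is a lengthened period (delay), a negative F an advance.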
Figure 7: Deviation of the type I model from a resetting curve with a single sign. The deviation is plotted as the fraction of the cycle over which the phase resetting has the opposite sign from the idealized type I PRC, in which all phase resets in response to a given input have the same predictable sign. When the control parameter approaches the bifurcation point (criticality), the relative magnitude of the time interval over which the membrane potential decreases becomes negligible.
interval over which the membrane potential decreases from its maximum to minimum is almost constant over a very large control parameter range, but as the bifurcation (criticality) is approached, the period increases, so the ratio decreases. Theoretically, in the limiting case of criticality, the period of the membrane potential oscillation becomes infinite because the phase trajectory approaches the separatrix. On the other hand, using our observation that the time interval over which the membrane potential decreases from its maximum to minimum is nearly constant over a very large range of I_ext and Ermentrout's (1996) formula for the period of the membrane oscillations, T \propto (I_{ext} - I_c)^{-1/2}, where I_c is the critical value of the control
parameter at the saddle-node bifurcation point, it is possible to derive an analytic expression for the plot in Figure 7. Therefore, the computationally determined ratio can be satisfactorily fitted by \sqrt{I_{ext} - I_c}. These two observations lead to the conclusion that the relative magnitude of the positive part of the PRC (which is related to the relative contribution of the decreasing membrane potential to the whole signal duration) decreases to zero as we approach criticality. Rinzel and Ermentrout (1998) make the same point: "In only a very small interval of time can the phase be delayed, and this is a general property of membranes that become oscillatory through a saddle node bifurcation" (p. 286). The temporal waveform given in Figure 6C explains the general shape of the PRC, and the effect of a perturbation in current on both the phase plane representation (see Figure 6A) and the temporal waveform (see Figure 6C) explains why the phase resetting due to relaxation after a normal displacement is not significant for oscillators that exhibit type I excitability. An important topological feature of the associated limit cycle attractor shown in the phase-plane representation of Figure 6A is its invariance with respect to a change in the applied current. On the other hand, the period of the limit cycle motion is highly sensitive to such changes (see Figure 6C). For type I excitability, a perturbation in current produces a negligible normal displacement from the limit cycle. Therefore, in this case, the geometrical phase difference occurs solely because of the different speeds of the figurative points along the limit cycle. Quantitatively, the normal displacement from the unperturbed limit cycle is proportional to f_2(t_0)/f_1(t_0). For the type I oscillators that we examined in this study, we found that the average value of the above ratio is about 2.8 × 10^{-4} using the dimensionless (normalized) forms of V and w, whereas for type II oscillators, the average value was 2.8.
Thus, the normal displacement induced by the same perturbation is four orders of magnitude smaller for type I oscillators than for type II oscillators. This explains the excellent results obtained using only Δs_1, representing the tangential displacement during a perturbation, for type I. Since the normal displacement is significant for type II oscillators (see Figure 6B), we expect that we will have to include the contribution Δs_2 from relaxation after the normal displacement. Type II excitability arises via a Hopf bifurcation. In contrast to type I excitability, the geometry of the associated limit cycle attractor is very sensitive to control parameter changes (see Figure 6B), but the period of the motion changes more slowly than for type I excitability (see Figure 6D). As we expected, calculating the PRC using only the contribution of Δs_1 (dotted line, Figure 8A) did not produce a good fit to the actual PRC. For the two-dimensional case, the expression for Δs_2 becomes
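The square-root scaling near criticality mentioned above can be illustrated numerically on the quadratic integrate-and-fire neuron, the normal form of a saddle-node spiking bifurcation (with I_c = 0). This is a hedged sketch, not the Morris-Lecar model the article actually analyzes; the reset and threshold values are illustrative choices.

```python
import math

def qif_period(I, v_reset=-10.0, v_thresh=10.0, dt=1e-3):
    """Forward-Euler period of the quadratic integrate-and-fire neuron
    dV/dt = I + V^2, the normal form near a saddle-node bifurcation
    (critical current I_c = 0)."""
    v, t = v_reset, 0.0
    while v < v_thresh:
        v += dt * (I + v * v)
        t += dt
    return t

# Near the bifurcation the period scales as (I - I_c)^(-1/2):
# quadrupling the distance to criticality should roughly halve the period.
T1, T4 = qif_period(0.01), qif_period(0.04)
print(T1 / T4)  # close to 2
```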
Δs_2 = K [f_2(t_0)/f_1(t_0)] Δs_1,

(3.3)
Sorinel A. Oprisan and Carmen C. Canavier
Inuence of Limit Cycle Topology
for type II parameter settings (given in appendix B). Adding the contributions of both Δs_1 and Δs_2 produced a better fit (see Figure 8B), but not the nearly perfect agreement achieved for type I. Possible future directions for improving the fit are given in section 4. 3.2 Phase Resetting Induced in a Type I Oscillator by Multiple Inputs per Cycle. We generalized the above methodology to the case in which the
neural oscillator receives more than one input during a cycle (Oprisan & Canavier, 2000b). Only the simpler case of type I excitability is addressed, so that we could assume the trajectory returns to the limit cycle between successive perturbations. We considered the simplified case of two identical current pulses, meaning that τ_1 = τ_2 and ΔI_ext,1 = ΔI_ext,2 (see Figure 9A). Here, τ_1 and τ_2 denote the durations of two distinct pulses rather than the duration of the perturbation and the associated relaxation back to the limit cycle as in the preceding sections. t_1 denotes the time at which the first current pulse is applied (see Figure 9A). As a consequence of this current perturbation, the geometrical phase shift induced at time t_1 is given by equation 3.2. Another current pulse is applied at time t_2. The geometrical phase shift is again given by equation 3.2, but the temporal phase at which the second pulse is applied is t_1/P_i + Δφ_1, where Δφ_1 is the temporal phase shift corresponding to the geometrical phase shift previously determined for the first pulse. Therefore, if we denote by F^[1](φ) the single-pulse PRC as it was defined in the preceding section, then for two identical current pulses acting during the same cycle, the PRC becomes

F^[2](φ) = F^[1](φ) + F^[1]( φ + F^[1](φ) + (t_2 − t_1)/P_i ),
(3.4)
where P_i is the intrinsic period of the oscillation for the steady external current value I_ext,1, and (t_1 + t_2)/P_i < 1 in order to deliver both current pulses during the same cycle (Ermentrout & Kopell, 1991b). This analysis also can be generalized to more than two inputs per cycle with arbitrary amplitude-duration ratios. We performed extensive numerical simulations to verify our predictions using a Morris-Lecar model neuron. Figure 10 summarizes the agreement between the predicted two-pulse PRC (continuous curve)
Figure 8: Facing page. Predicted (analytical) versus actual (numerically computed) type II phase resetting. (A) Uncorrected and (B) corrected for relaxation: numeric (continuous curve) and analytic (dotted curve) PRCs for type II excitability. The duration of the perturbing pulse was τ = P_i/100. The steady dimensionless current was I_ext = 0.25, and the pulse amplitude was 10% of I_ext. The anomalous region arises due to the finite cycle fraction during which the membrane potential is rapidly decreasing, whereas it increases during the remainder of the cycle.
Figure 9: Generalization of the resetting topology to multiple pulses. (A) Two brief current pulses with the same amplitude ΔI_ext = I_ext,2 − I_ext,1 and duration τ_1 = τ_2 acting during the same cycle. (B) The figurative point is accelerated along the unperturbed limit cycle when the perturbation is first applied (at t_1) for the interval τ_1. When the first perturbation switches off, the figurative point again moves along the unperturbed limit cycle until a second pulse is applied, at time t_2. The figurative point is again accelerated along the limit cycle and moves along it. When the second perturbation terminates, the figurative point again moves along the unperturbed limit cycle. Mapping the geometrical phase differences into temporal phase differences, we found the PRCs for multiple inputs per cycle (see B).
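The composition rule of equation 3.4, in which the second pulse arrives at a phase already shifted by the first reset, can be sketched as follows. The single-pulse PRC used here is a toy illustrative choice, not one of the PRCs computed in the article, and the wrap-around at phase 1 is an assumed normalization.

```python
import math

def two_pulse_prc(F1, phi, gap):
    """Equation 3.4: resetting for two identical pulses separated by a
    fraction gap = (t2 - t1)/Pi of the intrinsic period.  F1 is the
    (assumed known) single-pulse PRC on [0, 1)."""
    first = F1(phi)
    # The second pulse is applied at a phase already shifted by the first reset.
    second = F1((phi + first + gap) % 1.0)
    return first + second

# Toy type-I-like single-pulse PRC (illustrative, not from the paper):
F1 = lambda phi: 0.05 * (1.0 - math.cos(2.0 * math.pi * phi))

print(two_pulse_prc(F1, 0.2, 0.4))
```

With a vanishing single-pulse PRC the two-pulse PRC vanishes as well, and a constant single-pulse PRC simply doubles, which is a quick consistency check on the composition.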
Figure 10: Predicted versus actual phase resetting for the multiple-pulse protocol. Common plot of predicted (continuous line) and actual numerically computed (dotted line) PRCs with two identical current pulses applied during the same cycle. The pulse amplitude was 1% of the steady external current, and its duration was 1% of the intrinsic period P_i. The only variable was the distance between current pulses: (A) 20% P_i, (B) 40% P_i, (C) 60% P_i, and (D) 80% P_i.
and its numerically computed actual counterpart (dotted curve) for variable distances between the two current pulses. The agreement between the actual and predicted values is very good. We also found that the shape of the current pulse is not critical. Numerical simulations using a presynaptic input from an identical Morris-Lecar model neuron rather than a square pulse gave results in agreement with equation 3.4. Another recent study (Foss & Milton, 2000) also showed that the shape of the presynaptic input does not dramatically change the response of the postsynaptic neuron.

4 Discussion and Conclusion
The improvement of the theoretical understanding of the PRC is significant. First, we used an unambiguous definition of geometrical (distance-based) phase that applies to every nonlinear oscillator, even in high-dimensional spaces. Second, we provided a geometrical conceptual basis for phase resetting. The phase-space analysis that we performed revealed complex problems posed by phase-resetting mechanisms and suggested analytic approaches to solve these problems. A main result of this article is a general method that allows the analytic construction of the PRC in addition to previously established computational methods. The relationship of our approach to previous approaches to this problem is given in appendix C. Although we referred to a particular model (Morris-Lecar) in order to compare our findings with numerically reported ones, the method we described is generally applicable. The theoretical predictions have been confirmed by numerical solution. For type I excitability, the agreement is quite satisfactory and generalizes to any arbitrary combination of pulses. For type II excitability, some disagreement persists, and further theoretical work is required. One source of error in the prediction of type II PRCs is the uncertainty in the tangential displacement due to relaxation after a normal displacement. This tangential displacement was assumed to be linearly proportional to the normal displacement. However, for a very simple circular limit cycle with radius r_0, a constant angular velocity at all points on the limit cycle, and a constant linear velocity at all points near the limit cycle, it can be shown that a normal displacement of Δr produces a tangential displacement proportional to ln((r_0 + Δr)/r_0). Since limit cycles are not in general circular, the quantities r_0 and Δr are not defined for arbitrary limit cycles such as those shown in Figure 6. Nevertheless, perhaps a better approximation of the dependence of tangential displacement (geometrical phase) on normal displacement can be made by defining analogous quantities for arbitrary limit cycles. The ultimate goal of this work is to apply the results to the analysis of circuits composed of physiological oscillators, including bursting neurons.
The fundamental assumption of phase-resetting analysis is that the only effect of a perturbation is to lengthen or shorten the cycle in which the perturbation is received. When a neuron is a component of a network, the burst in that component neuron serves as the perturbation received by the neurons on which it makes synaptic connections. Thus, an alteration in burst duration or amplitude may have an impact on the resetting induced in the target neurons. To measure the magnitude of the signal shape change induced by a perturbation, we evaluated the relative change in the maximum amplitude of the signal. For type I excitability, the maximum amplitude of the membrane potential is negligibly affected by perturbations, as one would expect from the coincidence of the perturbed and unperturbed limit cycles. However, burst duration may be affected. Similarly, again as one would expect, for type II excitability, we observed that the maximum amplitude of the membrane potential changes significantly, in a manner strongly correlated with the PRC. Preliminary work (Oprisan & Canavier, 2000a, 2000b) shows that these effects of perturbations do not significantly compromise our ability to analyze circuits of coupled Morris-Lecar oscillators using their PRC characteristics. The work presented in this article will help us to analyze physiological circuits without having to generate PRCs experimentally for every combination of number, timing, amplitude, and duration of perturbations that might result from the oscillatory activity of the circuit, but rather by generalizing from a small number of PRCs, we hope as few as one per neuron. How can this method be extended to physiological neurons? Although we do not have access a priori to the equations that govern the electrical activity of a given neuron, this method can still apply to physiological neurons, because we have observed that the geometrical to temporal phase relationship that we obtained here using equation 2.3 is similar to the one we can numerically extract using the membrane potential record. More generally, we can replace the described Euclidean-based approach (which requires analytic expressions of the vector field) by a computational geometrical phase construction based entirely on a delayed embedding using the record of the fast variable (Takens, 1981). In order to obtain a slow variable, filtering techniques may be required. Once we have obtained an appropriate phase-plane reconstruction, the geometrical to temporal phase mapping can be obtained, and then the predicted phase resetting induced by external perturbations. We predict (Oprisan & Canavier, 2000a, 2000b) that these two steps can give us the PRC for neurons (such as cortical neurons) that are type I oscillators (Wilson & Bower, 1989; Traub & Miles, 1991). Experimentally obtained PRCs can then be used to confirm the validity of our theoretical analysis. In addition, these methods can be extended to type II neural oscillators, which appear frequently in physiological central pattern generators, by experimentally estimating the proportionality constant K to correct for the phase resetting that occurs during relaxation after a normal displacement from the limit cycle. Appendix A: The Dynamical Invariance of the Mapping Between the Geometrical Phase and the Temporal Phase
The dynamical invariance of the geometrical to temporal phase curve, which is a homeomorphism from the limit cycle to the unit circle, is especially important for our short-term objective, which is the recovery of the PRC from a phase-space reconstruction of the attractor. The sketch of the proof that the geometrical to temporal phase curve is an invariant is as follows. Consider a one-dimensional time series, such as the membrane potential record, for example, x[1], x[2], x[3], . . . ,
(A.1)
where the x[i] are evenly sampled at intervals of Δt, the time delay τ is an integer multiple of Δt, and D is the embedding dimension of the appropriate phase space. Then the following D-dimensional vectors describe the attractor:

y[i] = (x[i], x[i + τ], x[i + 2τ], . . . , x[i + (D − 1)τ]).
(A.2)
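The construction of equation A.2 can be sketched directly in code. This is a minimal illustration, assuming a sinusoidal "membrane potential" record; the lag and dimension choices are illustrative, not values used in the article.

```python
import math

def delay_embed(x, tau, D):
    """Takens-style delay embedding (equation A.2): map the scalar record
    x[i] to D-dimensional vectors y[i] = (x[i], x[i+tau], ..., x[i+(D-1)tau])."""
    n = len(x) - (D - 1) * tau
    return [[x[i + k * tau] for k in range(D)] for i in range(n)]

# A periodic record; a lag of a quarter period gives a nearly circular
# two-dimensional reconstruction of the underlying oscillation.
N = 200
x = [math.sin(2 * math.pi * i / N) for i in range(3 * N)]
orbit = delay_embed(x, tau=N // 4, D=2)
print(orbit[0])  # ≈ [0.0, 1.0]
```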
The Euclidean distance between two successive points y[i] and y[i + 1] along the reconstructed attractor is

R²_{i,i+1}(D) = Σ_{k=0}^{D−1} (x[i + 1 + kτ] − x[i + kτ])²,

(A.3)

which gives for the length of the closed trajectory L(D) = Σ_{i=1}^{N} R_{i,i+1}(D), where N is the number of points along the closed trajectory and x[i] = x[i + N]. The geometrical distance (before normalization) after j < N steps along the limit cycle is

s_j(D) = Σ_{i=1}^{j} R_{i,i+1}(D).

(A.4)
Let D be the minimum embedding dimension of the attractor. Reconstructing the attractor in 2D-dimensional phase space, we obtain

R²_{i,i+1}(2D) = Σ_{k=0}^{2D−1} (x[i + 1 + kτ] − x[i + kτ])².

(A.5)
This can be rewritten as

R²_{i,i+1}(2D) = Σ_{k=0}^{D−1} (x[i + 1 + kτ] − x[i + kτ])² + Σ_{k=D}^{2D−1} (x[i + 1 + kτ] − x[i + kτ])²

= R²_{i,i+1}(D) + Σ_{j=0}^{D−1} (x[i + 1 + (D + j)τ] − x[i + (D + j)τ])²

= R²_{i,i+1}(D) + Σ_{j=0}^{D−1} (x[i + 1 + Dτ + jτ] − x[i + Dτ + jτ])².

(A.6)
The lag time τ should be as small as possible to capture the shortest changes in the data, yet it should be large enough to generate the maximum possible independence between the components of the vectors in the phase space. Since the time span Dτ of each vector in the phase space represents the duration of a state of the system, it should be at most equal to the period of the dominant frequency: Dτ = N (both τ and N are integer multiples of the time step Δt). Therefore, x[i + Dτ] = x[i + N] = x[i], where the last equality
means that the period of the attractor is N. Using this observation, the above expression simplifies to

R²_{i,i+1}(2D) = 2 R²_{i,i+1}(D).
(A.7)
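The doubling relation of equation A.7, and the implied scaling of the geometrical distance s_j, can be checked numerically on any periodic record sampled so that Dτ = N. The sinusoidal record and the particular indices below are illustrative choices.

```python
import math

def step_sq(x, i, tau, D):
    # Squared Euclidean step between successive delay vectors (equation A.3).
    return sum((x[i + 1 + k * tau] - x[i + k * tau]) ** 2 for k in range(D))

def s_geom(x, j, tau, D):
    # Geometrical distance after j steps along the trajectory (equation A.4).
    return sum(math.sqrt(step_sq(x, i, tau, D)) for i in range(j))

# Periodic record with period N, sampled so that D*tau = N as in the text.
N, D = 100, 2
tau = N // D
x = [math.sin(2 * math.pi * i / N) for i in range(6 * N)]

ratio = step_sq(x, 5, tau, 2 * D) / step_sq(x, 5, tau, D)
print(ratio)  # ≈ 2, as equation A.7 predicts
```

Because every squared step doubles, each distance s_j scales by √2 when the embedding dimension doubles, so the normalized ratio s_j(2D)/L(2D) = s_j(D)/L(D) is unchanged, consistent with the invariance claimed in the text.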
Thus, the geometrical distance s_j(D) after j steps along the limit cycle and the total length of the limit cycle L(D) show the same scaling with respect to the embedding dimension: s_k(2D)/L(2D) = s_k(D)/L(D). Our numerical computations show that the general scaling relationship is s_k(a·D) = s_k(D) a^c, with c slowly dependent on k and a saturation threshold c ≈ 0.5, which also supports our conjecture that the mapping between geometrical and temporal phase is a dynamical invariant. Appendix B: The Dimensionless Morris-Lecar Model Neuron
We use the Morris-Lecar model as an example of a simple neural model. Despite its simplicity, it produces a facsimile of the membrane potential envelope of a neural oscillator. The dimensionless equations are (Morris & Lecar, 1981; Rinzel & Lee, 1986; Rinzel & Ermentrout, 1998):

dV/dt = I_ext(t) − g_Ca m_∞(V)(V − V_Ca) − g_K w (V − V_K) − g_L (V − V_L) = f_1(V, w),
dw/dt = φ λ(V)(w_∞(V) − w) = f_2(V, w),

(B.1)

with m_∞(V) = (1/2)(1 + tanh((V − V_1)/V_2)), w_∞(V) = (1/2)(1 + tanh((V − V_3)/V_4)), and λ(V) = cosh((V − V_3)/(2V_4)). For type I excitability, we have (Rinzel & Ermentrout, 1998) the following values: V_1 = −0.01; V_2 = 0.15; V_3 = 0.1; V_4 = 0.145; V_K = −0.7; V_L = −0.5; V_Ca = 1.0; g_Ca = 1.33; g_K = 2.0; g_L = 0.5; φ = 0.33; I = 0.0705. For type II excitability, the parameters have the same values except V_3 = 0.017; V_4 = 0.25; g_Ca = 2.2; g_K = 4.0; g_L = 1.0; φ = 0.4; I = 0.4. Appendix C: Comparison with Previous Studies
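Before turning to the comparison with previous work, the appendix B equations can be sketched in code. This is a hedged illustration of equations B.1 with the type I parameter set; the forward-Euler step size, initial condition, and the choice of zero applied current (under which the excitable type I cell should settle to its stable resting state) are assumptions of this sketch, not simulations from the article.

```python
import math

# Type I parameter set from appendix B (dimensionless).
P = dict(V1=-0.01, V2=0.15, V3=0.1, V4=0.145, VK=-0.7, VL=-0.5,
         VCa=1.0, gCa=1.33, gK=2.0, gL=0.5, phi=0.33)

def ml_rhs(V, w, I, p=P):
    """Right-hand side f1, f2 of the dimensionless Morris-Lecar equations (B.1)."""
    m_inf = 0.5 * (1.0 + math.tanh((V - p['V1']) / p['V2']))
    w_inf = 0.5 * (1.0 + math.tanh((V - p['V3']) / p['V4']))
    lam = math.cosh((V - p['V3']) / (2.0 * p['V4']))
    f1 = (I - p['gCa'] * m_inf * (V - p['VCa'])
          - p['gK'] * w * (V - p['VK']) - p['gL'] * (V - p['VL']))
    f2 = p['phi'] * lam * (w_inf - w)
    return f1, f2

# Forward-Euler integration with no applied current (I_ext = 0): the
# trajectory should converge to a rest point where f1 = f2 = 0.
V, w, dt = -0.5, 0.0, 0.01
for _ in range(20000):
    f1, f2 = ml_rhs(V, w, 0.0)
    V, w = V + dt * f1, w + dt * f2
print(V, abs(f1) + abs(f2))  # resting potential; derivatives near zero
```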
There are similarities between our method of predicting phase resetting and those given in the appendix of Ermentrout and Kopell (1991a, 1991b), but there are some significant differences as well. Our derivation uses geometrical intuition, whereas that of Ermentrout and Kopell is analytic. In Ermentrout and Kopell (1991a, 1991b), a system of N state variables representing oscillator k that is coupled to oscillator j as a function of time,

du_k/dt = f_k(u_k) + g(u_k, u_j),
(C.1)
is converted by a change of variables to a system that is a function of phase (θ) around the limit cycle solution to the original equations, as well as N − 1 variables that measure normal distance from the limit cycle. These latter N − 1 variables can be ignored by assuming a strong attraction to the limit cycle, which renders these N − 1 variables negligibly small (as in a type I oscillator). Ermentrout and Kopell further assume that the phase variables of two coupled oscillators are never perturbed independently, which reduces the system to a single phase variable. The vector U_k′ contains the derivatives evaluated along the limit cycle of each of the original state variables with respect to phase (dU_k/dθ):

ω du_k/dt = U_k′(θ) dθ_k/dt,

(C.2)
where θ(t) = ωt and dθ = ω dt. Combining equations C.1 and C.2 yields

ω^{−1} U_k′(θ) dθ_k/dt = f_k(U_k) + g(U_k, U_j).

(C.3)
Multiplying both sides of equation C.3 by the inverse of U_k′, which is U_k′^T/||U_k′||², and using the fact that f_k(U_k) = U_k′(θ(t)), the following expression results:

dθ_k/dt = ω + ω (U_k′^T/||U_k′||²) g(U_k, U_j).

(C.4)
A similar expression appears in Hoppensteadt and Izhikevich (1997, p. 255). By analogy with equation A4 of Ermentrout and Kopell (1991a, 1991b), we can define h_k(θ_k, θ_j) as (U_k′/r_k) g(U_k, U_j). Both our derivation and that of Ermentrout and Kopell (1991a, 1991b) take advantage of the fact that the coupling is in one variable (V) only, hence the particular form of the perturbation they assumed in a system of two coupled Morris-Lecar oscillators:

g(U_k, U_j) = [g_ex(V_j(θ))(V_ex − V_k(θ)), 0].
(C.5)
The nonzero term is simply the scaled instantaneous current and is equivalent to the scaled instantaneous current in our derivation, I/C_M. We assumed a constant pulse, but any arbitrary pulse can be used, and the integral of that scaled current over the perturbing pulse can be substituted for (Iτ)/C_M, or ΔV, in our formulation. Thus, the final expression to be used for comparison with our results is

dθ/dt = ω + ω i V′(θ)/(V′(θ)² + w′(θ)²),
(C.6)
where i is the instantaneous scaled current from equation C.5 and V′(θ) is the only surviving element of U_k′. Just as in our results, described below, this gives the phase resetting due to a perturbation by projecting the perturbation in voltage onto the limit cycle. However, the intuition behind this reasoning is described only in this article. The most logical way to compare our results with those of Ermentrout and Kopell is to use the product form of the coupling h_k(θ_k, θ_j) = P(θ)R(θ) and to set P(θ) = i and

R(θ) = V′(θ)/(V′(θ)² + w′(θ)²).
(C.7)
This is not how Ermentrout and Kopell (1991a, 1991b) defined R(θ) and P(θ): they included the presynaptic voltage terms in R(θ) and all postsynaptic terms in P(θ). Since they used a synaptic current for the perturbation rather than a square pulse, the current term g_ex(V(θ_j))(V_ex − V(θ_k)) was not equivalent to R(θ), and the P(θ) term included (V_ex − V(θ_k)). On the other hand, we separated the terms into a term representing the dynamics near the original unperturbed limit cycle (R(θ)) and a term representing the perturbation in voltage (P(θ)). If one assumes that i is nonzero for a very brief period of time and that, for simplicity, the integral of i over this brief interval is 1 (like a delta function), then one can assume that R(θ) is constant at its value at the time the perturbation is initiated. The θ on the left-hand side of equation C.6 varies with a constant angular velocity over most of the limit cycle (everywhere except when it receives a perturbation; see Figure 11). The temporal phase resetting in response to an instantaneous pulse is then a scaled version of R(θ) (see Figure 11):

F(θ) = −ω^{−1} R(θ).
(C.8)
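The quantity R of equation C.7 can be computed from any sampled orbit by finite differences. The sketch below checks this against a closed form on a harmonic "limit cycle" V = cos t, w = sin t, for which R(t) = −sin t; this analytic test case is an assumption of the sketch, not an oscillator from the article.

```python
import math

def infinitesimal_prc(V, w, dt):
    """R = V'/(V'^2 + w'^2) along a sampled orbit (equations C.6/C.7),
    with derivatives estimated by central differences."""
    R = []
    for i in range(1, len(V) - 1):
        dV = (V[i + 1] - V[i - 1]) / (2.0 * dt)
        dw = (w[i + 1] - w[i - 1]) / (2.0 * dt)
        R.append(dV / (dV * dV + dw * dw))
    return R

# Sanity check: for V = cos t, w = sin t (omega = 1), R(t) = -sin t exactly.
dt = 2.0 * math.pi / 1000
t = [k * dt for k in range(1001)]
R = infinitesimal_prc([math.cos(s) for s in t], [math.sin(s) for s in t], dt)
print(R[250], -math.sin(t[251]))  # both ≈ -1
```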
In our derivation, we define a new quantity s, the geometrical phase, that is, distance along the limit cycle as a function of time. This function, unlike the temporal phase θ(t) used by Ermentrout and Kopell (1991a) and by Hoppensteadt and Izhikevich (1997), has a variable derivative along the limit cycle:

s = s(t) = (1/L) ∫_0^t ||u̇(t)|| dt,

(C.9)
where L is the length of the limit cycle in the appropriate coordinate system and the dot indicates the derivative with respect to time. Thus, ds = ||u̇(t)|| dt. We show that for Morris-Lecar,

ds = V̇(t) / √(V̇(t)² + ẇ(t)²).

(C.10)
Figure 11: Relationship of the PRC F(θ) to a change in angular displacement, R(θ). This is a plot of constant angular velocity of the temporal phase that indicates the resetting produced by Ermentrout and Kopell's (1991a, 1991b) R(θ).
From equation C.9, we see that for Morris-Lecar,

ds = √(V̇(t)² + ẇ(t)²) dt.

(C.11)
Furthermore, the temporal phase resetting (Δφ) can be obtained using the inverse of s(t):

Δφ = s^{−1}(s).
(C.12)
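The normalized arc length of equation C.9, whose inverse appears in equation C.12, can be computed from a sampled trajectory. The circular test orbit below, on which geometrical and temporal phase coincide, is an illustrative assumption of this sketch.

```python
import math

def geometric_phase_curve(V, w):
    """Normalized arc length s(t) along a sampled trajectory (equation C.9):
    cumulative Euclidean path length divided by the total cycle length L."""
    s = [0.0]
    for i in range(1, len(V)):
        ds = math.hypot(V[i] - V[i - 1], w[i] - w[i - 1])
        s.append(s[-1] + ds)
    L = s[-1]
    return [v / L for v in s]

# One period of a circular orbit traversed at constant speed: the
# geometrical phase grows linearly with time.
n = 1000
V = [math.cos(2 * math.pi * k / n) for k in range(n + 1)]
w = [math.sin(2 * math.pi * k / n) for k in range(n + 1)]
s = geometric_phase_curve(V, w)
print(s[n // 4])  # ≈ 0.25 at a quarter of the cycle
```

For a noncircular limit cycle the speed ||u̇|| varies, s(t) is no longer linear in t, and inverting it as in equation C.12 is exactly what maps geometrical back to temporal phase.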
We can recover the result of Ermentrout and Kopell (1991a) for the temporal phase resetting R(θ) given in equation C.8 by equating R(θ) with the instantaneous change in temporal phase dt due to a delta function perturbation and then substituting equation C.12 into C.11:

dt = V′(t)/(V′(t)² + w′(t)²),
(C.13)
noting that in Ermentrout and Kopell, time t and phase φ are related by a constant. The expression given in equation C.13 is the adjoint, or infinitesimal PRC, as given in equation 3.2 of Hansel, Mato, and Meunier (1995) and equation A21 of Ermentrout and Kopell (1991b), provided that variations in
amplitude (perturbations normal to the limit cycle) are ignored. Izhikevich (2000) also obtained a similar result, but in the limit as the derivative of the slow variable (w′(t)² in C.13) goes to zero. Although previous work in the literature contains expressions that are mathematically similar to the ones presented in this article, it does not contain the geometrical intuition presented here. This intuition allows for the possibility of extending PRC analysis to type II model neurons and even to physiological type II neurons for which we do not know the equations of state but for which we can reconstruct an attractor. Acknowledgments
This work was supported by Whitaker Foundation grant TF-98-0033 and NSF grant IBN-0118117. References Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. Sh. (1993). The analysis of observed chaotic data in physical systems. Rev. Mod. Phys., 65(4), 1331–1392. Abramovich-Sivan, S., & Akselrod, S. (1998a). A PRC based model of a pacemaker cell: Effect of vagal activity and investigation of the respiratory sinus arrhythmia. J. Theor. Biol., 192, 219–243. Abramovich-Sivan, S., & Akselrod, S. (1998b). A phase response curve based model: Effect of vagal and sympathetic stimulation and interaction on a pacemaker cell. J. Theor. Biol., 192, 567–579. Arshavsky, Yu. I., Beloozerova, I. N., Orlovsky, G. N., Panchin, Yu. V., & Pavlova, G. A. (1985). Control of locomotion in marine mollusc Clione limacina I. Efferent activity during actual and fictitious swimming. Exp. Brain Res., 58, 255–262. Baxter, R. J., Canavier, C. C., Lechner, H. A., Butera, R. J., DeFranceschi, A. A., Clark, J. W., & Byrne, J. H. (2000). Coexisting stable oscillatory states in single cell and multicellular neuronal oscillators. In D. Levine, V. Brown, and T. Shirley (Eds.), Oscillations in neural systems (pp. 51–77). Hillsdale, NJ: Erlbaum. Beer, R. D., Chiel, H. J., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking II. General principles and individual variability. J. Comp. Neurosci., 7(2), 119–147. Canavier, C. C., Baxter, D. A., Clark, J. W., & Byrne, J. H. (1999). Control of multistability in ring circuits of oscillators. Biol. Cybernetics, 80, 87–102. Canavier, C. C., Butera, R. J., Dror, R. O., Baxter, D. A., Clark, J. W., & Byrne, J. H. (1997). Phase response characteristics of model neurons determine which patterns are expressed in a ring circuit model of gait generator. Biol. Cybernetics, 77, 367–380. Chiel, H. J., Beer, R. D., & Gallagher, J. C. (1999). Evaluation and analysis of model CPGs for walking: I. Dynamical models. J. Comp. Neurosci., 7, 1–20.
Collins, J. J., & Richmond, S. A. (1994). Hard-wired central pattern generators for quadrupedal locomotion. Biol. Cybern., 71, 375–385. Collins, J. J., & Stewart, I. N. (1993). Coupled nonlinear oscillators and the symmetries of animal gaits. J. Nonlin. Sci., 3, 349–392. Dror, R. O., Canavier, C. C., Butera, R. J., Clark, J. W., & Byrne, J. H. (1999). A mathematical criterion based on phase response curves for the stability of a ring network of oscillators. Biol. Cybern., 80, 11–23. Ermentrout, G. B. (1985). The behavior of rings of coupled oscillators. J. Math. Biol., 23, 55–74. Ermentrout, G. B. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253. Ermentrout, G. B. (1994). Neural modelling and neural networks. Oxford: Pergamon Press. Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001. Ermentrout, G. B., & Kopell, N. (1991a). Oscillator death in systems of coupled neural oscillators. SIAM J. Appl. Math., 50, 125–146. Ermentrout, G. B., & Kopell, N. (1991b). Multiple pulse interactions and averaging in coupled neural oscillators. J. Math. Biol., 29, 195–217. FitzHugh, R. (1969). Mathematical models of excitation and propagation in nerve. In H. P. Schwan (Ed.), Biological engineering (pp. 1–10). New York: McGraw-Hill. Foss, J., & Milton, J. (2000). Multistability in recurrent neural loops arising from delay. J. Neurophysiology, 84, 975–985. Glass, L., & Mackey, M. (1988). From clocks to chaos: The rhythms of life. Princeton, NJ: Princeton University Press. Golubitsky, M., Stewart, I., Buono, P.-L., & Collins, J. J. (1998). A modular network for legged locomotion. Physica D, 115, 56–72. Guckenheimer, J., & Holmes, P. (1983). Nonlinear oscillations, dynamical systems, and bifurcations of vector fields. Berlin: Springer-Verlag. Hansel, D., Mato, G., & Meunier, C. (1995). Synchronization in excitatory neural networks.
Neural Computation, 7, 307–337. Hodgkin, A. L. (1948). The local electric changes associated with repetitive action in a non-medullated axon. J. Physiology, 107, 165–181. Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag. Izhikevich, E. M. (2000). Phase equations for relaxation oscillators. SIAM J. Appl. Math., 60, 1789–1804. Kopell, N., & Ermentrout, G. B. (1988). Coupled oscillators and the design of central pattern generators. Math. Biosci., 90, 87–109. Mirollo, R. I., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662. Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213. Murray, J. D. (1993). Mathematical biology (2nd ed.). New York: Springer-Verlag. Nishii, J. (1999). Learning model for coupled neural oscillators. Network: Comp. Neural Syst., 10, 213–226.
Oprisan, S. A., & Canavier, C. C. (2000a). Phase response curve via multiple time scales analysis of the limit cycle behavior of type I and type II excitability. Biophys. J., 78, 218A. Oprisan, S. A., & Canavier, C. C. (2000b). Analysis of central pattern generator circuits using phase resetting. Soc. Neurosci. Abstr., 26, 2000. Oprisan, S. A., & Canavier, C. C. (2001). The structure of instantaneous phase resetting in a neural oscillator. InterJournal of Complex Systems, manuscript 386. Available on-line at: www.interjournal.org/. Pavlidis, T. (1973). Biological oscillators: Their mathematical analysis. Orlando, FL: Academic Press. Pinsker, H. M. (1977). Aplysia bursting neurons as endogenous oscillators: I. Phase-response curves for pulsed inhibitory synaptic input. J. Neurophysiology, 40, 527–543. Rinzel, J., & Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (pp. 251–292). Cambridge, MA: MIT Press. Rinzel, J., & Lee, Y. S. (1986). On different mechanisms for membrane potential bursting. In H. G. Othmer (Ed.), Nonlinear oscillations in biology and chemistry. New York: Springer-Verlag. Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L. S. Young (Eds.), Dynamical systems and turbulence, Warwick 1980. Berlin: Springer-Verlag. Traub, R. D., & Miles, R. (1991). Neural networks of the hippocampus. Cambridge: Cambridge University Press. Wilson, M. A., & Bower, J. M. (1989). The simulation of large-scale neural networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press. Winfree, A. (1980). The geometry of biological time. New York: Springer-Verlag. Received March 5, 2001; accepted July 25, 2001.
LETTER
Communicated by Steven Nowlan
Fields as Limit Functions of Stochastic Discrimination and Their Adaptability Philip Van Loocke
[email protected] Lab for Applied Epistemology, University of Ghent, 9000 Ghent, Belgium. For a particular type of elementary function, stochastic discrimination is shown to have an analytic limit function. Classifications can be performed directly by this limit function instead of by a sampling procedure. The limit function has an interpretation in terms of fields that originate from the training examples of a classification problem. Fields depend on the global configuration of the training points. The classification of a point in input space is known when the contributions of all fields are summed. Two modifications of the limit function are proposed. First, for nonlinear problems like high-dimensional parity problems, fields can be quantized. This leads to classification functions with perfect generalization for high-dimensional parity problems. Second, fields can be provided with adaptable amplitudes. The classification corresponding to a limit function is taken as an initialization; subsequently, amplitudes are adapted until an error function for the test set reaches a minimal value. It is illustrated that this increases the performance of stochastic discrimination. Due to the nature of the fields, generalization improves even if the amplitude of every training example is adaptable. 1 Introduction
In recent years, different types of random discrimination methods for classification have been studied (Kleinberg, 1996, 2000; Ho, 1995; Berlind, 1994; Van Loocke, 2000, 2001). Such methods generate a huge number of elementary functions (which correspond to units in the present context) to define a classification. They run counter to the intuition that generalization is best when the number of units in a network is kept relatively low. The philosophy behind this approach is that the errors generated by an individual function are smoothed out by the other functions. Every individual function has a poor performance on a problem. Therefore, random discrimination methods have been called weak methods. Each time a new function is included, however, the statistical properties of the sum of the functions tend to improve. On an intuitive level, the statistical fact behind the effectiveness of random discrimination can be formulated as follows. Consider a training set Neural Computation, 14, 1059–1070 (2002) © 2002 Massachusetts Institute of Technology
with N input-output pairs, and suppose that the vector v ≡ (v(u_1), . . . , v(u_N)) is the vector composed of all output values; v(u_i) is the output value corresponding to the ith training input u_i. The input space is denoted I and has n dimensions. Since we examine only classification problems in this article, the output space O can be assumed to be one-dimensional.1 Suppose that a set S of vectors with positive correlation with v is generated. Then, if S is sufficiently large and if the vectors of S are randomly distributed around v, v will be the average of the vectors in S. Suppose that a set of elementary functions f (with f: I → O) is considered. For every elementary function f, the vector v_f composed of the output values for all training instances can be formed. If the vectors v_f have positive correlation with v and if they are randomly distributed around v, the average of the elementary functions will converge to v when the number of elementary functions increases. When this supposition is turned into a practical method, a series of functions such that v_f has varying correlation with v is generated. Every time a function is met for which the correlation between v_f and v is positive, the function is selected, and v_f is put in the set S. The latter is extended until a satisfying classification is obtained. The mathematics of random discrimination has been studied in detail by Kleinberg (1996, 2000). We recapitulate only the core mathematical fact. Suppose that a classification problem divides an input space I into two classes A and B. Suppose that A and B correspond to outputs 1 and 0. Consider an elementary function f that defines a partition into two classes A_f and B_f. A point q ∈ I is mapped to 1 by f if q ∈ A_f; it is mapped to 0 if q ∈ B_f.
Suppose that different functions f are generated in such a way that two criteria are met: (1) if a point q belongs to A, then p(q ∈ A_f) > p(q ∈ B_f), and (2) when different functions f are considered, the chance that a point q is a member of A_f is independent of q (meaning that the parts receiving output value 1 cover I uniformly when different functions f are considered), and suppose that the same is true for B_f. Then the variable

X(q, f) = (f(q) − p(q ∈ A_f | q ∈ B)) / (p(q ∈ A_f | q ∈ A) − p(q ∈ A_f | q ∈ B))

has average 1 for a point q belonging to A and average 0 for a point belonging to B (see Kleinberg, 1996, 2000). It follows from the central limit theorem that the average over t different elementary functions f_k,

Y(q) = (1/t) Σ_{k=1}^{t} X(q, f_k),

is normally distributed with average 1 if q belongs to A and with average 0 if q belongs to B. The variance of this variable is 1/t times the variance of X(q, f), so that for high values of t, Y(q) can be considered as the output of an efficient classifier system.

[1] We confine ourselves to examples that are binary classification problems. The generalization of this situation to classifications in more than two classes is straightforward and can be carried out as described in Kleinberg (2000).
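The averaging argument above can be checked with a toy Monte Carlo experiment. The following sketch is our own illustration, not the paper's code; the ten-point input space, the enrichment criterion, and all names are our assumptions:

```python
import random

random.seed(0)

# Toy input space: points 0..9; class A = {0,...,4} gets output 1, B gets 0.
I = list(range(10))
A = set(range(5))

def random_partition():
    """Each point of I lands in A_f independently with probability 1/2."""
    return {q for q in I if random.random() < 0.5}

# Sample only 'enriched' partitions: A_f must contain more points of A
# than B_f does (the sampling criterion discussed in section 1).
fs = []
while len(fs) < 5000:
    Af = random_partition()
    if len(Af & A) > len(A - Af):
        fs.append(Af)

# Empirical estimates of p(q in A_f | q in A) and p(q in A_f | q in B).
p1 = sum(len(Af & A) for Af in fs) / (len(fs) * len(A))
p0 = sum(len(Af - A) for Af in fs) / (len(fs) * (len(I) - len(A)))

def Y(q):
    """Average of X(q, f) over the sampled functions: near 1 on A, near 0 on B."""
    return sum(((q in Af) - p0) / (p1 - p0) for Af in fs) / len(fs)
```

With 5000 sampled functions, Y(q) lies close to 1 for every q in A and close to 0 for every q in B, with fluctuations of order 1/√t, as the central limit argument predicts.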
Fields as Limit Functions
Suppose that the input space I is discrete and that an elementary function f randomly partitions I into two classes, A_f and B_f. Suppose that a function is sampled if the number of vectors belonging to A_f that are in A is larger than the number of input vectors of B_f that are in A. For random partitions, the uniformity condition is met, so that sampling and averaging lead to an exact classification of all training examples. But random functions do not generalize, since for two points q and q′ that do not both belong to the training set, f(q) and f(q′) are uncorrelated, however close the points are. Therefore, random discrimination methods use elementary functions that are smooth. As a consequence, the uniformity condition is generally not met. Kleinberg (1996, 2000) has shown that for some choices of f, weakening this condition still leads to good performance on training sets while allowing for good generalization properties. In Kleinberg's work, the functions f take the following form.[2] Consider a point q in I that belongs to a training couple. A rectangular parallelepiped is constructed as follows: for every dimension, a random number between q and the maximal value for this dimension and a random number between q and the minimal value for this dimension are generated. The parallelepiped is obtained as the set of points of I whose coordinates lie between these numbers for all dimensions. The parallelepiped and its complement define A_f and B_f. Different functions result for different choices of q and of the random numbers that define the boundaries of the parallelepipeds. In this article, we introduce a different set of elementary functions. Basically, they are defined by randomly selecting a number of training vectors and allowing these vectors to split up the input space in terms of a nearest-neighbor condition.
The special feature of these functions is that when the number of functions sampled increases, the average function can be written in a relatively simple analytical form L(q). This form can be used directly to perform the classification task. Basically, the classification of an input vector q requires one to rank all training vectors u_i in terms of their distance to q. A power of this rank determines the contribution of u_i to the output of q (see section 2). Then we consider two possibilities to modify the form L(q). The first is aimed at problems that are strongly nonlinear. It adds a factor to L(q) that accounts for the spherically differentiated predictive value of a training vector. For every training vector u, the input space is divided into concentric hyperspheres with center u and with increasing radius, and the relative number of training vectors with the same classification as u in the region between two successive hyperspheres is determined. This number determines the extent to which u is allowed to contribute to the outputs of the points located in this region (see section 3). As a second modification of the algorithm, every

[2] In Van Loocke (2000, 2001), elementary functions of a sigmoid type have been shown to be efficient at nonlinear problems such as high-dimensional parity problems. These functions were applied to binary problems only.
training item receives an amplitude. The amplitude specifies the weight that is given to the training item in the limit function. Initially, all amplitudes are set to one. An incremental procedure is used to search for amplitudes that minimize the error on the training set. The nature of the limit function guards against overtraining, even if the number of parameters that can be adapted equals the number of training vectors (see section 4). The resulting classification method is applied to two types of problems. Section 5 considers two generative binary problems: an eight-dimensional linear problem and an eight-dimensional parity problem. It is shown that the method gives perfect classifications for training sets and also generalizes perfectly, even in the case of parity problems. In the latter case, perfect generalization is obtained without the need to adapt amplitudes. In section 6, the method is applied to two real-world data sets. When compared to the 10 random discrimination methods evaluated in Kleinberg (2000), the method increases performance for both data sets. Intuitively, the approach explored here considers every training example as the source of a field in the input space. In every point q of the input space, this field is determined by the output associated with a source and by a power of the local rank of the source. The resulting quantities are summed, which yields the total local field in q. If it is suspected that the resulting classifier is not yet optimal, the fields are provided with amplitudes, and these are adapted by an incremental procedure that minimizes an error function. Section 7 briefly discusses this intuition.

2 A Limit Function for a Particular Type of Stochastic Discrimination
We define an elementary function f as follows. Select randomly k training items (u_is, v(u_is)) from the training set (s = 1, ..., k); u_is has n continuous or discrete components, and v(u_is) is a binary number (k < N; as explained below, different values of k result in different types of limit functions). From this section on, the binary class-specifying values are +1 and −1 instead of 1 and 0. Then f is determined by a nearest-neighbor condition: for every point q of the input space, the point u_c in {u_i1, ..., u_ik} that is closest to q is determined, and f is defined by f(q) = v(u_c). If k = 2, then f draws a hyperplane in the input space. The hyperplane is orthogonal to the line connecting u_i1 and u_i2 and passes through the middle of the line segment [u_i1, u_i2]. All points on the side of u_i1 receive the same output as u_i1, and the output of all points on the side of u_i2 is v(u_i2). For k = 3, the input space is divided into three parts with linear separations, and so on. A function f in general does not necessarily obey the criterion that more items of A are mapped on +1 than on −1, or that more items of B are mapped on −1 than on +1 (A and B are defined as in section 1: A is the set of input training points mapped on +1, and conversely for B). The function f maps k training points correctly with certainty. The extent to which the other N − k training points are mapped correctly depends on the configuration of the latter in
the input space. The higher k is, the higher is the chance that f obeys the criterion that at least half of the training items are classified correctly. We include all functions in the sampling procedure even though this criterion may occasionally not be obeyed for low values of k. In this sense, we are more lenient with our sampling functions than in the situation described in section 1. Consider a point q in the input space and a point u that appears as the first argument of a training couple. Suppose that a large set of functions f is sampled for a fixed value of k. By definition, the chance ρ(q, u) that q is directed by u is the chance that u is the point used to specify the output of q. This chance is equal to the chance that u appears in a selected set {u_i1, ..., u_ik} and that u is the point of these k points closest to q. The former chance is the same for every point u and will be denoted C. In order to obtain the chance that a point u in the set {u_i1, ..., u_ik} is the closest point to q, it can be assumed without loss of generality that u coincides with u_i1. The chance that q is closer to u_i1 than to u_i2, and closer to u_i1 than to u_i3, ..., and closer to u_i1 than to u_ik, is equal to the (k − 1)th power of the chance that q is closer to u_i1 than to an arbitrary other point of the training set. Suppose that the input points of the training set are ranked in terms of their distance to q, and suppose that u is the point of the training set that is the rth closest to q. This rank is a function of q and u: r ≡ r(q, u). Suppose that all training points have different distances to q. Then the training set has N + 1 − r points that are at least as far from q as u, so that the chance that u is the closest point is equal to (N + 1 − r(q, u))/N. Hence, we obtain

ρ(q, u) = C × ((N + 1 − r(q, u))/N)^(k−1).
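As a concrete sketch (ours, not the paper's code; a one-dimensional input space and all names are our assumptions), an elementary function of this type is just a nearest-neighbor rule over a random k-subset of the training set:

```python
import random

def elementary_f(q, subset, labels):
    """Nearest-neighbour elementary function: q receives the label (+1 or -1)
    of the closest point among the k selected training points."""
    closest = min(subset, key=lambda u: abs(u - q))
    return labels[closest]

def sample_f(train, labels, k, rng=random):
    """Sampling a function amounts to drawing a random k-subset of the training set."""
    subset = rng.sample(train, k)
    return lambda q: elementary_f(q, subset, labels)
```

For k = 2 this is exactly the midpoint rule described above: every point on u_i1's side of the separating hyperplane (in one dimension, the midpoint) receives v(u_i1).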
For this type of elementary function f, a point q of the input space is directed by a single training point only. Hence, the chance that q is directed by a point belonging to A is equal to the sum of the chances that it is directed by the individual elements of A; the same is true for B. Hence,

p(f(q) = 1) = Σ_{u∈A} ρ(q, u)

and

p(f(q) = −1) = Σ_{u∈B} ρ(q, u).
Suppose now that a very large number of functions f is sampled and that the sign of their average determines the classification of q. Then q will be mapped on 1 if p(f(q) = 1) > p(f(q) = −1); else, it will be mapped on −1. This can be written in a slightly different form: L(q) ≡ sgn⟨f⟩ takes the form

L(q) = 1    if  Σ_{u∈A} ρ(q, u) − Σ_{u∈B} ρ(q, u) > 0
L(q) = −1   if  Σ_{u∈A} ρ(q, u) − Σ_{u∈B} ρ(q, u) < 0.
Since v(u) = 1 for points u of A, and conversely for B, these expressions can be summarized in a single expression:

L(q) = sgn(Σ_u ρ(q, u) × v(u)),
where the sum ranges over the entire training set. Due to the presence of the signum function, the constant C in the expression for ρ(q, u) can be omitted, and we obtain

L(q) = sgn(Σ_u v(u) × ((N + 1 − r(q, u))/N)^(k−1)).

Different values of k result in different limit functions. The above expression entails that the higher k is, the stronger is the influence exerted by the closest training points on q. Therefore, if k is taken too high, L(q) acts too much like an ordinary nearest-neighbor classifier, and configurations of training points in the broader neighborhood of q cannot exert influence on the classification. On the other hand, if k is given the lowest possible value (k = 2), then the functions f of which L(q) is the limit function have a higher chance of not being weak classifiers. As illustrated in sections 5 and 6, k = 3 appears to be an adequate compromise between these situations. Instead of generating and averaging high numbers of elementary functions f, one can turn directly to the limit function L(q) to perform a classification. This yields a more accurate application of the method, and it also significantly reduces computation time if large numbers of elementary functions are needed. In the context of this paper, an equally important fact about having an explicit expression for L(q) is that it can be modified in two natural ways that can lead to more optimal classifications.

3 Radial Differentiation and Quantization of Spherical Fields
The expression for L(q) can be read as follows. Each point u in I that belongs to a training item exerts an influence on all points of the input space. This influence consists of a positive stimulation to take the same output as u. The point u can be considered the source of this influence. If the training set has a uniform distribution around u, then the influence of u decreases as a function of the distance only. If this condition does not hold, u may exert high influence over a long range in one direction and have faster-decaying influence in another direction if the latter has a higher density of training points. The function L(q) is limited in its ability to grasp strongly nonlinear structure. Consider, for instance, a parity problem. If a (binary) point receives the strongest stimulation from its closest (binary) neighbors, then L will have a strong tendency to produce the wrong output, since these neighbors have different parity. For this type of problem, L(q) can be complemented in a natural way with a factor that corresponds to a spherical quantization of the fields. Suppose that for every training point u, ν concentric hyperspheres are considered, with increasing diameters given by s × μ, where μ is the diameter of the smallest hypersphere (s = 1, ..., ν). The relative number of items in the smallest hypersphere with the same classification as u is denoted κ(u, 1). The relative number of items located between the (s − 1)th
and the sth hypersphere and with the same classification as u is κ(u, s). The latter number can be used as a quality score for the contribution of u to the sth spherical band around u (if the sth band contains no training items, the quality score is set equal to 0.5). Then the following modification of L(q) results. For every point q of the input space that has to be classified and for every training point u, the spherical band around u to which q belongs is determined. The contribution of u to the classification of q is weighted by the quality score of this band. Suppose that q is located in the s(q)th band around u. Then the function L(q) can be modified into K(q) with

K(q) = sgn(Σ_u v(u) × κ(u, s(q)) × ((N + 1 − r(q, u))/N)^(k−1)).

The introduction of the factor κ(u, s(q)) allows this method to grasp structural properties that are reducible to sums of spherical regularities around training points. The factor can be seen as a coefficient in an analysis in terms of spherical bands. Unlike in Fourier analysis, for instance, the coefficients multiply not orthogonal functions but another factor that codetermines the weight of v(u). Intuitively, u becomes the source of a spherically symmetric field that is quantized in ν bands but is directionally invariant. This field multiplies the nonsymmetric factor derived in the previous section. There are other ways in which L can be modified with spherically defined differentiation. Consider a noisy training point u in the input space. For such a point, the closest training point with a different output value will on average be closer than for nonnoisy training points. Therefore, one can determine the largest hypersphere for which the training points have the same output as u and increase the influence of u within this hypersphere. Then, nonnoisy points will tend to be associated with larger hyperspheres in which they exert increased influence. This modification can be stated formally as follows.
Suppose that w(u, q) takes the value 1 for every point q outside the hypersphere associated with u, and the value a, with a > 1, for points inside the hypersphere. Then one can consider the alternative classification function K′(q) with

K′(q) = sgn(Σ_u v(u) × w(u, q) × ((N + 1 − r(q, u))/N)^(k−1)).

The improvement of K′(q) relative to L(q) should be expected to be limited. In effect, L(q) itself suppresses noise, since in every region that corresponds clearly to one of the two classes, noisy points tend to be more sparsely distributed than nonnoisy points, so that their influence is often canceled by the majority of nonnoisy points in the neighborhood of this region. This turns out to be true for the two practical examples studied in section 6, for which K′(q) yields only an occasional and small improvement relative to L(q).
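The limit function and its band-weighted variant can be implemented directly from the formulas above. The following sketch is our own illustration (one-dimensional inputs; a `quality` callback stands in for κ(u, s(q))); with all quality scores equal to 1, K(q) reduces to the plain limit function L(q):

```python
def ranks(q, train):
    """Rank r(q, u) of every training point u by distance to q (1 = closest)."""
    order = sorted(train, key=lambda u: abs(u - q))
    return {u: i + 1 for i, u in enumerate(order)}

def K(q, train, labels, k=3, quality=None):
    """Band-weighted limit function K(q). With quality None (all scores 1),
    this is the plain limit function L(q)."""
    N = len(train)
    r = ranks(q, train)
    total = sum(
        labels[u]
        * (1.0 if quality is None else quality(u, q))
        * ((N + 1 - r[u]) / N) ** (k - 1)
        for u in train
    )
    return 1 if total > 0 else -1
```

For instance, with training points 0 and 1 labeled +1 and points 9 and 10 labeled −1, K classifies q = 2 as +1 and q = 8 as −1: the two nearest sources dominate, but the farther sources still contribute with weight ((N + 1 − r)/N)^(k−1).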
4 Adaptable Amplitudes
For fixed values of k, μ, and a, the functions L(q), K(q), and K′(q) are not adaptable: they do not contain parameters that can be systematically varied in order to enhance classifications. A simple way to insert parameters is to provide every training example with an amplitude a(u). Then for K(q), the new classification function reads

MK(q) = sgn(Σ_u v(u) × a(u) × κ(u, s(q)) × ((N + 1 − r(q, u))/N)^(k−1)),
and similarly for L(q) and K′(q) (the versions of the latter with modifiable amplitudes are denoted ML(q) and MK′(q)). Suppose that initially all amplitudes are set to one. Then the initial classification function MK(q) coincides with K(q). Subsequently, amplitudes are allowed to deviate from one if this enhances the performance on the training set. This procedure can be continued until the classification error on a test set starts to increase. The algorithm by which amplitudes are adapted requires introducing a continuous error function as well as continuous outputs. To this end, the sign function in the expressions for MK(q), ML(q), and MK′(q) is replaced by the familiar squashing function f(x) = 1/(1 + exp(−ε × x)); in the case of MK(q), this yields the continuous value C(q):

C(q) = 1/(1 + exp(−ε × Σ_u v(u) × a(u) × κ(u, s(q)) × ((N + 1 − r(q, u))/N)^(k−1))).
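A minimal sketch of the adaptation scheme of this section, under our own simplifying assumptions (one-dimensional inputs, the ML form without the band factor): the sign is replaced by the logistic squashing function, a squared error over the training set is evaluated, and random perturbations of λ amplitudes are kept only when the error decreases:

```python
import math
import random

def C(q, train, labels, amps, eps=0.01, k=2):
    """Continuous output: sigmoid of the amplitude-weighted field sum
    (ML form: the band factor kappa(u, s(q)) is omitted)."""
    N = len(train)
    order = sorted(train, key=lambda u: abs(u - q))
    r = {u: i + 1 for i, u in enumerate(order)}
    s = sum(labels[u] * amps[u] * ((N + 1 - r[u]) / N) ** (k - 1) for u in train)
    return 1.0 / (1.0 + math.exp(-eps * s))

def error(train, labels, amps):
    """E = sum over training items of (v(u) - C(u))^2."""
    return sum((labels[u] - C(u, train, labels, amps)) ** 2 for u in train)

def adapt(train, labels, steps=2000, lam=2, kappa=2.0, rng=random):
    amps = {u: 1.0 for u in train}          # all amplitudes start at one
    E = error(train, labels, amps)
    for _ in range(steps):
        chosen = rng.sample(train, min(lam, len(train)))
        trial = dict(amps)
        for u in chosen:                    # perturb lambda amplitudes at random
            trial[u] += rng.uniform(-kappa, kappa)
        E_trial = error(train, labels, trial)
        if E_trial < E:                     # keep the change only if E decreases
            amps, E = trial, E_trial
    return amps, E
```

By construction the error is monotonically nonincreasing, which is the essence of the Hooke-Jeeves-style search used in the paper's experiments.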
Then an error function is defined for the training set by E = Σ_u (v(u) − C(u))². In the examples in sections 5 and 6, the amplitudes are adapted according to a simple random scheme. At every time step, λ amplitudes are randomly selected. A random number between −κ and κ is added to each of these amplitudes, and if this leads to a lower value of the error function, the change in amplitudes is carried through (this corresponds to a Hooke-Jeeves search in a classical optimization context; see Reklaitis, Ravindran, & Ragsdell, 1983). As will become apparent in the examples, the insertion of amplitude parameters leads to better generalization. At first sight, this may seem counterintuitive because of the relatively large number of parameters that are inserted. In the case of a backpropagation network, for instance, the total number of connections and biases is usually kept significantly below the total number of training examples. Networks with too many units tend to generalize poorly because they overfit the training data. Consider a network with at least one hidden layer. For every unit that is added to the first hidden layer, another hyperplane separation of the input space is included in the function calculated by the network. Therefore, the more units there are at this layer, the higher is the chance that neighboring points in input space are mapped on output values corresponding to different classifications. Nonlinear functions calculated by networks with at least one hidden
layer become less smooth and more volatile when more units are included (Reed & Marks, 1999). For the present method, the situation is different. If amplitudes are made to differ from one, what changes is that some training points are allowed to exert higher influence, whereas others will have lower impact on their environment. This does not introduce additional boundaries (like the hyperplanes in the case of backpropagation) into the classification function, so that the volatility of the latter does not necessarily increase.

5 Illustration with Two Generative Problems
In this section, we confine ourselves to a binary eight-dimensional input space. For every dimension, the possible values are +1 and −1. The input space contains 2^8 = 256 different points. To start, consider a linear separation problem. For every element u ≡ (u_1, u_2, ..., u_8) of the training set, the output value v(u) is determined in accordance with the linear functional prescription w(u) = sgn(Σ_{i=1}^{8} c_i u_i), where the c_i are random numbers between −1 and +1. The input space is randomly divided into a training set and a test set of equal size. The classification function ML(q) with adaptable amplitudes was used to solve this problem. In the run illustrated in Figures 1 and 2, L(q) initially makes 11 errors on the training set and 9 errors on the test set. The initial values of E are 112.5 and 112.9 on the training set and the test set, respectively. After 18,000 iterations of the adaptation procedure for amplitudes, the error decreased to 8.8 on the training set and 20.3 on the test set (see Figure 1). Classification was perfect on both the training set and the test set after 9900 iterations (see Figure 2) (all simulations were made with κ = 2). This illustrates that noiseless, linear problems are easily solved by the present method. The initial classification errors produced by L(q) are typically lower than 10%, and amplitude adaptation makes them steadily decrease to zero. Next, consider an eight-dimensional parity problem. We use K(q) to solve this problem (the parameter μ was set to 0.06). When the input space was randomly divided into a training set and a test set of equal size, K(q) appeared to perform perfectly on both the training set and the test set, so that no further amplitude adaptation was required. The fact that this method generalizes on parity problems suggests that it can grasp relations of a structural kind and that it is not confined to noise suppression.

6 Illustration with Two Practical Problems
We applied the ML(q) classification procedure to two practical problems that were also evaluated in Kleinberg (2000). First, we took a credit scoring problem for which 690 examples are provided. This total number of examples was divided randomly 10 times into a training set and a test set of equal
Figure 1: Evolution of E for the training set and the test set (ε = 0.005, κ = 2, and λ = 5).
Figure 2: Evolution of classification errors on the training set and the test set.
size. The parameter values for our runs were ε = 0.01, κ = 2, and λ = 5. The smallest fractions of erroneously classified test items for each of the 10 partitions were 0.1101, 0.1043, 0.1101, 0.113, 0.1217, 0.113, 0.1391, 0.1217, 0.1072, and 0.1217, which gives an average of 0.1162. This is to be compared with the fractions of errors for the 10 random discrimination methods investigated in Kleinberg (2000) (these fractions likewise reflect averages over 10 separations of the example set into a training set and a test set), which were between 0.124 and 0.158. No further increase in performance was noticed when ML(q) was replaced by MK(q) or MK′(q). Second, ML(q) was applied to a diabetes classification problem. The total number of examples was 768, and again this set was split 10 times into a training set and a test set. The parameter values for all our runs were again ε = 0.01, κ = 2, and λ = 5. The smallest values of the fractions of errors on the test set were 0.2057, 0.2187, 0.2395, 0.2135, 0.2317, 0.2369, 0.2239, 0.2265, 0.2473, and 0.2213, with an average of 0.2265. This is to be compared with the values for the 10 methods studied by Kleinberg (2000), which were between 0.244 and 0.284. For this problem, MK′(q) occasionally appears to give a lower error for a = 2, but the average effect is marginal, of the order of 0.3%. This time, too, turning to MK(q) did not increase the performance, which suggests that these problems do not contain strong nonlinearities. For these two practical examples, ML(q) and MK′(q) appear to perform better than random discrimination methods.
The initial values of the errors (i.e., the values produced by L(q) and K′(q)) are typically slightly higher than the values produced by random discrimination.[3] But this is countered by the fact that the present method allows for adaptation and that adaptation generally does not harm the smoothness of the output function, so that generalization properties are enhanced when the performance on a training set is enhanced.

7 Discussion
Conceptually, the classification function L(q) has a straightforward relation to random discrimination methods. But the limit function also has an interpretation in terms of fields. Every training example is the source of a field. The strength of the field that originates at an example decays as a function of the configuration of all other examples. For a uniform distribution of other examples in the neighborhood of an example, the field produced by
[3] The main reason for this is as follows. Suppose that a sampling procedure is repeated several times with finite numbers of elementary functions. Then the average of the sampled functions can show statistical fluctuations around the theoretical average for an infinite number of elementary functions. If the run corresponding to the best performance is selected, better values than for the theoretical limit function may be obtained. This was the procedure followed in Kleinberg (2000), where 10 runs were made for each division of the example set into a training set and a test set.
the latter decays with radial symmetry. Symmetry or lack of symmetry of the examples in I relative to a particular example is directly reflected in the field emerging from the latter. Once the sum of the fields in a point is calculated, its classification is known. The strength of the fields can be optimized in terms of an error function. In terms of procedure, ML(q), MK(q), and MK′(q) can be seen as incremental methods that start from a fairly good initialization, corresponding to L(q), K(q), and K′(q), respectively. The quality of the initialization entails that the incremental procedure to be followed can be kept simple. The adaptability of amplitudes is an advantage relative to random discrimination methods. Further, the straightforward quantization of fields in K(q) yields a method that solves highly nonlinear problems like parity problems with remarkable efficiency. Generalization for such problems does not require further adaptation of amplitudes. This contrasts with backpropagation: in the classic parallel distributed processing work (Rumelhart, Hinton, & Williams, 1986), a network configuration with symmetry that solves parity problems is shown, but for high-dimensional problems, backpropagation generally does not find it. (In our experience, it does not find a solution at all for dimensionalities equal to eight or higher.)

References

Berlind, R. (1994). An alternative method of stochastic discrimination with applications to pattern recognition. Unpublished doctoral dissertation, State University of New York at Buffalo.

Ho, T. (1995). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (pp. 272–282). Montreal, Canada.

Kleinberg, E. (1996). An overtraining-resistant stochastic modeling method for pattern recognition. Annals of Statistics, 24, 2319–2349.

Kleinberg, E. (2000). On the algorithmic implementation of stochastic discrimination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 473–490.
Reed, R., & Marks, R. (1999). Neural smithing: Supervised learning in feedforward artificial neural networks. Cambridge, MA: MIT Press.

Reklaitis, G., Ravindran, A., & Ragsdell, K. (1983). Engineering optimization: Methods and applications. New York: Wiley.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.

Van Loocke, P. (2000). Fractals in cellular systems as solutions for cognitive problems. Fractals, 8(1), 7–14.

Van Loocke, P. (2001). Meta-patterns and higher-order meta-patterns in cellular systems. Artificial Intelligence Review, 16, 49–60.

Received September 2000; accepted July 27, 2001.
LETTER
Communicated by Shimon Edelman
Learning to Recognize Three-Dimensional Objects

Dan Roth
[email protected]

Ming-Hsuan Yang
[email protected]

Narendra Ahuja
[email protected]

Department of Computer Science and the Beckman Institute, Urbana, IL 61801, U.S.A.
A learning account for the problem of object recognition is developed within the probably approximately correct (PAC) model of learnability. The key assumption underlying this work is that objects can be recognized (or discriminated) using simple representations in terms of syntactically simple relations over the raw image. Although the potential number of these simple relations could be huge, only a few of them are actually present in each observed image, and a fairly small number of those observed are relevant to discriminating an object. We show that these properties can be exploited to yield an efficient learning approach in terms of sample and computational complexity within the PAC model. No assumptions are needed on the distribution of the observed objects, and the learning performance is quantified relative to its experience. Most important, the success of learning an object representation is naturally tied to the ability to represent it as a function of some intermediate representations extracted from the image. We evaluate this approach in a large-scale experimental study in which the SNoW learning architecture is used to learn representations for the 100 objects in the Columbia Object Image Library. Experimental results exhibit good generalization and robustness properties of the SNoW-based method relative to other approaches. SNoW's recognition rate degrades more gracefully when the training data contains fewer views, and it shows similar behavior in some preliminary experiments with partially occluded objects.
1 Introduction
The role of learning in computer vision research has become increasingly significant. Statistical learning theory has had an influence on many applications: classification and object recognition, grouping and segmentation, illumination modeling, scene reconstruction, and others. The rising role of learning methods, made possible by significant improvements in computing power and storage, is largely motivated by the realization that explicit modeling of complex phenomena in a messy world cannot be done without a significant role for learning. Learning is required for model and knowledge acquisition, as well as to support generalization and avoid brittleness. However, many statistical and probabilistic learning methods require making explicit assumptions, for example, on the distribution that governs the occurrences of instances in the world. For visual inference problems such as recognition, categorization, and detection, making these assumptions seems unrealistic. This work develops a distribution-free learning-theory account of an archetypal visual recognition problem: object recognition. The problem is viewed as that of learning a representation of an object that, given a new image, is used to identify the target object in it. The learning account is developed within the probably approximately correct (PAC) model of learnability (Valiant, 1984). This framework allows us to quantify success relative to the experience with previously observed objects, without making assumptions on the distribution, and to study the theoretical limits of what can be learned from images in terms of the expressivity of the intermediate representation used by the learning process. That is, learnability guarantees that objects sampled from the same distribution as the one that governs the experience of the learner are very likely to be recognized correctly. In addition, the framework gives guidelines for developing practical algorithmic solutions to the problem.

Neural Computation 14, 1071–1103 (2002). © 2002 Massachusetts Institute of Technology
Earlier work discussed the possibility of identifying the theoretical limits of what can be learned from images (Shvaytser, 1990) and found that learning in terms of the raw representation of the images is computationally intractable. Attempts to explain this focused on the dependence of learnability on the representation of the object (Edelman, 1993) but failed to provide a satisfactory explanation for it or a practical solution. The approach developed here builds on suggestions made in Kushilevitz and Roth (1996) and relies heavily on the development of a feature-efficient learning approach (Littlestone, 1988; Roth, 1998; Carlson, Cumby, Rosen, & Roth, 1999). This is a learning process capable of learning quickly and efficiently in domains in which the number of potential features is very large, but any concept of interest actually depends on a fairly small number of them. At the heart of the learning approach are two assumptions that we abstract as follows:

Representational: There exists a (possibly infinite) collection M of "explanations" such that an object o can be represented as a simple function of elements in M.
Procedural: There exists a process that, given an image in which the target object o occurs, efficiently generates "explanations" in M that are present in the image and such that, with high probability, at least one of them is in the representation of o.

Under these assumptions, we prove that there exists an efficient algorithm that, given a collection of images labeled as positive or negative examples of the target object, can learn a good representation of the object. That is, it can learn a representation that with high probability would make correct predictions on future images that contain (or do not contain) the object. Furthermore, we show that under these conditions, the learned representations are robust under realistic noisy conditions. A significant nonassumption of our approach is that it has no prior knowledge of the distribution of images, nor is it trying to estimate it. The learned representation is guaranteed to perform well when tested on images sampled from the distribution¹ that governed the data observed in its learning experience. Section 2 describes this framework in detail. The framework developed here is very general. The explanations alluded to above can represent a variety of computational processes and information sources that operate on the image. They can depend on local properties of the image, the relative positions of primitives in the image, and even external information sources or context variables. Thus, the theoretical support given here applies also to an intermediate learning stage in a hierarchical process. In order to generate the explanations efficiently, this work assumes that they are syntactically simple in terms of the raw image. However, the explanation might as well be syntactically simple in terms of previously learned or inferred predicates, giving rise to hierarchical representations. The main assumptions of the framework are discussed in section 3, where we also provide some concrete examples of the type of representations used.
For this framework to contribute to a practical solution, there needs to be a computational approach that is able to learn efficiently (in terms of both computation and number of examples) in the presence of a large number of potential explanations. Our evaluation of the theoretical framework makes use of the SNoW learning architecture (Roth, 1998; Carlson et al., 1999), which is tailored for this kind of task. The SNoW system (available online at http://L2R.cs.uiuc.edu/~cogcomp/) is described in section 4. This is followed by a comparison of SNoW with support vector machines in section 5. In section 6, we review and compare our method with previous work on view-based object recognition. We then describe our experiments comparing SNoW to other approaches in section 7. In these experiments, SNoW is shown to exhibit a high level of recognition and robustness. We find that the SNoW-based approach compares favorably with other approaches and behaves more robustly in the presence of fewer views in the training data. We conclude with some directions for future work in section 8.

1. It will be clear from the technical discussion that the distribution is over the intermediate representation (features) generated given the images.

2 Learning Framework
We study learnability within the standard PAC model (Valiant, 1984) and the mistake-bound model (Littlestone, 1988). Both learning models assume the existence of a concept class C, a class of {0, 1}-valued functions over an instance space X with an associated complexity parameter (typically, X's dimensionality) n, and some unknown target concept fT ∈ C that we are trying to learn. In the mistake-bound model, an example x ∈ X is presented at each learning stage; the learning algorithm is asked to predict fT(x) and is then told whether the prediction was correct. Each time the learning algorithm makes an incorrect prediction, we charge it one mistake. We say that C is mistake-bound learnable if there exists a polynomial-time prediction algorithm A (possibly randomized) that for all fT ∈ C and any sequence of examples is guaranteed to make at most polynomially many (in n) mistakes. We say that C is expected mistake-bound learnable if there exists A, as above, such that the expected number of mistakes it makes for all fT ∈ C and any sequence of examples is at most polynomially many (in n). Note that the expectation is taken over the random choices made by A; no probability distribution is associated with the sequences. In learning an unknown target function fT ∈ C in the PAC model, we assume that there is a fixed but arbitrary and unknown distribution D over the instance space X. The learning algorithm sees examples drawn independently according to D together with their labels (positive or negative). Then it is required to predict the value of fT on another example drawn according to D. Denote by h(x) the prediction of the algorithm on the example x ∈ X. The error of the algorithm with respect to fT and D is measured by error(h) = Pr_{x∈D}[ fT(x) ≠ h(x) ]. We say that C is PAC learnable if there exists a polynomial-time learning algorithm A and a polynomial p(·, ·, ·) such that for all n ≥ 1, all target concepts fT ∈ C, all distributions D over X, and all ε > 0 and 0 < δ ≤ 1, if the algorithm A is given p(n, 1/ε, 1/δ) examples, then with probability at least 1 − δ, A's hypothesis h satisfies error(h) ≤ ε. It can be shown that if a concept class C is learnable in the expected mistake-bound model (and thus in the mistake-bound model), then it is PAC learnable (Haussler, Kearns, Littlestone, & Warmuth, 1988). The agnostic learning model (Haussler, 1992; Kearns, Schapire, & Sellie, 1994), a variant of the PAC learning model, might be more relevant to practical learning; it applies when one does not want to assume that the labeled training examples (x, l) arise from a target concept of an a priori known structure fT ∈ C. In this model, one assumes that data elements (x, l) are sampled according to some arbitrary distribution D on X × {0, 1}. D may
simply reflect the distribution of the data as they occur "in nature" without assuming that the labels are generated according to some "rule." As in the PAC model, the goal is to find an approximately correct h in some class H of hypotheses. In terms of generalization bounds, the models are similar, and therefore we will discuss the PAC/mistake-bound case here. In practice, learning is done on a set of training examples, and its performance is then evaluated on a set of previously unseen examples. The hope that a classifier learned on a training set will perform well on previously unseen examples is based on the basic theorem of learning theory (Valiant, 1984; Vapnik, 1995); stated informally, it guarantees that if the training data and the test data are sampled from the same distribution, good performance on a large enough training sample guarantees good performance on the test data (implying good "true" error),² where the difference between the performance on the training data and that on the test data is parameterized using a parameter that measures the richness of the hypothesis class H. For completeness, we simply cite the following uniform convergence result:

Theorem 1 (Haussler, 1992). If the size of the training sample S is at least

m(ε, δ) = (1/ε²) ( k·VC(H) + ln(2/ε) + ln(1/δ) ),

then with probability at least 1 − δ, the learned hypothesis h ∈ H satisfies errorD(h) < errorS(h) + ε, where k is some constant and VC(H) is the VC dimension of the class H (Vapnik, 1982), a combinatorial parameter that measures the richness of H.

2.1 Learning Scenario. Let I be an input space of images. Our goal is to learn a definition such as apple: I → {0, 1} that, when evaluated on a given image, outputs 1 when there is an apple in the image and 0 otherwise. It is clear, though, that this target function is very complex in terms of the input space; in particular, it may depend on relational information and even quantified predicates. Many years of research in learning theory, however, have shown that efficient learnability of complex functions is not feasible (Angluin, 1992). In the learning scenario described here, therefore, learning will not take effect directly in terms of the raw input. Rather, we will learn the target definitions in terms of an intermediate representation that will

2. In this sense, the evaluation done in section 7 after training on a small training set is not as optimal as, say, a face detection study (Yang, Roth, & Ahuja, 2000c) done on a large training set. However, the theory quantifies the dependence of the performance on the size of the training data, and the experimental study exhibits how different algorithmic approaches fare with a relatively small number of examples.
be generated from the input image. This will allow us to learn a simpler functional description, quantifying learnability in terms of the expressivity of the intermediate representation as well as the function learned on top of it; in particular, it makes explicit the requirements on an intermediate representation so that learning is possible. A relation³ over the instance space I is a function χ: I → {0, 1}. χ can be viewed as an indicator function over I, defining the subset of those elements mapped to 1 by χ. A relation χ is active in I ∈ I if χ(I) = 1. Given an instance, we would like to transform it so that it is represented as the collection of the relations that are active in it. We would like to do that, though, without the need to write down explicitly all possible relations that could be active in the domain ahead of time. This is important, in particular, over infinite domains or in on-line situations where the domain elements are not known in advance and it is therefore impossible to write down all possible relations. An efficient way to do that is given by the construct of relation-generating functions (Cumby & Roth, 2000).

Definition 1. Let X be an enumerable collection of relations over I. A relation-generating function (RGF) is a mapping G: I → 2^X that maps I ∈ I to the set of all elements χ in X that satisfy χ(I) = 1. If there is no χ ∈ X for which χ(I) = 1, G(I) = ∅.
RGFs can be thought of as a way to define kinds of relations, or to parameterize over a large space of relations. Only when presented with an instance I ∈ I is a concrete relation generated. For example, G may be the RGF that corresponds to all vertical edges of size 3 in an image. Given an image, a list of all these edges that are present in the image is produced. It will be clear in section 2.3 that RGFs can be noisy to a certain extent, since the framework can tolerate some amount of noise at the feature level.

2.2 Learning Approach. We now present a mistake-bound algorithm for a class of functions that can be represented as DNF formulas over the space X of all relations. As indicated, this implies a PAC learning algorithm, but the proof for the mistake-bound case is simpler. In section 7, we will learn a more general function—a linear threshold function over conjunctions of relations in X. We discuss later how the theoretical results can be expanded to this case.
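As an illustration, the vertical-edge RGF just mentioned can be sketched in a few lines. This is a toy sketch of ours, not code from this work; the function name and the identifier format are hypothetical.

```python
from typing import List

def vertical_edge_rgf(img: List[List[int]], length: int = 3) -> List[str]:
    """Toy relation-generating function: maps a binary image to the set of
    relations 'vertical edge of `length` at (row, col)' that are active in it.
    Relations are returned as string identifiers; nothing is enumerated up
    front, so only relations active in this particular image are generated."""
    active = []
    rows, cols = len(img), len(img[0])
    for c in range(cols):
        for r in range(rows - length + 1):
            if all(img[r + i][c] == 1 for i in range(length)):
                active.append(f"vedge:{length}@({r},{c})")
    return active  # an empty list plays the role of G(I) = the empty set
```

Note that the set of possible relation identifiers is effectively unbounded (one per position and length), yet the cost of a call depends only on the image, matching the data-driven spirit of definition 1.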
3. In the machine learning literature, a relation is sometimes called a feature. We use the term relation here to emphasize that it is a Boolean predicate and that, in principle, it could be a higher-order predicate, that is, it could take variables (Cumby & Roth, 2000). In this article, features are simple functions (e.g., monomials) over relations.

Definition 2. Let X be a set of relations that can be generated by a set of RGFs. Let M be a collection of monomials (conjunctions) over the elements of X, and let p(n), q(n), and g(n) be polynomials. Let C_M be the class of all functions that are disjunctions of at most p(n) monomials in M. Following Kushilevitz and Roth (1996), we call C_M polynomially explainable if there exists an efficient (polynomial-time) algorithm B such that for every function f ∈ C_M and every positive example of f as input, B outputs at most q(n) monomials (not necessarily all of them in M) such that with probability at least 1/g(n), at least one of them appears in f (the probability is taken over the coin flips of the (possibly probabilistic) algorithm B).

It should be clear that the class of polynomially explainable DNFs is a strict generalization of the class of, say, k-DNF, for any fixed k. This difference might be important in the current context. It might be possible that some structural constraints govern the generation of the monomials, placing them in M, but the size of the monomials is not fixed. As a canonical example of the difference, consider the following. Let S1, …, St be subsets of {x1, …, xn}, where t is polynomial in n. Let M be any collection of monomials with the property that for every m ∈ M, the set of variables in m is Si for some 1 ≤ i ≤ t (i.e., any set Si may correspond to at most 2^|Si| monomials in M, by choosing for each xj ∈ Si whether xj or x̄j appears in the monomial). If there exists an efficient algorithm B that on input n enumerates these sets (and possibly some more), then C_M is polynomially explainable. The reason is that although M may contain an exponential number of monomials, given an example (x1, …, xn), only one element in each of the Si's might correspond to it. That is, the data-driven nature of the process allows working with families of features that could be very large, provided that only a reasonable number of them (polynomial, in our definition) is active in each observation.
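The data-driven generation just described, together with the add/remove mistake-driven learning loop presented in this section, can be sketched as follows. This is a minimal illustration of ours, not the implementation used in this work; all names are hypothetical, and the generator shown is the deterministic per-S_i construction (so g(n) ≡ 1).

```python
from typing import FrozenSet, List, Set, Tuple

# A monomial is a conjunction, encoded as a set of (index, required_value) literals.
Monomial = FrozenSet[Tuple[int, int]]

def generate_B(example: Tuple[int, ...], sets: List[List[int]]) -> List[Monomial]:
    """Explanation generator B for the S_i construction: although each S_i
    corresponds to exponentially many monomials, only the single monomial
    over S_i consistent with the given example is generated."""
    return [frozenset((j, example[j]) for j in S) for S in sets]

def satisfies(m: Monomial, example: Tuple[int, ...]) -> bool:
    return all(example[j] == v for j, v in m)

def learn(stream, sets) -> Set[Monomial]:
    """Mistake-driven learner for disjunctions of monomials: h starts as the
    empty disjunction (FALSE); on a mistaken positive, add B's monomials;
    on a mistaken negative, drop every monomial the example satisfies."""
    h: Set[Monomial] = set()
    for example, label in stream:
        prediction = any(satisfies(m, example) for m in h)
        if prediction == bool(label):
            continue                      # correct prediction: no update
        if label:
            h.update(generate_B(example, sets))
        else:
            h = {m for m in h if not satisfies(m, example)}
    return h
```

The loop never removes a monomial of the target function, which is the key invariant in the mistake-bound analysis below.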
Example 1. As a concrete instantiation, consider the class of all DNF formulas in which the variables in each monomial have consecutive indices—for example, f = x1 x̄2 x3 x4 ∨ x̄4 x̄5 x6 ∨ x8 x9. Clearly, f ∉ k-DNF for any constant k. However, it is easy to enumerate the (n choose 2) < n² sets Si,j (1 ≤ i ≤ j ≤ n) defined by Si,j = {xi, xi+1, …, xj}, and, as above, although the number of corresponding monomials is exponential, given an example, only one is relevant in each Si,j. A similar situation occurs when learning representations of visual objects. In this case, the sets might be defined by structural constraints, and each pixel will take a constant number of values (albeit larger than 2, as in the above examples), but only one of these will be relevant given an example.
We note that in principle, it is possible to abstract the generation of the conjunctions into the RGFs (see definition 1). However, we would like to emphasize the generation of conjunctions over simple relations and the possibility of learning on top of it, given arguments in the literature for its
effectiveness and potential biological plausibility (Fleuret & Geman, 1999; Ullman & Soloviev, 1999). The elements generated by the algorithm B, the explanations, are what we will later call the "features" supplied to the learning algorithm. Definition 2 implements the assumptions that we abstracted in section 1. The algorithm B is the procedural part. B keeps a small set of syntactically simple definitions, the RGFs, and given an image, it outputs those instantiations that are present in the image. All we require here is that with nonnegligible probability, it outputs at least one "relevant" explanation (and possibly many that are irrelevant). The class C_M implements our representational assumption; we assume that each object has a simple (disjunctive) representation over the relational monomials. This assumption can be verified only experimentally, as we do later in this article. We emphasize that f itself is not given to the algorithm B. Also note that a function f in the class C_M may have a few equivalent representations as a disjunction of monomials in M. The definition requires the output of the algorithm B to satisfy the above property, independent of which of these representations of f is considered. The importance of this will become clear when we analyze the learning algorithm below.

Theorem 2. If C_M is polynomially explainable, then C_M is expected mistake-bound learnable. Furthermore, if C_M is polynomially explainable by an algorithm B that always outputs at least one term of f (i.e., g(n) ≡ 1), then C_M is mistake-bound learnable.
Proof. The algorithm is similar to an algorithm presented in Blum (1992), which learns a disjunction in the infinite attribute model. The algorithm maintains a hypothesis h, which is a disjunction of monomials. Initially, h contains no monomials (i.e., h ≡ FALSE). Upon receiving an example e, the algorithm predicts h(e); if the prediction is correct, h is not updated. Otherwise, upon a mistaken prediction, it proceeds as follows:

If e is positive: Execute B (the algorithm guaranteed by the assumption that C_M is polynomially explainable) on the example e, and add the monomials it outputs to h.

If e is negative: Remove from the hypothesis h all the monomials that are satisfied by e (there must be at least one).

The analysis of the algorithm is straightforward for the case g(n) ≡ 1 and more subtle in general. To analyze the algorithm, we first fix a representation for f as a disjunction of monomials in M (in case f has more than one possible representation, choose one arbitrarily; we can work with any representation of f that uses only monomials in M). Now note that an active monomial (i.e., a monomial that appears in this representation of the target
function f) is never removed from h. Therefore, since on a positive example e the algorithm B is guaranteed to output, with probability at least 1/g(n), at least one monomial that appears in f, the expected number of mistakes made on positive examples is at most p(n)·g(n). This also implies that the expected total number of monomials included in h during the execution of the algorithm is not more than p(n)·q(n)·g(n).⁴ Each mistake on a negative example results in removing at least one of the monomials included in h but not in f. The expected number of these monomials is therefore at most p(n)·q(n)·g(n). The expected total number of mistakes made by the algorithm is O(p(n)·q(n)·g(n)). Finally, note that in the case g(n) ≡ 1, we get a truly mistake-bound algorithm, whose number of mistakes is bounded by p(n)·q(n).

The algorithm used in practice, in SNoW, is conceptually similar. The main difference is that the hypothesis h used is a general linear threshold function over elements in M rather than a disjunction, which is a restricted linear threshold function. Algorithmically, rather than dropping elements from it, their weights are updated. The details of this process (see section 4) are crucial for our approach to be computationally feasible for large-scale domains and for robustness. In order to expand the theoretical results to this case, we appeal to the results in Littlestone (1988), modified to the case of the infinite attribute domain. If the target function is indeed a disjunction over elements in M, our results hold directly, with a much improved mistake bound that depends mostly on the number of elements in M that are actually relevant to f. If f is a general linear function over M, this behavior still holds with an additional dependence on the margin between positive and negative examples, as measured in the M space (d, in Littlestone, 1988; see details there).

2.3 Robustness. Any realistic learning framework needs to support different kinds of noise in its input. Several kinds of noise have been studied in the literature in the context of PAC learning, and algorithms of the type we consider here have been shown to be robust to them. The most studied type of noise is classification noise (Kearns & Li, 1993), in which the examples are assumed to be given to the learning algorithm with labels that are flipped with some probability smaller than 1/2. Learning in our framework can be shown to be robust to this kind of noise, as well as to the more realistic case of attribute noise, in which the description of the input itself is corrupted to a certain degree. We believe that this is the type of noise that is more relevant in the current case. First, learning is done in terms of

4. Note that if B was guaranteed only to give a monomial that appears in some representation of f, then this bound is false (as it could be the case that the active monomials in different executions of B belong to different representations of f). This explains the requirement of the definition that seems too strong.
the output of the RGFs, which may introduce some noise. Second, attribute noise is related to occlusion noise, which is important in object recognition. Specifically, attribute noise can be used to model the type of noise that usually occurs when other objects appear in the image, behind or in front of the target object. This is formalized next using the notion of domination. Let f1, f2 be two concepts. We say that f1 is k-dominated by f2 if each f1 example can be obtained from an f2 example by flipping the (binary) value of at most k of the active relations. In this case, f2 k-dominates f1. The labels of the examples, however, are generated according to the original concept, before the noise is introduced.

Theorem 3. If a class C_M is learnable by virtue of being polynomially explainable, then it is learnable even if examples of the target class are cluttered by k-domination attribute noise, for any constant k.
Proof. The proof is an extension of the arguments in Littlestone (1991) regarding robustness to attribute noise to the case of the infinite attribute model. It basically shows that the same algorithm works in the noisy case, except that the number of flipped relations affects the number of examples that the algorithm is required to see before it stops making mistakes.

3 From Theory to Practice
Several issues need to be addressed in order to exhibit the practicality of our learning framework. The first is the availability of a variety of RGFs that can be used to extract primitive visual patterns from data under different conditions and that are expressive enough so that a simple function defined on top of them is enough to discriminate an object. A basic assumption underlying this work is that this is not hard to do. In this work, we illustrate the approach by using simple edge detectors (clearly, too simple for a realistic situation). The second issue is the composition of complex, albeit simply defined, relations from primitive ones. This is crucial since it allows the representation of complex functions in terms of the instantiated relations by learning simple functional descriptions over their compositions. A language that supports composition of restricted families of conjunctions and can encode structural relations in images (e.g., above, to the left of . . .) is discussed in Cumby and Roth (2000). The current work uses only general conjunctions and restricts only their size. Again, our working assumption is that even this simple representation is enough to generate a family of discriminating features. Specifically, the above discussion amounts to assuming that it is (1) easy to extract simple relations over images—short vertical and horizontal edges in our case; (2) easy to extract simple monomials over these—each of our features represents the presence of some short conjunction of edges in the
Figure 1: The short vertical and horizontal edges are extracted from an object image. These edges (and the conjunctions of them) represent the features of the object.
image; and (3) a linear threshold function (generalizing the simple disjunction in the theorems) over these features can be used to discriminate an object from other objects. We represent each object, therefore, as a linear function over the conjunctive features. In Figure 1, we exemplify the feature-based representation extracted for three objects when the features used are only short vertical and horizontal edges (no conjunctions of those are used here). Our theory does not commit to this simplified and clearly insufficient representation. Nevertheless, such simple features capture sufficient information to recognize the 100 objects in the Columbia Object Image Library (COIL-100) database, as will be demonstrated. Finally, the issue of the learnability of these representations is crucial in our approach. In learning situations in vision, the number of relation compositions (features) that could potentially affect each decision is very large, but typically, only a small number of them is actually relevant to a decision. Beyond correctness, a realistic learning approach therefore needs to be feature efficient (Littlestone, 1988) in that its learning complexity (the number of examples required for the learning algorithm to converge to a function that is a good discriminator) depends on the number of relevant features and not on the global number of features in the domain. Equivalently, this can be phrased as the dependence of the true error on the error observed on the training data and the number of examples observed during training. For a feature-efficient algorithm, the number of training examples required in order to generalize well—to have true error that is not far from the error on the training data—is relatively small (Kivinen & Warmuth, 1995b).
A realistic learning approach in this domain should also allow the use of variable input size, for two reasons. First, learning is done in terms of relations that are generated from the image in a data-driven way, making it impossible, or impractical, for a learning approach to write out explicitly, in advance, all possible relations and features. Similarly, dealing with a variable input size also allows an on-line learning scenario. Second, for computational efficiency purposes, since only a few of the many potential features are active in any instance, using a variable input size allows the complexity of evaluating the learning hypothesis on an instance to depend on the number of active features in the input rather than the total number in the domain. Given that, the learning approach used in this work is the one developed within the SNoW learning architecture (Roth, 1998; Carlson et al., 1999). SNoW is specifically tailored for learning in domains in which the potential number of features taking part in decisions is very large but may be unknown a priori, as in the infinite attribute learning model (Blum, 1992). Specifically, as input, the algorithm receives labeled instances ⟨(x, l)⟩, where an instance x ∈ {0, 1}^∞ is presented as a list of all the active features in it and the label is a member of a discrete set of values (e.g., object identifiers). Given a domain instance (an image), a set of preexisting RGFs is evaluated on it and generates a collection of relations that are active in this image; these in turn may be composed to generate complex features, the elements of M. A list (of unique identifiers) of active elements in M is presented to the learning procedure, and learning is done at this level. Specifically, given an image, the learning algorithm receives as input a description of it in terms of elements in M that are active in it.
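The instance encoding described above, where active relations are composed into conjunctive features and presented as a list of unique identifiers allocated on first sight, can be sketched as follows. This is a hypothetical illustration of ours, not the actual SNoW front end; the class and function names are made up.

```python
from itertools import combinations
from typing import Dict, List

class FeatureLexicon:
    """Allocates integer identifiers for features in a data-driven way: a
    feature gets an id the first time it is active in some instance, so no
    global feature space is ever enumerated up front."""
    def __init__(self):
        self.ids: Dict[str, int] = {}

    def id_of(self, name: str) -> int:
        return self.ids.setdefault(name, len(self.ids))

def instance_features(relations: List[str], lex: FeatureLexicon,
                      max_conj: int = 2) -> List[int]:
    """Map the active relations of one image to the active elements of M:
    the relations themselves plus conjunctions of up to `max_conj` of them,
    returned as a variable-length list of unique identifiers."""
    names = list(relations)
    for size in range(2, max_conj + 1):
        for combo in combinations(sorted(relations), size):
            names.append("&".join(combo))
    return [lex.id_of(n) for n in names]
```

The output length depends only on how many features are active in the instance, which is what makes the variable-input-size representation cheap to process.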
As required by the theorems already presented, at least some elements in this description should be relevant to the functional description of the target object, but many others may not be. The learning algorithm will quickly determine, by adjusting the weights of different features, the appropriate linear combination of features that can be used as a good discriminator. We note that conceptually similar approaches have been developed by several researchers (Fleuret & Geman, 1999; Amit & Geman, 1999; Tieu & Viola, 2000; Mel & Fiser, 2000). In all cases, the approach is based on generating a fairly large number of features—typically generated as conjunctions of primitive features—and hoping that a learning approach can learn a reliable discriminator as a function of these. These approaches made use of simple statistical methods (Mel & Fiser, 2000) or learning algorithms such as decision trees and AdaBoost (Fleuret & Geman, 1999; Tieu & Viola, 2000). Our approach differs in that (1) given some reasonable assumptions, we suggest a theoretical framework in which justifications can be developed and in which performance can be quantified as a function of the expressivity of the features used, and (2) it makes use of a different computational paradigm that we believe to be more appropriate in this domain. We use a feature-efficient learning method that provides the opportunity to learn high-order
Figure 2: (Top) Images of the same object taken at different viewpoints. We use SNoW to learn the representations of each object: one subnetwork of features for the pill box (b) and the other for the mug (f). (a, e) depict the features with the top weights, respectively. Given a test image (b) and the learned representations (a, e), (c) shows the active features in the pill box subnetwork (a), and (d) shows the active features in the mug subnetwork (e). Given the test image (f) and the learned representations (a, e), (g) shows the active features in the mug subnetwork (e), and (h) shows the active features in the pill box subnetwork (a).
discriminators efficiently by increasing the dimensionality of the feature space (in a data-driven way) and still learn a linear function in the augmented feature space. This eliminates the need to perform explicit feature selection as in the other methods—it is done automatically and efficiently in the learning stage—or to deal with each feature separately (as in AdaBoost). In Figure 2, we use two objects to illustrate the learned representation after training on a collection of examples, as well as the features in the learned representation that are active in a new example.
The images of a pill box object class and a mug object class taken at different viewpoints (5 degrees apart) are shown at the top of Figure 2. We extract small edges (of length 3) from each example and use these as the representation of each instance in training a SNoW system: one subnetwork for the pill box object class and the other for the mug object class (see also Figure 1 for examples of the input representation to SNoW). Figures 2a and 2e depict the dominant features in the representations learned for the pill box object class and the mug object class (the weights of the features are not shown; we show only those that have relatively high weights). Given the pill box test image shown in Figure 2b and the two learned subnetworks, Figure 2c shows the features of the test image that are active in the pill box subnetwork. Again, weights are not shown, but even so, this can be viewed as evidence that the test image belongs to the pill box object class. Figure 2d shows the features of the same test image that are active in the mug subnetwork; here the evidence that the test image belongs to the mug object class is weak. Given the mug test image shown in Figure 2f and the two learned subnetworks, Figures 2g and 2h show the features of the test image that are active in the mug and the pill box subnetworks, respectively. It seems clear, even with the simple features shown here (only short edges) and without the weight information on the features, that the learned representations can discriminate images of objects taken from one target class from those taken from a different class of objects.

4 The SNoW Learning Architecture
The SNoW (Sparse Network of Winnows⁵) learning architecture is a sparse network of linear units over a common predefined or incrementally learned feature space. Nodes in the input layer of the network typically represent relations over the input instance and are used as the input features. Each linear unit, called a target node, represents a concept of interest over the input. In the current application, target nodes represent a definition of an object in terms of the elements of M—features extracted from the 2D image input. An input instance is mapped into a set of features active in it; this representation is presented to the input layer of SNoW and propagates to the target nodes. Target nodes are linked via weighted edges to (some of) the input features. Let A_t = {i_1, . . . , i_m} be the set of features that are active in an example and are linked to the target node t. Then the linear unit corresponding to t is active iff

    Σ_{i ∈ A_t} w_{t,i} > θ_t,
⁵ To winnow: to separate chaff from grain.
Learning to Recognize Three-Dimensional Objects
1085
where w_{t,i} is the weight on the edge connecting the ith feature to the target node t, and θ_t is the threshold for the target node t. Each SNoW unit may include a collection of subnetworks, one for each of the target relations but all using the same feature space. A given example is treated autonomously by each target unit; an example labeled t may be treated as a positive example by the t unit and as a negative example by the rest of the target nodes in its subnetwork. At decision time, a prediction for each subnetwork is derived using a winner-take-all policy. In this way, SNoW may be viewed as a multiclass predictor. In the application described here, we may have one unit with target subnetworks for all the target objects, or we may define different units, each with two competing target objects. SNoW's learning policy is on-line and mistake driven. Several update rules can be used within SNoW; the most successful, and the only one used in this work, is a variant of Littlestone's Winnow update rule (Littlestone, 1988), a multiplicative update rule that is tailored to the situation in which the set of input features is not known a priori, as in the infinite attribute model (Blum, 1992). This mechanism is implemented via the sparse architecture of SNoW. That is, (1) input features are allocated in a data-driven way—an input node for the feature i is allocated only if the feature i was active in any input vector—and (2) a link (i.e., a nonzero weight) exists between a target node t and a feature i if and only if i was active in an example labeled t. One of the important properties of the sparse architecture is that the complexity of processing an example depends only on the number of features active in it, n_a, and is independent of the total number of features, n_t, observed over the lifetime of the system. This is important in domains in which the total number of features is very large but only a small number of them are active in each example.
The Winnow update rule has, in addition to the threshold θ_t at the target t, two update parameters: a promotion parameter α > 1 and a demotion parameter β, 0 < β < 1. These are used to update the current representation of the target t (the set of weights w_{t,i}) only when a mistake in prediction is made. Let A_t = {i_1, . . . , i_m} be the set of active features linked to the target node t. If the algorithm predicts 0 (that is, Σ_{i ∈ A_t} w_{t,i} ≤ θ_t) and the received label is 1, the active weights in the current example are promoted in a multiplicative fashion:

    ∀i ∈ A_t: w_{t,i} ← α · w_{t,i}.

If the algorithm predicts 1 (that is, Σ_{i ∈ A_t} w_{t,i} > θ_t) and the received label is 0, the active weights in the current example are demoted:

    ∀i ∈ A_t: w_{t,i} ← β · w_{t,i}.

All other weights are unchanged. The key feature of the Winnow update rule is that the number of examples required to learn a linear function grows linearly with the number
n_r of relevant features and only logarithmically with the total number of features. Specifically, in the sparse model, the number of examples required before converging to a linear separator that separates the data (provided one exists) scales as O(n_r log n_a), where n_a is the number of active features observed. This property seems crucial in domains in which the number of potential features is vast but a relatively small number of them is relevant. Winnow is known to learn efficiently any linear threshold function (in general, the number of examples scales inversely with the margin; Littlestone, 1988) and to be robust in the presence of various kinds of noise and in cases where no linear-threshold function can make perfect classifications, while still maintaining its dependence on the number of total and relevant attributes (Littlestone, 1991; Kivinen & Warmuth, 1995a). Once target subnetworks have been learned and the network is being evaluated, a decision support mechanism is employed, which selects the dominant active target node in the SNoW unit via a winner-take-all mechanism to produce a final prediction. Figures 3, 4, and 5 provide more details on the SNoW learning architecture. Essentially, the SNoW learning architecture inherits its generalization properties from the update rule being used—the Winnow rule in this case—but there are a few differences worth mentioning relative to simply using the basic update rule. First, SNoW allows the use of a variable input size via the infinite attribute domain. Second, SNoW is more expressive than the basic Winnow rule. The basic Winnow update rule makes use of positive weights only. Standard augmentation, for example, via the duplication trick (Littlestone, 1988), is infeasible in high-dimensional spaces, since it diminishes the gain from using variable-size examples (half of the features become active).
More sophisticated approaches, such as using the balanced version of Winnow, apply only to the case of two classes, while SNoW is a multiclass classifier. Other extensions offered in SNoW relative to the standard update rule include an involved feature-pruning method and a prediction confidence mechanism (Carlson, Rosen, & Roth, 2001).
F = Z⁺ = {0, 1, . . .}                      /* Set of potential features */
T = {t_1, . . . , t_k} ⊂ F                  /* Set of targets */
F_t ⊆ F                                     /* Set of features linked to target t */
t_NET = {[(i, w_{t,i}): i ∈ F_t], θ_t}      /* The representation of the target t */
activation: T → R

Procedure UpdateWeights(t, e)
    If (activation(t) > θ_t) & (t ∉ e)      /* predicted positive on a negative example */
        for each i ∈ e: w_{t,i} ← w_{t,i} · β
    If (activation(t) ≤ θ_t) & (t ∈ e)      /* predicted negative on a positive example */
        for each i ∈ e: w_{t,i} ← w_{t,i} · α

Procedure UpdateArchitecture(t, e)
    If t ∈ e: for each i ∈ e∖F_t, set w_{t,i} = w   /* Link feature to target; set initial weight */
    Otherwise: do nothing

Procedure MakeDecision(SNoW, e)
    Predict winner = argmax_{t ∈ T} activation(t)   /* Winner-take-all prediction */

Figure 5: SNoW: Main procedures.
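The procedures of Figure 5 can be sketched in Python as follows. This is a minimal illustration under assumed parameter values (α = 1.5, β = 0.8, θ = 1), not the authors' implementation; the class and method names are our own.

```python
class SNoWUnit:
    """Minimal sketch of a SNoW unit: sparse linear target nodes over a
    common feature space, trained with the Winnow update rule."""

    def __init__(self, targets, alpha=1.5, beta=0.8, theta=1.0, w_init=1.0):
        self.alpha, self.beta = alpha, beta
        self.theta, self.w_init = theta, w_init
        # Sparse representation: one weight dict per target; a feature is
        # linked to a target only after co-occurring with it (data-driven).
        self.weights = {t: {} for t in targets}

    def activation(self, t, active_features):
        w = self.weights[t]
        return sum(w[i] for i in active_features if i in w)

    def train(self, active_features, label):
        for t, w in self.weights.items():
            if t == label:
                # UpdateArchitecture: link new active features to the target.
                for i in active_features:
                    w.setdefault(i, self.w_init)
            act = self.activation(t, active_features)
            if act <= self.theta and t == label:
                # Predicted negative on a positive example: promote.
                for i in active_features:
                    if i in w:
                        w[i] *= self.alpha
            elif act > self.theta and t != label:
                # Predicted positive on a negative example: demote.
                for i in active_features:
                    if i in w:
                        w[i] *= self.beta

    def predict(self, active_features):
        # MakeDecision: winner-take-all over the target activations.
        return max(self.weights, key=lambda t: self.activation(t, active_features))
```

Note that the per-example cost depends only on the number of active features, mirroring the sparse-architecture property discussed above.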
The SNoW learning architecture has been used successfully on a variety of large-scale problems in the natural language domain (Roth, 1998; Munoz, Punyakanok, Roth, & Zimak, 1999; Golding & Roth, 1999) and only recently has been applied to problems in the visual domain (Yang et al., 2000c). Finally, we note that, although not crucial to the main topic of this article, the SNoW learning architecture and the general approach of generating features for it could also be motivated on grounds of neural plausibility. First, SNoW represents objects in units that have only positive, excitatory connections, corresponding to the physiological fact that the firing rates of neurons cannot be negative; inhibitory effects of features arise from their occurrences in the representations of other objects. Moreover, a consequence of the weight update rule in the SNoW unit is that synapses do not change sign. Second, generating features as conjunctions of simple binary detectors (which can be SNoW units themselves) has been suggested before as an effective representation with potential biological plausibility (Fleuret & Geman, 1999; Ullman & Soloviev, 1999).

5 Discussion of Learning Methods
In this article, our learning approach is compared mostly to another linear learning approach: support vector machines (SVM). Below, we present the SVM method in some detail.

5.1 Support Vector Machines. SVM (Vapnik, 1995; Cortes & Vapnik, 1995) is a general-purpose learning method for pattern recognition and regression problems that is based on the theory of structural risk minimization. According to this principle, as described in section 2, a function that describes the training data well and belongs to a set of functions with low VC dimension will generalize well (that is, will guarantee a small expected recognition error for unseen data points) regardless of the dimensionality of the input space (Vapnik, 1995). Based on this principle, SVM is a systematic approach to finding a linear function (a hyperplane) that belongs to a set of functions of this form with the lowest VC dimension. Linear classifiers are used for computational purposes and, in addition, have the nice property that the VC dimension of this function class can be quantified explicitly in terms of the minimal distance (margin) between positive and negative points (assuming the data are linearly separable). For expressivity, SVMs provide nonlinear function approximations by mapping the input vectors into a high-dimensional feature space in which a linear hyperplane that separates the data exists. The method can also be extended to cases where the best hyperplane in the resulting high-dimensional space does not quite separate all the data points. Given a set of samples (x_1, y_1), (x_2, y_2), . . . , (x_l, y_l), where x_i ∈ R^N is the input vector and y_i ∈ {−1, 1} is its label, an SVM aims to find a separating hyperplane with the property that the distance it has from points of either
class (margin distance) is maximized. Vapnik (1995) shows that maximizing the margin distance is equivalent to minimizing the VC dimension and therefore contributes to better generalization. The problem of finding the optimal hyperplane is thus posed as a constrained optimization problem and solved using quadratic programming techniques. The optimal hyperplane, which determines the class label of a data point x ∈ R^N, is of the form

    f(x) = sgn( Σ_{i=1}^{l} y_i α_i · k(x, x_i) + b ),

where k(·, ·) is a kernel function, used to map the original instance space to a high-dimensional space, b is a bias term, and sgn is the function that outputs +1 on positive inputs and −1 otherwise. Constructing an optimal hyperplane is equivalent to determining the nonzero α_i's. Sample vectors x_i that correspond to a nonzero α_i are called the support vectors (SVs) of the optimal hyperplane. The hope, when using this method, is to find a small number of SVs, thereby producing a compact classifier. The use of kernel functions avoids the need to blow up the dimensionality explicitly in order to reach a state in which the sample is linearly separable. If the kernel is of the form k(x, x_i) = Φ(x) · Φ(x_i) for some nonlinear function Φ: R^N → R^M, the computation can be done in the original low-dimensional space rather than in the M-dimensional space, although the hyperplane is constructed in R^M. For a linear SVM, the kernel function is simply the dot product of vectors in the input space. Several kernel functions, such as polynomial functions and radial basis functions, have this property (Mercer's theorem) and can be used in nonlinear SVMs, allowing the construction of a variety of learning machines, some of which coincide with classical architectures. However, this also introduces a drawback, since one needs to find the "right" kernel function when using SVMs. It is interesting to observe that although the use of kernel functions seems to be one of the advantages of SVMs from a theoretical point of view, many experimental studies have used linear SVMs, which were found to perform better than higher-level kernels (Pontil & Verri, 1998). This might be due to the fact that higher-level kernels imply, in general, worse generalization bounds and thus require more examples to generalize well.

5.2 SNoW and SVMs.
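Given the support vectors, their labels, and the learned coefficients, the decision function above can be evaluated as follows. This is a minimal sketch; the α_i and b would come from the quadratic-programming solver, which is not shown, and the function names are our own.

```python
import math

def linear_kernel(x, z):
    # Dot product: the kernel of a linear SVM.
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    # Radial basis function kernel, one of the Mercer kernels mentioned above.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(x) = sgn( sum_i y_i * alpha_i * k(x, x_i) + b )"""
    s = sum(y * a * kernel(x, sv)
            for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s > 0 else -1
```

Only the support vectors (nonzero α_i) contribute to the sum, which is why a small number of SVs yields a compact classifier.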
It is worthwhile to discuss the similarities and differences between the computational approaches we experiment with and to develop expectations about differences in the results. At a conceptual level, both learning methods are very similar. They both search for a linear function that best separates the training data. Both are based on the same inductive principle: performing well on the training data
with a classifier of low expressivity would result in good generalization on data sampled from the same distribution. Both methods work by blowing up the original instance space to a high-dimensional space and attempt to find a linear classifier in the new space. This gives rise to one significant difference between the methods. SVMs are a close relative of additive update algorithms like the perceptron (Rosenblatt, 1958; Freund & Schapire, 1998). For these algorithms, the dimensionality increase is done via the kernel functions and thus need not be done explicitly. The multiplicative update rule used in SNoW does not allow the use of kernels, and the dimensionality increase has to be done explicitly. Computationally, this could be significant. However, SNoW allows for the use of a variable input space, and since the feature space is sparse, it turns out that SNoW is actually more efficient than current SVM implementations. This advantage is significant when the examples are sparse (the number of active features in each example is small), and it disappears when there are many active features in each example, where the kernel-based methods are advantageous computationally. In addition, RGFs, which are the notion equivalent to kernels, could allow for more general transformations than those allowed by kernels (although in this work, we use conjunctions, which are polynomial kernels). A second issue has to do with the way the two methods determine the coefficients of the linear classifier and the implication this has for their generalization abilities. In SVMs, the weights are determined based on a global optimization criterion that aims at maximizing the margin, using a quadratic programming scheme. The generalization bounds achieved this way are related to those achieved by the perceptron (Graepel, Herbrich, & Williamson, 2001).
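The contrast between the additive updates of the perceptron family and the multiplicative updates of the Winnow family can be made concrete. The sketch below is schematic; the learning rate and the promotion/demotion parameters are illustrative defaults, not values from this article.

```python
def perceptron_update(w, x, y, eta=1.0):
    """Additive update (perceptron family, the relatives of SVMs):
    on a mistake, add eta * y * x to the weight vector."""
    return [wi + eta * y * xi for wi, xi in zip(w, x)]

def winnow_update(w, active, promote, alpha=2.0, beta=0.5):
    """Multiplicative update (Winnow family, used in SNoW): on a mistake,
    multiply the weights of the active features by alpha (promotion)
    or beta (demotion); all other weights stay unchanged."""
    factor = alpha if promote else beta
    return [wi * factor if i in active else wi for i, wi in enumerate(w)]
```

The additive rule touches every coordinate of a dense example, while the multiplicative rule touches only the active features, which is what makes it attractive in sparse, high-dimensional spaces.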
SNoW makes use of an on-line algorithm that attempts to minimize the number of mistakes on the training data; the loss function used to determine the weight update rule can be traced to the maximum entropy principle (Kivinen & Warmuth, 1995a). The implication is that while, in the limit, SVMs might find the optimal linear separator, SNoW has significant advantages in sparse spaces—those in which only a few of the features are actually relevant (Kivinen & Warmuth, 1995a). We could expect, therefore, that in domains with these characteristics, if the number of training examples is limited, SNoW will generalize better (and in general will have better learning curves). In the limit, when sufficient examples are given, the methods will be comparable. Finally, there is one practical issue: SVMs are binary classifiers, while SNoW can be used as a multiclass classifier. However, to get a fair comparison, we use SNoW here as a binary classifier as well, as described below.

6 View-Based Object Recognition
The appearance of an object is the combined effect of its shape, reflectance properties, pose, and illumination in the scene. While shape and reflectance
are intrinsic properties that do not change for a rigid object, pose and illumination vary from one scene to another. View-based recognition methods attempt to use data observed under different poses and illumination conditions to learn a compact model of the object's appearance; this, in turn, is used to resolve the recognition problem from viewpoints that were not observed previously. A number of view-based schemes have been developed to recognize 3D objects. Poggio and Edelman (1990) show that 3D objects can be recognized from the raw intensity values in 2D images (we call this representation here a pixel-based representation) using a network of generalized radial basis functions. Each radial basis function generalizes and stores one of the example views and computes a weighting factor to minimize a measure of the error between the network's prediction and the desired output for each of the training examples. They argue and demonstrate that the full 3D structure of an object can be estimated if enough 2D views of the object are provided. This work has been extended to object categorization (Riesenhuber & Poggio, 2000). (See also Edelman, 1999, for more details.) Turk and Pentland (1991) demonstrate that human faces can be represented and recognized by "eigenfaces." Representing a face image as a vector of pixel values, the eigenfaces are the eigenvectors associated with the largest eigenvalues, computed from a covariance matrix of the sample vectors. An attractive feature of this method is that the eigenfaces can be learned from the sample images in the pixel-based representation without any feature selection. The eigenspace approach has since been used in vision tasks from face recognition to object tracking. Murase and Nayar (1995) and Nayar, Nene, and Murase (1996) develop a parametric eigenspace method to recognize 3D objects directly from their appearance.
For each object of interest, a set of images in which the object appears in different poses is obtained as training examples. Next, the eigenvectors are computed from the covariance matrix of the training set. The set of images is projected to a low-dimensional subspace spanned by a subset of eigenvectors, in which the object is represented as a manifold. A compact parametric model is constructed by interpolating the points in the subspace. In recognition, the image of a test object is projected to the subspace, and the object is recognized based on the manifold it lies on. Using a subset of COIL 100, they show that 3D objects can be recognized accurately from their appearance in real time. In contrast to these algebraic methods, general-purpose learning methods such as SVMs have also been used for this problem. Schölkopf (1997) applies SVMs to recognize 3D objects from 2D images and demonstrates the potential of this approach in visual learning. Pontil and Verri (1998) also use SVMs for 3D object recognition and experiment with a subset of the COIL 100 data set. Their training set consists of 36 images (one for every 10 degrees) for each of the 32 objects they chose, and the test set consists of the remaining 36 images for each object. For 20 random selections of 32
objects from the COIL 100, their system achieves a perfect recognition rate (but see the comments on this in section 7). Recently, Roobaert and Van Hulle (1999) also used a subset of the COIL 100 database to compare the performance of SVMs with different pixel-based input representations. Instead of using the whole appearance of an object for object recognition, several methods have used local features in visual learning. Le Cun et al. (1995) apply a convolutional neural network with local features to handwritten digit recognition, with very good results. They also demonstrate that their learning method is able to extract salient local features from example images without complicated and elaborate algorithms. The idea of estimating joint statistics of local features has been used in recent work. Amit and Geman (1999) use conjunctions of edges as local image features and apply tree classifiers to recognize handwritten digits. Schneiderman and Kanade (1998) use a naive Bayes classifier to model the joint distribution of features from face images and successfully use the learned model for face detection. Viola and colleagues (De Bonet & Viola, 1998; Rikert, Jones, & Viola, 1999; Tieu & Viola, 2000) assume that the appearance of an object in one image is generated by a sparse set of visual local causes (i.e., features). Their method computes a large set of selective features from examples to capture local causal structure and applies a variation of AdaBoost (Freund & Schapire, 1997) with gaussian models to learn a hypothesis of an object.
Other examples include object recognition using high-dimensional iconic representations (Rao & Ballard, 1995), multidimensional histograms (Schiele, 2000), local curve features (Nelson & Selinger, 1998), conjunctions of local features (Mel, 1997), SVMs with local features using wavelets (Papageorgiou & Poggio, 2000), local principal component analysis (Penev & Atick, 1996), and independent component analysis (Donato, Bartlett, Hager, Ekman, & Sejnowski, 1999). The current work is conceptually most related to a collection of other works that make use of local features and their joint statistics (De Bonet & Viola, 1998; Rikert et al., 1999; Nelson & Selinger, 1998; Amit & Geman, 1999; Schiele, 2000). As in these works, the key assumption underlying our work is that objects can be recognized (or discriminated) using simple representations in terms of syntactically simple relations over the raw image. Based on these assumptions, this work provides a learning-theory account for the problem of object recognition within the PAC model of learnability. Moreover, the computational approach developed and supported here is different from previous approaches and, we believe, more suitable to realistic visual learning situations. Although the number of these simple relations could be huge, at the basis of our computational approach is the belief that only a few of them are actually present in each observed image and that a fairly small number of those observed are relevant to discriminating an object. Under these assumptions, our framework has several theoretical advantages, which we described in the previous sections. For our framework to contribute to a practical solution,
there also needs to be a computational approach that is able to learn efficiently (in terms of both computation and number of examples) in the presence of a large number of potential explanations. Our evaluation of the theoretical framework makes use of the SNoW learning architecture (Roth, 1998; Carlson et al., 1999), which is tailored for this kind of task. Next, we use the COIL 100 data set for a quantitative experimental evaluation.

7 Experimental Evaluation
We use the COIL 100 database in all the experiments below (COIL is available on-line at www.cs.columbia.edu/CAVE). The data set consists of color images of 100 objects, where the images of the objects were taken at pose intervals of 5 degrees, for 72 poses per object. The images were also normalized such that the larger of the two object dimensions (height and width) fits the image size of 128 × 128 pixels. Figure 6 shows the images of the 100 objects taken in frontal view (i.e., zero pose angle). The 32 highlighted objects in Figure 6 are considered more difficult to recognize in Pontil and Verri (1998); we use all 100 objects, including these, in our experiments. Each color image is converted to a gray-scale image of 32 × 32 pixels for our experiments. 7.1 Ground Truth of the COIL 100 Data Set. At first glance, it seems difficult to recognize the objects in the COIL data set because it consists of a large number of objects with varying pose, texture, shape, and size. Since each object has 72 images of different poses (5 degrees apart), many view-based recognition methods use 36 of them (10 degrees apart) for training and the remaining images for testing. However, it turns out that under these dense sampling conditions, the recognition problem is not difficult (even when only gray-level images are used). In this case, instances that belong to the same object are very close to each other in the image space (where each data point represents an image of an object in a certain pose). We verified this by experimenting with a simple nearest-neighbor classifier (using the Euclidean distance), resulting in an average recognition rate of 98.50% (54 errors out of 3,600 tests). Figure 7 shows some of the objects misclassified by the nearest-neighbor method. In principle, one may want to avoid using the nearest-neighbor method, since it requires a lot of memory for storing templates and its recognition time complexity is high.
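The nearest-neighbor baseline described above amounts to the following sketch, assuming the images are given as flat gray-level vectors (the function names are our own):

```python
def euclidean(u, v):
    # Euclidean distance between two flattened image vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def nearest_neighbor(test_image, templates):
    """templates: list of (label, image_vector) pairs, one per stored view.
    Returns the label of the closest template under Euclidean distance."""
    best_label, _ = min(templates, key=lambda t: euclidean(test_image, t[1]))
    return best_label
```

Both the memory cost and the per-query time grow with the number of stored views, which is the drawback noted above.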
The goal here is simply to show that this method is comparable to the complex SVM approaches (Pontil & Verri, 1998; Roobaert & Hulle, 1999) for the case of dense sampling. Therefore, this dense-sampling recognition problem is not appropriate for comparing different methods. It is interesting to see that the pairs of objects that the nearest-neighbor method misclassified have similar geometric configurations and similar poses. A close inspection shows that most of the recognition errors are made between the three packs of chewing gum, the bottles, and the cars. Other
Figure 6: Columbia Object Image Library (COIL 100) consists of 100 objects of varying poses, 5 degrees apart. The objects are shown in row order; the highlighted ones are those considered more difficult to recognize.
dense-sampling cases are easier for this method. Consequently, the set of objects selected in an experiment has a direct effect on the recognition rate. This needs to be taken into account when evaluating results that use only a subset of the 100 objects (typically 20 to 30) from the COIL data set for experiments. Table 1 shows the recognition rates of nearest-neighbor classifiers in several experiments in which 36 poses of each object are used as templates and the remaining 36 poses are used for tests. Given this baseline experiment, we decided to perform our experimental comparisons in cases in which the number of views of objects available in training is limited. Some of our preliminary results were presented in Yang, Roth, and Ahuja (2000a, 2000b).
Figure 7: Mismatched objects using the nearest-neighbor method. (x: a, y: b) means that object x with view angle a is recognized as object y with view angle b. The figure shows some of the 54 errors (out of 3600 test samples) made by the nearest-neighbor classifier when there are 36 views per object in the training set.

Table 1: Recognition Rates of Nearest-Neighbor Classifier.

                    30 Objects Randomly    32 Objects Shown in Figure 6,          The 100 Objects
                    Selected from COIL     Selected by Pontil and Verri (1998)    in COIL
Errors/tests        14/1080                46/1152                                54/3600
Recognition rate    98.70%                 96.00%                                 98.50%
7.2 Experiment Setups. Applying SNoW to 3D object recognition requires specifying the architecture used and the representation chosen for the input images. To perform object recognition, we associate a target unit with each target object. This target learns a definition of the object in terms of the input features extracted from the image. We could define either a single SNoW unit that contains target subnetworks for all 100 different target objects or different units, each with several (e.g., two) competing target objects. Statistically, the latter approach is advantageous (Hastie & Tibshirani, 1998), although it clearly requires a lot more computation. The architecture selected affects the training time, where learning a definition for object a makes use of negative examples of other objects that are part of the same unit. More important, it makes a difference in testing; rather than two competing objects for a decision, there may be a hundred. The chances of a spurious mistake caused by an incidental view point are clearly much higher. It also
Table 2: Experimental Results of Three Classifiers Using the 100 Objects in the COIL-100 Data Set.

                       Number of Views per Object
                       36             18             8              4
                       (3600 tests)   (5400 tests)   (6400 tests)   (6800 tests)
SNoW                   95.81%         92.31%         85.13%         81.46%
Linear SVM             96.03%         91.30%         84.80%         78.50%
Nearest neighbor       98.50%         87.54%         79.52%         74.63%
has significant advantages in terms of space complexity and the appeal of the evaluation mode. SVMs are two-class classifiers; a c-class pattern recognition problem usually requires training c(c − 1)/2 binary classifiers. Since we compare the performance of the proposed SNoW-based method with SVMs, in order to maintain a fair comparison we perform it in the one-against-one scheme. That is, we use SNoW units of size two. To classify a test instance, a tournament-like pairwise competition between all the machines is performed, and the winner determines the label of the test instance. Table 2 shows the recognition rates of the SVM- and SNoW-based methods using the one-against-one scheme. (That is, we trained C(100, 2) = 4950 classifiers for each method and evaluated 99 (= 50 + 25 + 12 + 6 + 3 + 2 + 1) classifiers on each test instance.) 7.3 Results Using Pixel-Based Representation. Table 2 shows the recognition rates of the SNoW-based method, the SVM-based method (using the linear dot product as the kernel function), and the nearest-neighbor classifier on the COIL 100 data set. The important parameter we vary here is the number of views (v) observed during training; the rest of the views (72 − v) are used for testing. The experimental results show that the SNoW-based method performs as well as the SVM-based method when many views of the objects are present during training and outperforms the SVM-based method when the number of views is limited. Although it is not surprising to see that the recognition rate decreases as the number of views available during training decreases, it is worth noticing that both SNoW and SVM are capable of recognizing the 3D objects in the COIL 100 data set with satisfactory performance if enough views (e.g., more than 18) are provided. They also seem to be fairly robust even if only a limited number of views (e.g., 8 and 4) are used for training; the performance of both methods degrades gracefully.
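The tournament-like pairwise competition of section 7.2 can be sketched as follows. This is a simplified single-elimination tournament of our own devising; `classifiers` is assumed to map an unordered pair of labels to a binary classifier that returns the winning label.

```python
def tournament_predict(instance, labels, classifiers):
    """Single-elimination tournament over all class labels.
    classifiers[frozenset({a, b})](instance) returns either a or b."""
    remaining = list(labels)
    while len(remaining) > 1:
        next_round = []
        # Pair up the survivors; an odd one out gets a bye.
        for i in range(0, len(remaining) - 1, 2):
            a, b = remaining[i], remaining[i + 1]
            next_round.append(classifiers[frozenset({a, b})](instance))
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])
        remaining = next_round
    return remaining[0]
```

With 100 entrants and byes for odd rounds, this evaluates 50 + 25 + 12 + 6 + 3 + 2 + 1 = 99 classifiers per test instance, matching the count given above.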
An additional potential advantage of the SNoW architecture is that it does not learn discriminators but rather can learn a representation for each object,
Table 3: Recognition Rates of SNoW Using Two Learning Paradigms.

                       Number of Views per Object
SNoW                   36         18         8          4
One-against-one        95.81%     92.31%     85.13%     81.46%
One-against-all        90.52%     84.50%     81.85%     76.00%
which can then be used for prediction in the one-against-all scheme or to build hierarchical representations. See Figure 2 for examples. However, as Table 3 shows, this implies a significant degradation in performance. Finding a way to make better predictions in the one-against-all scheme is one of the important issues for future investigation, in order to better exploit the advantages of this approach.

7.4 Results Using Edge-Based Representation. For each 32 × 32 edge map, we extract horizontal and vertical edges (of length at least 3 pixels) and then encode conjunctions of two of these edges as our features. The number of potential features of this sort is C(2048, 2) = 2,096,128. However, only an average of 1822 of these are active for objects in the COIL 100 data set. To reduce the computational cost, the feature vectors were further pruned, and only the 512 most frequently occurring features were retained in each image. Table 4 shows the performance of the SNoW-based method when conjunctions of edges are used to represent objects. As before, we vary the number of views of an object (v) used during training and use the rest of the views (72 − v) for testing. The results indicate that conjunctions of edges provide useful information for object recognition and that SNoW is able to learn very good object representations using these features. The experimental results also exhibit the relative advantage of this representation when the number of views per object is limited.

7.5 Simple Occlusion. This section presents some preliminary studies of object recognition in the presence of occlusion. Our current modeling of occlusion is fairly simplistic relative to studies such as Nelson and Selinger (1998) and Schiele (2000). However, given that our theoretical paradigm supports recognition in the presence of occlusion, we wanted to experiment with it in the current setting. We select a set of 10 objects⁶ from the COIL 100 data set and add artificial occlusions for the experiments.
In the data set, each object has 36 images (10 degrees apart) for training and the remaining 36 images for tests (also 10 degrees apart).

6. More specifically, the objects are selected from the set of objects on which the nearest-neighbor classifier makes the most mistakes: objects 8, 13, 23, 27, 31, 42, 65, 78, 80, 91.
1098
D. Roth, M.-H. Yang, and N. Ahuja
Table 4: Experimental Results of Three Classifiers Using the 100 Objects in the COIL-100 Data Set.

                                        Number of Views per Object
                                     36           18            8            4
                                (3600 tests) (5400 tests) (6400 tests) (6800 tests)
SNoW with conjunction of edges     96.25%       94.13%       89.23%       88.28%
SNoW with intensity values         95.81%       92.31%       85.13%       81.46%
Linear support vector machine      96.03%       91.30%       84.80%       78.50%
Nearest neighbor                   98.50%       87.54%       79.52%       74.63%
Figure 8: Object images with and without occlusion as well as their edge maps.
The object images are occluded by a strip controlled by four parameters (a, p, l, g), where a denotes the angle of the strip, p the percentage of occluded image area, l the location of the center of the strip, and g the intensity value of the strip. Figure 8 shows some object images and the occluded object images for {a, p, l, g} = {45°, 15%, (16, 16), 0}. Our trained SNoW classifier is tested against this data set using the edge-based representation. Table 5 shows the experimental results with and without occlusion on this set of 10 objects with 36 views. The recognition performance degrades only slightly, from 92.03% to 88.78%. Note that the objects are those on which the nearest-neighbor classifier makes the most mistakes. Although our theoretical framework supports noise tolerance, it is clear that the limited expressivity of the features used in this experimental study limits the ability to tolerate occlusion; we nevertheless find our results promising.

Table 5: Experimental Results of SNoW Classifier on Occluded Images with 36 Views per Object.

          Recognition Rate Without Occlusion    Recognition Rate With Occlusion
SNoW                   92.03%                                88.78%

Learning to Recognize Three-Dimensional Objects
1099

8 Conclusion
The main contribution of this work is proposing a learning framework for visual learning and exhibiting its feasibility. In this approach, learnability can be rigorously studied without making assumptions on the distribution of the observed objects; instead, via the PAC model, the performance of the learned hypothesis naturally depends on its prior experience. An important feature of the approach is that learning is studied not directly in terms of the raw data but rather with respect to intermediate representations extracted from it and can thus be quantified in terms of the ability to generate expressive intermediate representations. In particular, it makes explicit the requirements from these representations to allow learnability. We believe that research in vision should concentrate on the study of these intermediate representations.

We evaluated the approach and demonstrated its feasibility in a large-scale experiment in the context of learning for object recognition. Our experiments also allowed us to perform a fair comparison between two successful and related learning methods and study them in the context of object recognition. We have illustrated our approach in a large-scale experimental study in which we use the SNoW learning architecture to learn representations for the objects in COIL 100. Although it is clear that object recognition in isolation is not the ultimate goal, this study shows the potential of this computational approach as a basis for studying and supporting more realistic visual inferences. We note that for a fair comparison among different methods, we have used a pixel-based representation in the experiments. The experimental results suggest that the edge-based representation used is more effective and robust and should be the starting point for future research. There is no question that the RGFs used in this work are not general enough to support more challenging recognition problems; the intention was merely to exhibit the general approach.
We believe that pursuing the direction of using complex intermediate representations will benefit future work on recognition and, in particular, robust recognition under realistic conditions. The framework developed here is very general. The explanations used as features by the learning algorithm can represent a variety of computational processes and information sources that operate on the image. They can depend on local properties of the image, the relative positions of primitives in the image, and even external information sources or context variables. Thus, the theoretical support given here applies also to an intermediate learning stage in a hierarchical process. In order to generate the explanations efficiently, this work assumes that they are syntactically simple in terms of the raw image. However, the explanation might as well be syntactically
simple in terms of previously learned or inferred predicates, giving rise to a hierarchical representation. We believe that the key future research question suggested by this line of work is that of incorporating more visual knowledge into instantiations of this framework and, in particular, using it to generate better explanations.
Acknowledgments
D. R. is supported by National Science Foundation grants IIS-9801638, IIS-9984168, and ITR-IIS-0085836. M.-H. Y. was supported by a Ray Ozzie Fellowship and Office of Naval Research (ONR) grant N00014-96-1-0502 and is currently with Honda Fundamental Research Labs, Mountain View, California. N. A. is supported by ONR grant N00014-96-1-0502.
References

Amit, Y., & Geman, D. (1999). A computational model for visual selection. Neural Computation, 11(7), 1691–1715.
Angluin, D. (1992). Computational learning theory: Survey and selected bibliography. In Proceedings of the 24th Annual ACM Symposium on Theory of Computing (pp. 351–369). Victoria, B.C., Canada.
Blum, A. (1992). Learning Boolean functions in an infinite attribute space. Machine Learning, 9(4), 373–386.
Carlson, A., Cumby, C., Rosen, J., & Roth, D. (1999). The SNoW learning architecture (Tech. Rep. No. UIUCDCS-R-99-2101). Urbana, IL: UIUC Computer Science Department.
Carlson, A. J., Rosen, J., & Roth, D. (2001). Scaling up context-sensitive text correction. In Proceedings of the Thirteenth Innovative Applications of Artificial Intelligence Conference. Seattle, WA.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20(3), 273–297.
Cumby, C., & Roth, D. (2000). Relational representations that facilitate learning. In Proceedings of the International Conference on the Principles of Knowledge Representation and Reasoning (pp. 425–434). Breckenridge, CO.
De Bonet, J., & Viola, P. (1998). Texture recognition using a non-parametric multi-scale statistical model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 641–647). Santa Barbara, CA.
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10), 974–989.
Edelman, S. (1993). On learning to recognize 3-D objects from examples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(8), 833–837.
Edelman, S. (1999). Representation and recognition in vision. Cambridge, MA: MIT Press.
Fleuret, F., & Geman, D. (1999). Graded learning for object detection. In Proceedings of the IEEE Workshop on Statistical and Computational Theories of Vision. Fort Collins, CO.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Freund, Y., & Schapire, R. (1998). Large margin classification using the perceptron algorithm. In Proceedings of the Annual ACM Workshop on Computational Learning Theory (pp. 209–217). Madison, WI.
Golding, A. R., & Roth, D. (1999). A Winnow-based approach to context-sensitive spelling correction. Machine Learning, 34(1–3), 107–130. (Special issue on machine learning and natural language.)
Graepel, T., Herbrich, R., & Williamson, R. C. (2001). From margin to sparsity. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 210–216). Cambridge, MA: MIT Press.
Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 507–513). Cambridge, MA: MIT Press.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1), 78–150.
Haussler, D., Kearns, M., Littlestone, N., & Warmuth, M. K. (1988). Equivalence of models for polynomial learnability. In Proceedings of the 1988 Workshop on Computational Learning Theory (pp. 42–55). Cambridge, MA.
Kearns, M., & Li, M. (1993). Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4), 807–837.
Kearns, M. J., Schapire, R. E., & Sellie, L. M. (1994). Toward efficient agnostic learning. Machine Learning, 17(2/3), 115–142.
Kivinen, J., & Warmuth, M. K. (1995a). Additive versus exponentiated gradient updates for linear prediction. In Proceedings of the Annual ACM Symposium on Theory of Computing (pp. 209–218). Las Vegas, NV.
Kivinen, J., & Warmuth, M. K. (1995b). The perceptron algorithm vs. Winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant. In Proceedings of the 8th Annual Conference on Computational Learning Theory (pp. 289–296). Santa Cruz, CA.
Kushilevitz, E., & Roth, D. (1996). On learning visual concepts and DNF formulae. Machine Learning, 24(1), 65–85.
Le Cun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., Drucker, H., Guyon, I., Müller, U., Säckinger, E., Simard, P., & Vapnik, V. (1995). Comparison of learning algorithms for handwritten digit recognition. In Proceedings of the International Conference on Artificial Neural Networks (pp. 53–60). Paris, France.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.
Littlestone, N. (1991). Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory (pp. 147–156). Santa Cruz, CA.
Mel, B. W. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9, 777–804.
Mel, B. W., & Fiser, J. (2000). Minimizing binding errors using learned conjunctive features. Neural Computation, 12, 247–278.
Munoz, M., Punyakanok, V., Roth, D., & Zimak, D. (1999). A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 168–178). College Park, MD.
Murase, H., & Nayar, S. K. (1995). Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14, 5–24.
Nayar, S. K., Nene, S. A., & Murase, H. (1996). Real-time 100 object recognition system. In Proceedings of the IEEE International Conference on Robotics and Automation. Minneapolis, MN.
Nelson, R. C., & Selinger, A. (1998). Large-scale tests of a keyed, appearance-based 3-D object recognition system. Vision Research, 38(15/16), 2469–2488.
Papageorgiou, C., & Poggio, T. (2000). A trainable system for object detection. International Journal of Computer Vision, 38(1), 15–33.
Penev, P., & Atick, J. (1996). Local feature analysis: A general statistical theory for object representation. Network: Computation in Neural Systems, 7(3), 477–500.
Poggio, T., & Edelman, S. (1990). A network that learns to recognize 3D objects. Nature, 343, 263–266.
Pontil, M., & Verri, A. (1998). Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6), 637–646.
Rao, R. P. N., & Ballard, D. H. (1995). An active vision architecture based on iconic representations. Artificial Intelligence, 78(1–2), 461–505.
Rikert, T., Jones, M., & Viola, P. (1999). A cluster-based statistical model for object detection. In Proceedings of the Seventh IEEE International Conference on Computer Vision (pp. 1046–1053). Kerkyra, Greece.
Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature Neuroscience, 3, 1199–1204.
Roobaert, D., & Hulle, M. V. (1999). View-based 3D object recognition with support vector machines. In IEEE International Workshop on Neural Networks for Signal Processing (pp. 77–84). Madison, WI.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–407.
Roth, D. (1998). Learning to resolve natural language ambiguities: A unified approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (pp. 806–813). Madison, WI.
Schiele, B. (2000). Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision, 36(1), 31–50.
Schneiderman, H., & Kanade, T. (1998). Probabilistic modeling of local appearance and spatial relationships for object recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (pp. 45–51). Santa Barbara, CA.
Schölkopf, B. (1997). Support vector learning. Munich: Oldenbourg Verlag.
Shvaytser, H. (1990). Learnable and nonlearnable visual concepts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5), 459–466.
Tieu, K., & Viola, P. (2000). Boosting image retrieval. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 228–235). Hilton Head, SC.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Ullman, S., & Soloviev, S. (1999). Computation of pattern invariance in brain-like structures. Neural Networks, 12, 1021–1036.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Yang, M.-H., Roth, D., & Ahuja, N. (2000a). Learning to recognize 3D objects with SNoW. In Proceedings of the Sixth European Conference on Computer Vision (Vol. 1, pp. 439–454). Dublin, Ireland.
Yang, M.-H., Roth, D., & Ahuja, N. (2000b). View-based 3D object recognition using SNoW. In Proceedings of the Fourth Asian Conference on Computer Vision (Vol. 2, pp. 830–835). Denver, CO.
Yang, M.-H., Roth, D., & Ahuja, N. (2000c). A SNoW-based face detector. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 855–861). Cambridge, MA: MIT Press.

Received August 23, 2000; accepted July 27, 2001.
LETTER
Communicated by John Platt
A Parallel Mixture of SVMs for Very Large Scale Problems

Ronan Collobert
[email protected] and [email protected]
Dalle Molle Institute for Perceptual Artificial Intelligence, 1920 Martigny, Switzerland, and Université de Montréal, DIRO, Montréal, Québec, Canada

Samy Bengio
[email protected]
Dalle Molle Institute for Perceptual Artificial Intelligence, 1920 Martigny, Switzerland

Yoshua Bengio
[email protected]
Université de Montréal, DIRO, Montréal, Québec, Canada

Support vector machines (SVMs) are the state-of-the-art models for many classification problems, but they suffer from the complexity of their training algorithm, which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. This article proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole data set. Experiments on a large benchmark data set (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and surprisingly, a significant improvement in generalization was observed.

1 Introduction
Recently, a lot of work has been done around support vector machines (Vapnik, 1995), mainly due to their impressive generalization performance on classification problems when compared to other algorithms such as artificial neural networks (Cortes & Vapnik, 1995; Osuna, Freund, & Girosi, 1997). However, SVMs need resources that are at least quadratic in the number of training examples in order to solve a quadratic optimization problem, and it is thus hopeless to try to solve problems having millions of examples using classical SVMs. In order to overcome this drawback, we propose in this article to use a mixture of several SVMs, each of them trained on only a part of the data set. The idea of an SVM mixture is not new, although previous attempts such as Kwok's article (1998) on support vector mixtures trained

Neural Computation 14, 1105–1114 (2002)  © 2002 Massachusetts Institute of Technology
1106
Ronan Collobert, Samy Bengio, and Yoshua Bengio
the SVMs not on part of the data set but on the whole data set and hence could not overcome the time complexity problem for large data sets. We propose here a simple method to train such a mixture and will show that in practice, this method is much faster than training only one SVM and leads to results that are at least as good as one SVM. We conjecture that the training time complexity of the proposed approach with respect to the number of examples is subquadratic for large data sets. Moreover, this mixture can be easily parallelized, which could improve the training time significantly. In the next section, we briefly introduce the SVM model for classification. In section 3, we present our mixture of SVMs, followed in section 4 by some comparisons to related models. In section 5, we show some experimental results, first on a toy data set and then on a large real-life data set. A brief conclusion follows.

2 Introduction to Support Vector Machines
SVMs (Vapnik, 1995) have been applied to many classification problems, generally yielding good performance compared to other algorithms. The decision function is of the form

    y = \mathrm{sign}\left( \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) + b \right),        (2.1)

where $x \in \mathbb{R}^d$ is the d-dimensional input vector of a test example, $y \in \{-1, 1\}$ is a class label, $x_i \in \mathbb{R}^d$ is the input vector for the ith training example, N is the number of training examples, $K(x, x_i)$ is a positive definite kernel function, and $\alpha = \{\alpha_1, \ldots, \alpha_N\}$ and b are the parameters of the model. Training an SVM consists of finding $\alpha$ that minimizes the objective function

    Q(\alpha) = -\sum_{i=1}^{N} \alpha_i + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j),        (2.2)

subject to the constraints

    \sum_{i=1}^{N} \alpha_i y_i = 0        (2.3)

and

    0 \le \alpha_i \le C \quad \forall i.        (2.4)
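For concreteness, the decision function of equation 2.1 can be evaluated as in the sketch below, here with the RBF kernel of equation 2.5. This is an illustrative fragment, not the authors' implementation: the function names are ours, and the multipliers α and the bias b are assumed to come from a QP solver for equations 2.2 through 2.4 (e.g., SMO or SVMTorch), which is not shown.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    # Eq. 2.5: K(x, z) = exp(-||x - z||^2 / sigma^2).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

def svm_decision(x, support, alphas, labels, b=0.0, kernel=rbf_kernel):
    # Eq. 2.1: y = sign(sum_i y_i * alpha_i * K(x, x_i) + b).
    s = sum(y_i * a_i * kernel(x, x_i)
            for x_i, a_i, y_i in zip(support, alphas, labels)) + b
    return 1 if s >= 0 else -1
```

In practice only examples with α_i > 0 (the support vectors) contribute to the sum, so the number of support vectors governs prediction cost.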
SVMs for Very Large Scale Problems
1107
The kernel $K(x, x_i)$ can have different forms, such as the radial basis function (RBF),

    K(x_i, x_j) = \exp\left( \frac{-\|x_i - x_j\|^2}{\sigma^2} \right),        (2.5)

with parameter $\sigma$. Therefore, to train an SVM, we need to solve a quadratic optimization problem where the number of parameters is N. This makes the use of SVMs for large data sets difficult: computing $K(x_i, x_j)$ for every training pair would require $O(N^2)$ computation, and solving may take up to $O(N^3)$. Note, however, that current state-of-the-art algorithms appear to have a training time complexity scaling much closer to $O(N^2)$ than $O(N^3)$ (Collobert & Bengio, 2001; Joachims, 1999; Platt, 1999).

3 Mixtures of SVMs
In this section, we introduce a new type of mixture of SVMs. The proposed model should minimize the following cost function,

    C = \sum_{i=1}^{N} \left[ h\left( \sum_{m=1}^{M} w_m(x_i) s_m(x_i) \right) - y_i \right]^2,        (3.1)
where M is the number of experts in the mixture, $s_m(x_i)$ is the output of the mth expert given input $x_i$, $w_m(x_i)$ is the weight for the mth expert given by a "gater" module also taking $x_i$ as input, and h is a transfer function that could be, for example, the hyperbolic tangent for classification tasks. Here, each expert is an SVM, and we took a neural network for the gater in our experiments. To train this model, we propose a very simple algorithm:

1. Divide the training set into M random subsets of size near N/M.

2. Train each expert separately over one of these subsets.

3. Keeping the experts fixed, train the gater to minimize the cost, equation 3.1, on the whole training set.

4. Reconstruct M subsets. For each example $(x_i, y_i)$, sort the experts in descending order according to the values $w_m(x_i)$, and assign the example to the first expert in the list that has fewer than (N/M + c) examples, where c is a small positive constant, in order to ensure a balance between the experts.
5. If a termination criterion is not fulfilled (such as a given number of iterations or a validation error going up), go to step 2.

Note that step 2 of this algorithm can be easily implemented in parallel, as each expert can be trained separately on a different computer. Note also that step 3 can be an approximate minimization (as usually done when training neural networks).

4 Related Models
The idea of mixture models is quite old and has given rise to very popular algorithms, such as the well-known mixture of experts (Jacobs, Jordan, Nowlan, & Hinton, 1991), where the cost function is similar to equation 3.1 but where the gater and the experts are trained, using gradient descent or expectation-maximization, on the whole data set (and not subsets) and their parameters are trained simultaneously. Hence, such an algorithm is quite demanding in terms of resources when the data set is large if training time scales like $O(N^p)$ with p > 1.

In the more recent SVM mixture model (Kwok, 1998), the author shows how to replace the experts (typically neural networks) by SVMs and gives a learning algorithm for this model. Once again, the resulting mixture is trained jointly on the whole data set and hence does not overcome the quadratic barrier when the data set is large.

In another divide-and-conquer approach (Rida, Labbi, & Pellegrini, 1999), the authors propose to divide the training set using an unsupervised algorithm to cluster the data (typically a mixture of gaussians), train an expert (such as an SVM) on each subset of the data corresponding to a cluster, and finally recombine the outputs of the experts. Here, the algorithm does indeed separately train the experts on small data sets, like the present algorithm, but there is no notion of a loop reassigning the examples to experts according to the gater's prediction of how well each expert performs on each example. Our experiments suggest that this element is essential to the success of the algorithm.

5 Experiments
In this section, we present two sets of experiments comparing the new mixture of SVMs to other machine learning algorithms.

5.1 A Toy Problem. In the first series of experiments, we tested the mixture on an artificial toy problem where we generated 1,000 training examples and 10,000 test examples. The problem had two nonlinearly separable classes and two input dimensions. Figure 1 shows the decision surfaces obtained first by a linear SVM, then by a gaussian SVM, and finally by the
Figure 1: Comparison of the decision surfaces obtained by (a) a linear SVM, (b) a gaussian SVM, and (c) a linear mixture of two linear SVMs, on a two-dimensional classification toy problem.
proposed mixture of SVMs. Moreover, in the proposed mixture, the gater was a simple linear function and there were two linear SVMs in the mixture. This artificial problem thus shows clearly that the algorithm seems to work and is able to combine, even linearly, very simple models in order to produce a nonlinear decision surface.

5.2 A Large-Scale Realistic Problem: Forest. For a more realistic problem, we did a series of experiments on part of the UCI Forest data set.² Since this is a classification problem with seven classes, we modified it into a binary classification problem where the goal was to separate class 2 from the other six classes.

2. The Forest data set is available on the UCI web site at the following address: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/covtype.info.
Table 1: Comparison of Performances.

                              Error (%)              Time (minutes)     Total Number of
                         Train   Valid   Test        1 cpu   50 cpu     Support Vectors
One MLP (a)              17.56   18.12   18.15          12      –               –
One SVM                  16.03   16.68   16.76        3231      –            42,451
Uniform SVM mixture (b)  19.69   19.90   20.31          85      2            52,846
Gated SVM mixture         5.91    8.90    9.28         237     73            31,703

Notes: (a) 100 hidden units. (b) The gater always outputs the same value for each expert.
The data set had more than 500,000 examples, and this enabled us to prepare a series of experiments as follows:

- We kept a test set of 50,000 examples to compare the best mixture of SVMs to other learning algorithms.
- We used a validation set of 10,000 examples to select the best mixture of SVMs, varying the number of experts and the number of hidden units in the gater.
- We trained our models on different training sets, using from 100,000 to 400,000 examples.

The mixtures had from 10 to 50 expert SVMs with gaussian kernel, and the gater was a multilayer perceptron (MLP) with between 25 and 500 hidden units. Because the number of examples was quite large, we selected the internal training parameters, such as the σ of the gaussian kernel of the SVMs or the learning rate of the gater, using a held-out portion of the training set. We compared our models to:

- A single MLP, where the number of hidden units was selected by cross-validation between 25 and 250 units.
- A single SVM, where the parameter of the kernel was also selected by cross-validation.
- A mixture of SVMs where the gater was replaced by a constant vector, assigning the same weight value to every expert.

Table 1 gives the results of a first series of experiments with a fixed training set of 100,000 examples. We selected among the variants of the gated SVM mixture not only using performance on the validation set but also taking into account the time to train the model. The selected model had 50 experts and a gater with 150 hidden units. A model with 500 hidden units would have given a performance of 8.1% over the test set but would have taken 621 minutes on one machine (and 388 minutes on 50 machines).
Figure 2: Comparison of the validation error of different mixtures of SVMs with various numbers of hidden units and experts.
As can be seen, the gated SVM outperformed all models in terms of training, validation, and test error. It was also much faster, even on one machine, than the SVM, and since the mixture could easily be parallelized (each expert can be trained separately), we also give the time it took to train on 50 machines. It is interesting to note that the total number of support vectors of the gated SVM was smaller than the number of support vectors of the SVM. In a first attempt to understand these results, we can at least say that the power of the model lies not only in the gater, since a single MLP had quite bad performance on the test set; not only in our use of SVMs, since a single SVM was not as good as the gated mixture; and not only in dividing the problem into many subproblems, since the uniform mixture also performed badly. It seems to be a combination of all these elements. We also did a series of experiments in order to see the influence of the number of hidden units of the gater as well as the number of experts in the mixture. Figure 2 shows the validation error of different mixtures of SVMs, where the number of hidden units varied from 25 to 500 and the number of experts varied from 10 to 50. There is a clear performance improvement when the number of hidden units is increased, while the improvement with
Figure 3: Comparison of the training time of the same mixture of SVMs (50 experts, 150 hidden units in the gater) trained on different training set sizes, from 100,000 to 400,000.
additional experts exists but is less clear. Note, however, that the training time also increases rapidly with the number of hidden units and slightly decreases with the number of experts if one uses one computer per expert. In order to determine how the algorithm scaled with respect to the number of examples, we then compared the same mixture of experts (50 experts, 150 hidden units in the gater) on different training set sizes. Figure 3 shows the training time of the mixture of SVMs trained on training sets of sizes from 100,000 to 400,000. It seems that at least in this range and for this particular data set, the mixture of SVMs scales linearly with respect to the number of examples and not quadratically, like a classical SVM. It is interesting to see, for instance, that the mixture of SVMs was able to solve a problem of 400,000 examples in less than 7 hours (on 50 computers), while it would have taken more than one month to solve the same problem with a single SVM.
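A back-of-the-envelope cost model makes the observed speedup plausible (this is our illustrative arithmetic, not the authors' analysis): if training one SVM on n examples costs roughly c·n² operations, then training M experts on subsets of size N/M costs roughly M·c·(N/M)² = c·N²/M per outer iteration, an M-fold reduction even before parallelizing the experts across machines, and ignoring the gater, which is trained by stochastic gradient.

```python
def svm_cost(n, c=1.0):
    # Toy model: training one SVM on n examples ~ c * n^2 operations.
    return c * n ** 2

def mixture_cost(n, m, iterations=1, c=1.0):
    # Toy model for the mixture: m experts, each trained on n/m examples,
    # repeated for each outer-loop iteration; the gater's cost is ignored.
    return iterations * m * svm_cost(n / m, c)

n, m = 100_000, 50
print(svm_cost(n) / mixture_cost(n, m))  # 50.0: an m-fold reduction
```

With M fixed this model is still quadratic asymptotically, so it only explains a constant-factor saving; the empirically linear growth in Figure 3 over 100,000 to 400,000 examples is a stronger, data-set-specific observation.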
Figure 4: Comparison of the training and validation errors of the mixture of SVMs as a function of the number of training iterations.
Finally, Figure 4 shows the evolution of the training and validation errors of a mixture of 50 SVMs gated by an MLP with 150 hidden units, during five iterations of the algorithm. This should be convincing that the loop of the algorithm is essential to obtain good performance.

6 Conclusion
In this article, we have presented a new algorithm to train a mixture of SVMs that gave very good results compared to classical SVMs in terms of training time, generalization performance, and sparseness. Moreover, the algorithm appears to scale linearly with the number of examples, at least between 100,000 and 400,000 examples. Furthermore, the proposed algorithm has also been tested on another task with a similar number of examples and also yielded performance improvements. These results are extremely encouraging and suggest that the proposed method could allow training SVM-like models on very large, multimillion-example data sets in a reasonable time. If training of the neural network gater with
stochastic gradient takes time that grows much less than quadratically, as we believe to be the case for very large data sets (to reach a "good enough" solution), then the whole method is clearly subquadratic in training time with respect to the number of training examples.

Acknowledgments
R. C. thanks the Swiss National Science Foundation for financial support (project FN2100-061234.00). Y. B. thanks the NSERC funding agency and the NCM2 network for support.

References

Collobert, R., & Bengio, S. (2001). SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1, 143–160.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Joachims, T. (1999). Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods. Cambridge, MA: MIT Press.
Kwok, J. T. (1998). Support vector mixture for classification and regression problems. In Proceedings of the International Conference on Pattern Recognition (ICPR) (pp. 255–258). Brisbane, Queensland, Australia.
Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 130–136). San Juan, Puerto Rico.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods. Cambridge, MA: MIT Press.
Rida, A., Labbi, A., & Pellegrini, C. (1999). Local experts combination through density decomposition. In International Workshop on AI and Statistics (Uncertainty '99). San Mateo, CA: Morgan Kaufmann.
Vapnik, V. N. (1995). The nature of statistical learning theory (2nd ed.). New York: Springer-Verlag.

Received May 15, 2001; accepted August 13, 2001.
LETTER
Communicated by John Platt
Bayesian Framework for Least-Squares Support Vector Machine Classifiers, Gaussian Processes, and Kernel Fisher Discriminant Analysis

T. Van Gestel
tony.vangestel@esat.kuleuven.ac.be
J. A. K. Suykens
johan.suykens@esat.kuleuven.ac.be
G. Lanckriet
gert.lanckriet@esat.kuleuven.ac.be
A. Lambrechts
annemie.lambrechts@esat.kuleuven.ac.be
B. De Moor
bart.demoor@esat.kuleuven.ac.be
J. Vandewalle
joos.vandewalle@esat.kuleuven.ac.be
Katholieke Universiteit Leuven, Department of Electrical Engineering ESAT-SISTA, B-3001 Leuven, Belgium

The Bayesian evidence framework has been successfully applied to the design of multilayer perceptrons (MLPs) in the work of MacKay. Nevertheless, the training of MLPs suffers from drawbacks like the nonconvex optimization problem and the choice of the number of hidden units. In support vector machines (SVMs) for classification, as introduced by Vapnik, a nonlinear decision boundary is obtained by mapping the input vector first in a nonlinear way to a high-dimensional kernel-induced feature space in which a linear large margin classifier is constructed. Practical expressions are formulated in the dual space in terms of the related kernel function, and the solution follows from a (convex) quadratic programming (QP) problem. In least-squares SVMs (LS-SVMs), the SVM problem formulation is modified by introducing a least-squares cost function and equality instead of inequality constraints, and the solution follows from a linear system in the dual space. Implicitly, the least-squares formulation corresponds to a regression formulation and is also related to kernel Fisher discriminant analysis. The least-squares regression formulation has advantages for deriving analytic expressions in a Bayesian evidence framework, in contrast to the classification formulations used, for example, in gaussian processes (GPs).
The LS-SVM formulation has clear primal-dual interpretations, and without the bias term, one explicitly constructs a model that yields the same expressions as have been obtained with GPs for regression. In this article, the Bayesian evidence framework is combined with the LS-SVM classifier formulation. Starting from the feature space formulation, analytic expressions are obtained in the dual space on the different levels of Bayesian inference, while posterior class probabilities are obtained by marginalizing over the model parameters. Empirical results obtained on 10 public domain data sets show that the LS-SVM classifier designed within the Bayesian evidence framework consistently yields good generalization performances.

Neural Computation 14, 1115–1147 (2002) © 2002 Massachusetts Institute of Technology

1 Introduction
Bayesian probability theory provides a unifying framework to find models that are well matched to the data and to use these models for making optimal decisions. Multilayer perceptrons (MLPs) are popular nonlinear parametric models for both regression and classification. In MacKay (1992, 1995, 1999), the evidence framework was successfully applied to the training of MLPs using three levels of Bayesian inference: the model parameters, regularization hyperparameters, and network structure are inferred on the first, second, and third level, respectively. The moderated output is obtained by marginalizing over the model parameters and hyperparameters using a Laplace approximation in a local optimum. Whereas MLPs are flexible nonlinear parametric models that can approximate any continuous nonlinear function over a compact interval (Bishop, 1995), the training of an MLP suffers from drawbacks like the nonconvex optimization problem and the choice of the number of hidden units. In support vector machines (SVMs), the classification problem is formulated and represented as a convex quadratic programming (QP) problem (Cristianini & Shawe-Taylor, 2000; Vapnik, 1995, 1998). A key idea of the nonlinear SVM classifier is to map the inputs to a high-dimensional feature space where the classes are assumed to be linearly separable. In this high-dimensional space, a large margin classifier is constructed. By applying the Mercer condition, the classifier is obtained by solving a finite-dimensional QP problem in the dual space, which avoids explicit knowledge of the high-dimensional mapping and uses only the related kernel function. In Suykens and Vandewalle (1999), a least-squares type of SVM classifier (LS-SVM) was introduced by modifying the problem formulation so as to obtain a linear set of equations in the dual space. This is done by taking a least-squares cost function, with equality instead of inequality constraints.
The training of MLP classifiers is often done by using a regression approach with binary targets for solving the classification problem. This is also implicitly done in the LS-SVM formulation and has the advantage of allowing the derivation of analytic expressions within a Bayesian evidence framework, in contrast with the classification approaches used, for example, in GPs. As in ordinary ridge regression (Brown, 1977), no regularization is applied on the bias term in SVMs and LS-SVMs, which results in a centering in the kernel-induced feature space and allows relating the LS-SVM formulation to kernel Fisher discriminant analysis (Baudat & Anouar, 2000; Mika, Rätsch, & Müller, 2001). The corresponding eigenvalues of the centered kernel matrix are obtained from kernel PCA (Schölkopf, Smola, & Müller, 1998). When no bias term is used in the LS-SVM formulation, similar expressions are obtained as with kernel ridge regression and gaussian processes (GPs) for regression (Gibbs, 1997; Neal, 1997; Rasmussen, 1996; Williams, 1998). In this article, a Bayesian framework is derived for the LS-SVM formulation starting from the SVM and LS-SVM feature space formulation, while the corresponding analytic expressions in the dual space are similar, up to the centering, to the expressions obtained for GPs. The primal-dual interpretations and equality constraints of LS-SVMs have also allowed extending the LS-SVM framework to recurrent networks and optimal control (Suykens & Vandewalle, 2000; Suykens, Vandewalle, & De Moor, 2001). The regression formulation allows deriving analytic expressions in order to infer the model parameters, hyperparameters, and kernel parameters on the corresponding three levels of Bayesian inference, respectively. Posterior class probabilities of the LS-SVM classifier are obtained by marginalizing over the model parameters within the evidence framework.

In section 2, links between kernel-based classification techniques are discussed. The three levels of inference are described in sections 3, 4, and 5. The design strategy is explained in section 6. Empirical results are discussed in section 7.

2 Kernel-Based Classification Techniques
Given a binary classification problem with classes C_+ and C_-, with corresponding class labels y = ±1, the classification task is to assign a class label to a given new input vector x ∈ R^n. Applying Bayes' formula, one can calculate the posterior class probability:

\[ P(y \mid x) = \frac{p(x \mid y)\, P(y)}{p(x)}, \tag{2.1} \]

where P(y) is the (discrete) a priori probability of the classes and p(x | y) is the (continuous) probability of observing x when corresponding to class label y. The denominator p(x) follows from normalization. The class label is then assigned to the class with maximum posterior probability:

\[ y(x) = \mathrm{sign}[g_0(x)] = \mathrm{sign}[P(y = +1 \mid x) - P(y = -1 \mid x)] \tag{2.2} \]

or

\[ y(x) = \mathrm{sign}[g_1(x)] = \mathrm{sign}[\log(p(x \mid y = +1)\, P(y = +1)) - \log(p(x \mid y = -1)\, P(y = -1))]. \tag{2.3} \]
Given g_0(x), one obtains the posterior class probabilities from P(y = +1 | x) = (1 + g_0(x))/2 and P(y = -1 | x) = (1 - g_0(x))/2 (Duda & Hart, 1973). When the densities p(x | y = +1) and p(x | y = -1) have a multivariate normal distribution with the same covariance matrix Σ and corresponding means m_+ and m_-, respectively, the Bayesian decision rule, equation 2.3, becomes the linear discriminant function,

\[ y(x) = \mathrm{sign}[w^T x + b], \tag{2.4} \]

with w = Σ^{-1}(m_+ - m_-) and b = -w^T (m_+ + m_-)/2 + log(P(y = +1)) - log(P(y = -1)) (Bishop, 1995; Duda & Hart, 1973).

In practice, the class covariance matrix Σ and the means m_+ and m_- are not known, and the linear classifier w^T x + b has to be estimated from given data D = {(x_i, y_i)}_{i=1}^N that consist of N_+ positive and N_- negative labels. The corresponding sets of indices with positive and negative labels are denoted by I_+ and I_-, with the full index set I equal to I = I_+ ∪ I_- = {1, ..., N}. Some well-known algorithms to estimate the discriminant vector w and bias term b are Fisher discriminant analysis, the support vector machine (SVM) classifier, and a regression approach with binary targets y_i = ±1. However, when the class densities are not normally distributed with the same covariance matrix, the optimal decision boundary typically is no longer linear (Bishop, 1995; Duda & Hart, 1973).

A nonlinear decision boundary in the input space can be obtained by applying the kernel trick: the input vector x ∈ R^n is mapped in a nonlinear way to the high (possibly infinite) dimensional feature vector φ(x) ∈ R^{n_f}, where the nonlinear function φ(·): R^n → R^{n_f} is related to the symmetric, positive definite kernel function,

\[ K(x_1, x_2) = \varphi(x_1)^T \varphi(x_2), \tag{2.5} \]

from Mercer's theorem (Cristianini & Shawe-Taylor, 2000; Smola, Schölkopf, & Müller, 1998; Vapnik, 1995, 1998). In this high-dimensional feature space, a linear separation is made. For the kernel function K, one typically has the following choices: K(x, x_i) = x_i^T x (linear kernel); K(x, x_i) = (x_i^T x + 1)^d (polynomial kernel of degree d ∈ N); K(x, x_i) = exp{-‖x - x_i‖_2^2 / σ^2} (RBF kernel); or K(x, x_i) = tanh(κ x_i^T x + θ) (MLP kernel). Notice that the Mercer condition holds for all σ ∈ R values in the RBF case and all d values in the polynomial case, but not for all possible choices of κ, θ ∈ R in the MLP case. Combinations of kernels can be obtained by stacking the corresponding feature vectors. The classification problem now is assumed to be linear in the feature space, and the classifier takes the form

\[ y(x) = \mathrm{sign}[w^T \varphi(x) + b], \tag{2.6} \]
where w and b are obtained by applying the kernel version of the abovementioned algorithms, where typically a regularization term w^T w / 2 is introduced in order to avoid overfitting (large margin 2 / (w^T w)) in the high (and possibly infinite) dimensional feature space. On the other hand, the classifier, equation 2.6, is never evaluated in this form, and the Mercer condition, equation 2.5, is applied instead. In the remainder of this section, the links between the different kernel-based classification algorithms are discussed.

2.1 SVM Classifiers. Given the training data {(x_i, y_i)}_{i=1}^N with input data x_i ∈ R^n and corresponding binary class labels y_i ∈ {-1, +1}, the SVM classifier, according to Vapnik's original formulation (Vapnik, 1995, 1998), incorporates the following constraints (i = 1, ..., N):

\[ \begin{cases} w^T \varphi(x_i) + b \ge +1, & \text{if } y_i = +1 \\ w^T \varphi(x_i) + b \le -1, & \text{if } y_i = -1, \end{cases} \tag{2.7} \]

which is equivalent to y_i [w^T φ(x_i) + b] ≥ 1 (i = 1, ..., N). The classification problem is formulated as follows:

\[ \min_{w, b, \xi}\; J_1(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^N \xi_i \tag{2.8} \]

\[ \text{subject to} \quad \begin{cases} y_i \bigl[ w^T \varphi(x_i) + b \bigr] \ge 1 - \xi_i, & i = 1, \ldots, N \\ \xi_i \ge 0, & i = 1, \ldots, N. \end{cases} \tag{2.9} \]
This optimization problem is solved in its dual form, and the resulting classifier, equation 2.6, is evaluated in its dual representation. The variables ξ_i are slack variables that are needed in order to allow misclassifications in the set of inequalities (e.g., due to overlapping distributions). The positive real constant C should be considered as a tuning parameter in the algorithm. More details on SVMs can be found in Cristianini and Shawe-Taylor (2000), Smola et al. (1998), and Vapnik (1995, 1998). Observe that no regularization is applied on the bias term b.

2.2 LS-SVM Classifiers. In Suykens and Vandewalle (1999), the SVM classifier formulation was modified basically as follows:

\[ \min_{w, b, e_c}\; J_{2c}(w, e_c) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_{c,i}^2 \tag{2.10} \]

\[ \text{subject to} \quad y_i \bigl[ w^T \varphi(x_i) + b \bigr] = 1 - e_{c,i}, \quad i = 1, \ldots, N. \tag{2.11} \]
Besides the quadratic cost function, an important difference with standard SVMs is that the formulation now consists of equality instead of inequality constraints. The LS-SVM classifier formulation, equations 2.10 and 2.11, implicitly corresponds to a regression interpretation with binary targets y_i = ±1. By multiplying the error e_{c,i} with y_i and using y_i^2 = 1, the sum squared error term Σ_{i=1}^N e_{c,i}^2 becomes

\[ \sum_{i=1}^N e_{c,i}^2 = \sum_{i=1}^N (y_i e_{c,i})^2 = \sum_{i=1}^N e_i^2 = \sum_{i=1}^N \bigl( y_i - (w^T \varphi(x_i) + b) \bigr)^2, \tag{2.12} \]

with

\[ e_i = y_i - (w^T \varphi(x_i) + b). \tag{2.13} \]
Hence, the LS-SVM classifier formulation is equivalent to

\[ J_2(w, b) = \mu E_W + \zeta E_D, \tag{2.14} \]

with

\[ E_W = \frac{1}{2} w^T w, \tag{2.15} \]

\[ E_D = \frac{1}{2} \sum_{i=1}^N e_i^2 = \frac{1}{2} \sum_{i=1}^N \bigl( y_i - [w^T \varphi(x_i) + b] \bigr)^2. \tag{2.16} \]
Both μ and ζ should be considered as hyperparameters in order to tune the amount of regularization versus the sum squared error. The solution of equation 2.14 depends only on the ratio γ = ζ/μ. Therefore, the original formulation (Suykens & Vandewalle, 1999) used only γ as tuning parameter. The use of both parameters μ and ζ will become clear in the Bayesian interpretation of the LS-SVM cost function, equation 2.14, in the next sections. Observe that no regularization is applied to the bias term b, which is the preferred form for ordinary ridge regression (Brown, 1977).

The regression approach with binary targets is a common approach for training MLP classifiers and also for the simpler case of linear discriminant analysis (Bishop, 1995). Defining the MSE error between w^T φ(x) + b and the Bayes discriminant g_0(x) from equation 2.2,

\[ \mathrm{MSE} = \int \bigl[ w^T \varphi(x) + b - g_0(x) \bigr]^2 p(x)\, dx, \tag{2.17} \]
it has been shown (Duda & Hart, 1973) that minimization of E_D in equation 2.14 is asymptotically (N → ∞) equivalent to minimizing equation 2.17. Hence, the regression formulation with binary targets yields asymptotically the best approximation to the Bayes discriminant, equation 2.2, in the least-squares sense (Duda & Hart, 1973). Such an approximation typically gives good results but may be suboptimal since the misclassification risk is not directly minimized.

The solution of the LS-SVM regressor is obtained after constructing the Lagrangian L(w, b, e; α) = J_2(w, e) - Σ_{i=1}^N α_i { y_i - [w^T φ(x_i) + b] - e_i }, where α_i ∈ R are the Lagrange multipliers. The conditions for optimality are:

\[ \begin{cases} \dfrac{\partial L}{\partial w} = 0 & \rightarrow\; w = \sum_{i=1}^N \alpha_i \varphi(x_i) \\[4pt] \dfrac{\partial L}{\partial b} = 0 & \rightarrow\; \sum_{i=1}^N \alpha_i = 0 \\[4pt] \dfrac{\partial L}{\partial e_i} = 0 & \rightarrow\; \alpha_i = \gamma e_i, \quad i = 1, \ldots, N \\[4pt] \dfrac{\partial L}{\partial \alpha_i} = 0 & \rightarrow\; y_i - \bigl[ w^T \varphi(x_i) + b \bigr] - e_i = 0, \quad i = 1, \ldots, N. \end{cases} \tag{2.18} \]
As in standard SVMs, we never calculate w or φ(x_i). Therefore, we eliminate w and e, yielding a linear Karush-Kuhn-Tucker system instead of a QP problem:

\[ \begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix} \tag{2.19} \]

with

\[ Y = [y_1; \ldots; y_N], \quad 1_v = [1; \ldots; 1], \quad e = [e_1; \ldots; e_N], \quad \alpha = [\alpha_1; \ldots; \alpha_N], \tag{2.20} \]
and where Mercer's condition, equation 2.5, is applied within the kernel matrix Ω ∈ R^{N×N},

\[ \Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j). \tag{2.21} \]

The LS-SVM classifier is then constructed as follows:

\[ y(x) = \mathrm{sign}\left[ \sum_{i=1}^N \alpha_i K(x, x_i) + b \right]. \tag{2.22} \]
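The dual solution can be made concrete in a few lines of code. The following is an illustrative sketch (not the authors' implementation): it solves the KKT system, equation 2.19, and evaluates the regression-form prediction ŷ(x) = Σ_i α_i K(x, x_i) + b, assuming an RBF kernel and hypothetical tuning values γ = 10, σ = 1 on a toy two-cluster problem.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # K(x, z) = exp(-||x - z||_2^2 / sigma^2), the RBF kernel from section 2.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    # Solve the linear KKT system of equation 2.19:
    # [[0, 1_v^T], [1_v, Omega + I_N / gamma]] [b; alpha] = [0; Y].
    N = X.shape[0]
    Omega = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]          # alpha, b

def lssvm_predict(Xnew, alpha, b, Xtrain, sigma=1.0):
    # Equation 2.22: y(x) = sign(sum_i alpha_i K(x, x_i) + b).
    return np.sign(rbf_kernel(Xnew, Xtrain, sigma) @ alpha + b)

# Toy problem: two well-separated gaussian clusters.
rng = np.random.default_rng(0)
Xpos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
Xneg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([Xpos, Xneg])
y = np.concatenate([np.ones(20), -np.ones(20)])

alpha, b = lssvm_train(X, y)
acc = float((lssvm_predict(X, alpha, b, X) == y).mean())
```

Solving one (N+1)-dimensional linear system replaces the QP of the standard SVM; this is the practical appeal of the equality-constraint formulation. Note that the first row of the system enforces the condition Σ_i α_i = 0 from equation 2.18.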
In numerical linear algebra, efficient algorithms exist for solving large-scale linear systems (Golub & Van Loan, 1989). The system, equation 2.19, can be reformulated into two linear systems with positive definite data matrices, so as to apply iterative methods such as the Hestenes-Stiefel conjugate gradient algorithm (Suykens, 2000). LS-SVM classifiers can be extended to multiple classes by defining additional output variables. Although sparseness is lost due to the use of a 2-norm, a sparse approximation of the LS-SVM can be obtained by sequentially pruning the support value spectrum (Suykens, 2000) without loss of generalization performance.

2.3 Gaussian Processes for Regression. When one uses no bias term b in the regression formulation (Cristianini & Shawe-Taylor, 2000; Saunders, Gammerman, & Vovk, 1998), the support values α* are obtained from the linear system,

\[ \left( \Omega + \frac{\mu}{\zeta} I_N \right) \alpha^* = Y. \tag{2.23} \]

The output of the LS-SVM regressor for a new input x is given by

\[ \hat{y}(x) = \sum_{i=1}^N \alpha_i^* K(x, x_i) = \theta(x)^T \alpha^*, \tag{2.24} \]
with θ(x) = [K(x, x_1); ...; K(x, x_N)]. For classification purposes, one can use the interpretation of an optimal least-squares approximation, equation 2.17, to the Bayesian decision rule, and the class label is assigned as follows: y = sign[ŷ(x)].

Observe that the result of equation 2.24 is equivalent to the gaussian process (GP) formulation (Gibbs, 1997; Neal, 1997; Rasmussen, 1996; Sollich, 2000; Williams, 1998; Williams & Barber, 1998) for regression. In GPs, one assumes that the data are generated as y_i = ŷ(x_i) + e_i. Given N data points {(x_i, y_i)}_{i=1}^N, the predictive mean for a new input x is given by ŷ(x) = θ(x)^T C_N^{-1} Y, with θ(x) = [C(x, x_1); ...; C(x, x_N)] and the matrix C_N ∈ R^{N×N} with C_{N,ij} = C(x_i, x_j), where C(x_i, x_j) is the parameterized covariance function,

\[ C(x_i, x_j) = \frac{1}{\mu} K(x_i, x_j) + \frac{1}{\zeta} \delta_{ij}, \tag{2.25} \]

with δ_{ij} the Kronecker delta and i, j = 1, ..., N. The predictive mean is obtained as

\[ \hat{y}(x) = \frac{1}{\mu}\, \theta(x)^T \left( \frac{1}{\mu} \Omega + \frac{1}{\zeta} I_N \right)^{-1} Y. \tag{2.26} \]
By combining equations 2.23 and 2.24, one also obtains equation 2.26. The regularization term E_W is related to the covariance matrix of the inputs, while the error term E_D yields a ridge regression estimate in the dual space (Saunders et al., 1998; Suykens & Vandewalle, 1999; Suykens, 2000). With the results of the next sections, one can also show that the expression for the variance in GPs is equal to the expressions for the LS-SVM without the bias term. Compared with the GP classifier formulation, the regression approach allows the derivation of analytic expressions on all three levels of inference. In GPs, one typically uses combinations of kernel functions (Gibbs, 1997; Neal, 1997; Rasmussen, 1996; Williams & Barber, 1998), while a positive constant is added when there is a bias term in the regression function. In Neal (1997), the hyperparameters of the covariance function C and the variance 1/ζ of the noise e_i are obtained from a sampled posterior distribution of the hyperparameters. Evidence maximization is used in Gibbs (1997) to infer the hyperparameters on the second level. In this article, the bias term b is inferred on the first level, while μ and ζ are obtained from a scalar optimization problem on the second level. Kernel parameters are determined on the third level of inference. Although the results from the LS-SVM formulation without bias term and gaussian processes are identical, LS-SVMs explicitly formulate a model in the primal space. The resulting support values α_i of the model give further insight into the importance of each data point and can be used to obtain sparseness and detect outliers. The explicit use of a model also allows defining, in a straightforward way, the effective number of parameters γ_eff in section 4. In the LS-SVM formulation, the bias term is considered a model parameter and is obtained on the first level of inference.
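The equivalence between the bias-free LS-SVM, equations 2.23 and 2.24, and the GP predictive mean, equation 2.26, rests on the identity (1/μ)(Ω/μ + I_N/ζ)^{-1} = (Ω + (μ/ζ) I_N)^{-1}, and is easy to verify numerically. A minimal sketch (illustrative values μ = 0.5, ζ = 20, RBF kernel with σ = 1; not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
Y = np.sign(X[:, 0] + X[:, 1])
mu, zeta = 0.5, 20.0

# RBF kernel matrix Omega (sigma = 1) and kernel vector theta(x) for a test point.
Omega = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
x = rng.normal(size=2)
theta = np.exp(-((x - X) ** 2).sum(-1))

# LS-SVM without bias term, equations 2.23 and 2.24.
alpha_star = np.linalg.solve(Omega + (mu / zeta) * np.eye(15), Y)
yhat_lssvm = theta @ alpha_star

# GP predictive mean with the covariance of equation 2.25, evaluated as equation 2.26.
C_N = Omega / mu + np.eye(15) / zeta
yhat_gp = (theta / mu) @ np.linalg.solve(C_N, Y)
```

Both routes produce the same prediction up to floating-point round-off, which is the equivalence stated in the text.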
As in ordinary ridge regression (Brown, 1977), no regularization is applied on the bias term b in LS-SVMs, and a zero-mean training set error is obtained from equation 2.18: Σ_{i=1}^N e_i = 0. It will become clear that the bias term also results in a centering of the Gram matrix in the feature space, as is done in kernel PCA (Schölkopf et al., 1998). The corresponding eigenvalues can be used to derive improved generalization bounds for SVM classifiers (Schölkopf, Shawe-Taylor, Smola, & Williamson, 1999). The use of the unregularized bias term also allows the derivation of explicit links with kernel Fisher discriminant analysis (Baudat & Anouar, 2000; Mika et al., 2001).

Figure 1: Two gaussian distributed classes with the same covariance matrix are separated by the hyperplane w^T φ(x) + b = 0 in the feature space. The class centers of classes -1 and +1 are located on the hyperplanes w^T φ(x) + b = -1 and w^T φ(x) + b = +1, respectively. The projections of the features onto the linear discriminant result in gaussian distributed errors with variance ζ^{-1} around the targets -1 and +1.

2.4 Regularized Kernel Fisher Discriminant Analysis. The main concern in Fisher discriminant analysis (Bishop, 1995; Duda & Hart, 1973) is to find a linear discriminant w that yields an optimal discrimination between the two classes C_+ and C_- depicted in Figure 1. A good discriminant maximizes the distance between the projected class centers and minimizes the overlap between both distributions. Given the estimated class centers m̂_+ = Σ_{i∈I_+} φ(x_i)/N_+ and m̂_- = Σ_{i∈I_-} φ(x_i)/N_-, one maximizes the squared distance (w^T(m̂_+ - m̂_-))^2 between the projected class centers and minimizes the regularized scatter s around the class centers,
\[ s = \sum_{i \in I_+} \bigl( w^T (\varphi(x_i) - \hat{m}_+) \bigr)^2 + \sum_{i \in I_-} \bigl( w^T (\varphi(x_i) - \hat{m}_-) \bigr)^2 + \gamma^{-1} w^T w, \tag{2.27} \]
where the regularization term γ^{-1} w^T w is introduced so as to avoid overfitting in the high-dimensional feature space. The scatter s is minimized so as to obtain a small overlap between the classes. The feature space expression for the regularized kernel Fisher discriminant is then found by maximizing

\[ \max_{w}\; J_{\mathrm{FDA}}(w) = \frac{\bigl( w^T (\hat{m}_+ - \hat{m}_-) \bigr)^2}{s} = \frac{w^T (\hat{m}_+ - \hat{m}_-)(\hat{m}_+ - \hat{m}_-)^T w}{w^T S_{W,\gamma}\, w}, \tag{2.28} \]

with S_{W,γ} = Σ_{i∈I_+} (φ(x_i) - m̂_+)(φ(x_i) - m̂_+)^T + Σ_{i∈I_-} (φ(x_i) - m̂_-)(φ(x_i) - m̂_-)^T + γ^{-1} I_{n_f}. The solution to the generalized Rayleigh quotient, equation 2.28, follows from a generalized eigenvalue problem in the feature space, (m̂_+ - m̂_-)(m̂_+ - m̂_-)^T w = λ S_{W,γ} w, from which one obtains

\[ w = S_{W,\gamma}^{-1} (\hat{m}_+ - \hat{m}_-). \tag{2.29} \]
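When the feature map is explicit, equation 2.29 can be evaluated directly. A small illustrative sketch (identity feature map φ(x) = x and a hypothetical γ = 100; not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two gaussian classes with the same covariance, identity feature map phi(x) = x.
Xp = rng.normal(loc=[1.0, 0.0], size=(30, 2))
Xn = rng.normal(loc=[-1.0, 0.0], size=(30, 2))
mp, mn = Xp.mean(0), Xn.mean(0)        # estimated class centers m_+, m_-

gamma = 100.0
# Regularized within-class scatter S_{W,gamma}, cf. equations 2.27 and 2.28.
S = (Xp - mp).T @ (Xp - mp) + (Xn - mn).T @ (Xn - mn) + np.eye(2) / gamma
# Equation 2.29: w = S_{W,gamma}^{-1} (m_+ - m_-).
w = np.linalg.solve(S, mp - mn)
```

With near-isotropic scatter, the resulting discriminant w points along the direction separating the two class means, as expected.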
As the mapping φ is typically unknown, practical expressions need to be derived in the dual space, for example, by solving a generalized eigenvalue problem (Baudat & Anouar, 2000; Mika et al., 2001). Also the SVM formulation has been related to Fisher discriminant analysis (Shashua, 1999). The bias term b is not determined by Fisher discriminant analysis. Fisher discriminant analysis is typically used as a first step, which yields the optimal linear discriminant between the two classes. The bias term b has to be determined in a second step so as to obtain an optimal classifier.

It can be easily shown in the feature space that the LS-SVM regression formulation, equation 2.14, yields the same discriminant vector w. Defining Υ = [φ(x_1), ..., φ(x_N)] ∈ R^{n_f × N}, the conditions for optimality in the primal space are

\[ \begin{bmatrix} \Upsilon \Upsilon^T + \gamma^{-1} I_{n_f} & \Upsilon 1_v \\ 1_v^T \Upsilon^T & N \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} \Upsilon Y \\ 1_v^T Y \end{bmatrix}. \tag{2.30} \]

From the second condition, we obtain b = -w^T (N_+ m̂_+ + N_- m̂_-)/N + (N_+ - N_-)/N. Substituting this into the first condition, one obtains

\[ w = 2 \frac{N_+ N_-}{N^2} \left( S_{W,\gamma} + \frac{N_+ N_-}{N^2} (\hat{m}_+ - \hat{m}_-)(\hat{m}_+ - \hat{m}_-)^T \right)^{-1} (\hat{m}_+ - \hat{m}_-), \]

which yields, up to a scaling constant, the same discriminant vector w as equation 2.29, since (m̂_+ - m̂_-)(m̂_+ - m̂_-)^T w ∝ (m̂_+ - m̂_-). In the regression formulation, the bias b is determined so as to obtain an optimal least-squares approximation, equation 2.17, for the discriminant function, equation 2.2.

3 Probabilistic Interpretation of the LS-SVM Classifier (Level 1)
A probabilistic framework is related to the LS-SVM classifier. The outline of our approach is similar to the work of Kwok (1999, 2000) for SVMs, but there are significant differences concerning the Bayesian interpretation of the cost function and the algebra involved for the computations in the feature space. First, Bayes' rule is applied in order to obtain the LS-SVM cost function. The moderated output is obtained by marginalizing over w and b.

3.1 Inference of the Model Parameters w and b. Given the data points D = {(x_i, y_i)}_{i=1}^N and the hyperparameters μ and ζ of the model H, the model parameters w and b are estimated by maximizing the posterior p(w, b | D, log μ, log ζ, H). Applying Bayes' rule at the first level (Bishop, 1995; MacKay, 1995), we obtain¹

\[ p(w, b \mid D, \log\mu, \log\zeta, H) = \frac{p(D \mid w, b, \log\mu, \log\zeta, H)\, p(w, b \mid \log\mu, \log\zeta, H)}{p(D \mid \log\mu, \log\zeta, H)}, \tag{3.1} \]

where the evidence p(D | log μ, log ζ, H) is a normalizing constant such that the integral over all possible w and b values is equal to 1. We assume a separable gaussian prior, which is independent of the hyperparameter ζ, that is, p(w, b | log μ, log ζ, H) = p(w | log μ, H) p(b | log σ_b, H), where σ_b → ∞ to approximate a uniform distribution. By the choice of the regularization term E_W in equation 2.15, we obtain for the prior with σ_b → ∞:

\[ p(w, b \mid \log\mu, H) = \left( \frac{\mu}{2\pi} \right)^{n_f/2} \exp\left( -\frac{\mu}{2} w^T w \right) \frac{1}{\sqrt{2\pi\sigma_b^2}} \exp\left( -\frac{b^2}{2\sigma_b^2} \right) \propto \left( \frac{\mu}{2\pi} \right)^{n_f/2} \exp\left( -\frac{\mu}{2} w^T w \right). \tag{3.2} \]
To simplify the notation, the step of taking the limit σ_b → ∞ is already made in the remainder of this article.

The probability p(D | w, b, log μ, log ζ, H) is assumed to depend only on w, b, ζ, and H. We assume that the data points are independent: p(D | w, b, log ζ, H) = Π_{i=1}^N p(x_i, y_i | w, b, log ζ, H). In order to obtain the least-squares cost function, equation 2.16, from the LS-SVM formulation, it is assumed that the probability of a data point is proportional to

\[ p(x_i, y_i \mid w, b, \log\zeta, H) \propto p(e_i \mid w, b, \log\zeta, H), \tag{3.3} \]

where the normalizing constant is independent of w and b. A gaussian distribution is taken for the errors e_i = y_i - (w^T φ(x_i) + b) from equation 2.13:

\[ p(e_i \mid w, b, \log\zeta, H) = \sqrt{\frac{\zeta}{2\pi}} \exp\left( -\frac{\zeta e_i^2}{2} \right). \tag{3.4} \]
An appealing way to interpret this probability is depicted in Figure 1. It is assumed that w and b are determined in such a way that the class

¹ The notation p(· | ·, log μ, log ζ, ·) used here is somewhat different from the notation p(· | ·, μ, ζ, ·) used in MacKay (1995). We prefer this notation since μ and ζ are (positive) scale parameters (Gull, 1988). By doing so, a uniform notation over the three levels of inference is obtained. The change in notation does not affect the results.
centers m̂_- and m̂_+ are mapped onto the targets -1 and +1, respectively. The projections w^T φ(x) + b of the class elements φ(x) of the multivariate gaussian distributions are then normally distributed around the corresponding targets with variance 1/ζ. One can then write p(x_i, y_i | w, b, ζ, H) = p(x_i | y_i, w, b, ζ, H) P(y_i) = p(e_i | w, b, ζ, H) P(y_i), where the errors e_i = y_i - (w^T φ(x_i) + b) are obtained by projecting the feature vector φ(x_i) onto the discriminant function w^T φ(x_i) + b and comparing the projection with the target y_i. Given the binary targets y_i ∈ {-1, +1}, the error e_i is a function of the input x_i in the classifier interpretation. Assuming a multivariate gaussian distribution of the feature vector φ(x_i) in the feature space, the errors e_i are also gaussian distributed, as is depicted in Figure 1 (Bishop, 1995; Duda & Hart, 1973). However, the assumptions that w^T m̂_- + b = -1 and w^T m̂_+ + b = +1 may not always hold and will be relaxed in the next section.

By combining equations 3.2 and 3.4 and neglecting all constants, Bayes' rule, equation 3.1, for the first level of inference becomes

\[ p(w, b \mid D, \log\mu, \log\zeta, H) \propto \exp\left( -\frac{\mu}{2} w^T w - \frac{\zeta}{2} \sum_{i=1}^N e_i^2 \right) = \exp(-J_2(w, b)). \tag{3.5} \]
The maximum a posteriori estimates w_MP and b_MP are then obtained by minimizing the negative logarithm of equation 3.5. In the dual space, this corresponds to solving the linear set of equations 2.19. The cost function, equation 2.14, is quadratic in w and b and can be related to the posterior

\[ p(w, b \mid D, \mu, \zeta, H) = \frac{1}{\sqrt{(2\pi)^{n_f + 1} \det Q}} \exp\left( -\frac{1}{2} g^T Q^{-1} g \right), \tag{3.6} \]

with² g = [w - w_MP; b - b_MP] and Q = covar(w, b) = E(g g^T), taking the expectation over w and b. The covariance matrix Q is related to the Hessian H of equation 2.14:

\[ Q = H^{-1} = \begin{bmatrix} H_{11} & H_{12} \\ H_{12}^T & H_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \dfrac{\partial^2 J_2}{\partial w^2} & \dfrac{\partial^2 J_2}{\partial w\, \partial b} \\[6pt] \dfrac{\partial^2 J_2}{\partial b\, \partial w} & \dfrac{\partial^2 J_2}{\partial b^2} \end{bmatrix}^{-1}. \tag{3.7} \]

When using MLPs, the cost function is typically nonconvex, and the covariance matrix is estimated using a quadratic approximation in the local optimum (MacKay, 1995).

² The Matlab notation [X; Y] is used, where [X; Y] = [X^T Y^T]^T.
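For an explicit finite-dimensional feature map, the blocks of the Hessian in equation 3.7 follow from J_2 = (μ/2) w^T w + (ζ/2) Σ_i (y_i − w^T φ(x_i) − b)² as H_11 = μ I_{n_f} + ζ Φ^T Φ, H_12 = ζ Φ^T 1_v, and H_22 = ζ N, with Φ ∈ R^{N×n_f} the matrix of feature vectors; setting the gradient of J_2 to zero gives H [w; b] = ζ [Φ^T Y; 1_v^T Y]. The sketch below (toy random features, hypothetical values μ = 1 and ζ = 5; not the authors' code) builds H, solves for the MAP estimate, and checks that the gradient of J_2 vanishes there, so that Q = H^{-1} is the posterior covariance of equation 3.6:

```python
import numpy as np

rng = np.random.default_rng(3)
N, nf = 40, 3
Phi = rng.normal(size=(N, nf))     # rows are the feature vectors phi(x_i)^T
y = np.sign(Phi[:, 0])
mu, zeta = 1.0, 5.0

# Hessian of J_2(w, b), cf. equation 3.7.
H = np.zeros((nf + 1, nf + 1))
H[:nf, :nf] = mu * np.eye(nf) + zeta * Phi.T @ Phi
H[:nf, nf] = zeta * Phi.sum(0)
H[nf, :nf] = zeta * Phi.sum(0)
H[nf, nf] = zeta * N

# MAP estimate: H [w; b] = zeta [Phi^T y; 1_v^T y]; Q = H^{-1}.
rhs = zeta * np.concatenate([Phi.T @ y, [y.sum()]])
sol = np.linalg.solve(H, rhs)
w_mp, b_mp = sol[:nf], sol[nf]
Q = np.linalg.inv(H)

# Gradient of J_2 at (w_MP, b_MP): should vanish.
res = y - Phi @ w_mp - b_mp
grad_w = mu * w_mp - zeta * Phi.T @ res
grad_b = -zeta * res.sum()
```

Because J_2 is strictly convex, H is symmetric positive definite, and Q inherits positive definiteness; this is what makes the gaussian posterior of equation 3.6 well defined.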
3.2 Class Probabilities for the LS-SVM Classifier. Given the posterior probability, equation 3.6, of the model parameters w and b, we will now integrate over all w and b values in order to obtain the posterior class probability P(y | x, D, μ, ζ, H). First, it should be remarked that the assumptions w^T m̂_+ + b = +1 and w^T m̂_- + b = -1 may not always be satisfied. This typically occurs when the training set is unbalanced (N_+ ≠ N_-). In this case, the discriminant shifts toward the a priori most likely class so as to yield the optimal least-squares approximation (see equation 2.17). Therefore, we will use

\[ p(x \mid y = \pm 1, w, b, \log\zeta_\pm, H) = \sqrt{\frac{\zeta_\pm}{2\pi}} \exp\left( -\frac{\zeta_\pm \bigl( w^T (\varphi(x) - \hat{m}_\pm) \bigr)^2}{2} \right) = \sqrt{\frac{\zeta_\pm}{2\pi}} \exp\left( -\frac{\zeta_\pm e_\pm^2}{2} \right) \tag{3.8} \]

with e_± = w^T (φ(x) - m̂_±) by definition, where ζ_±^{-1} is the variance of e_±. The subscript ± is used to denote either + or -, since analogous expressions are obtained for the classes C_+ and C_-, respectively. In this article, we assume ζ_+ = ζ_- = ζ_*.
Since e_± is a linear combination of the gaussian distributed w, marginalizing over w will yield a gaussian distributed e_± with mean m_{e±} and variance σ_{e±}². The expression for the mean is

\[ m_{e\pm} = w_{MP}^T (\varphi(x) - \hat{m}_\pm) = \sum_{i=1}^N \alpha_i K(x, x_i) - \hat{m}_{d\pm}, \tag{3.9} \]

with m̂_{d±} = (1/N_±) Σ_{i=1}^N Σ_{j∈I_±} α_i K(x_i, x_j), while the corresponding expression for the variance is

\[ \sigma_{e\pm}^2 = [\varphi(x) - \hat{m}_\pm]^T Q_{11} [\varphi(x) - \hat{m}_\pm]. \tag{3.10} \]
The expression for the upper left n_f × n_f block Q_{11} of the covariance matrix Q is derived in appendix A. By using matrix algebra and applying the Mercer condition, we obtain

\[ \begin{aligned} \sigma_{e\pm}^2 = {}& \mu^{-1} K(x, x) - 2\mu^{-1} N_\pm^{-1} \sum_{i \in I_\pm} K(x, x_i) + \mu^{-1} N_\pm^{-2}\, 1_v^T \Omega(I_\pm, I_\pm)\, 1_v \\ & - \Bigl[ \theta(x)^T - \tfrac{1}{N_\pm} 1_v^T \Omega(I_\pm, I) \Bigr] M U_G \bigl[ \mu^{-1} I_{N_{\mathrm{eff}}} - (\mu I_{N_{\mathrm{eff}}} + \zeta D_G)^{-1} \bigr] U_G^T M \Bigl[ \theta(x) - \tfrac{1}{N_\pm} \Omega(I, I_\pm)\, 1_v \Bigr], \end{aligned} \tag{3.11} \]

where 1_v is a vector of appropriate dimensions with all elements equal to one and where we used the Matlab index notation X(I_a, I_b), which selects the corresponding rows I_a and columns I_b of the matrix X. The vector θ(x) ∈ R^N and the matrices U_G ∈ R^{N×N_eff} and D_G ∈ R^{N_eff×N_eff} are defined as follows:

\[ \theta_i(x) = K(x, x_i), \quad i = 1, \ldots, N \tag{3.12} \]

\[ U_G(:, i) = \lambda_{G,i}^{-1/2}\, v_{G,i}, \quad i = 1, \ldots, N_{\mathrm{eff}} \le N - 1 \tag{3.13} \]

\[ D_G = \mathrm{diag}([\lambda_{G,1}, \ldots, \lambda_{G,N_{\mathrm{eff}}}]), \tag{3.14} \]

where v_{G,i} and λ_{G,i} are the solutions to the eigenvalue problem (see equation A.4)

\[ M \Omega M v_{G,i} = \lambda_{G,i} v_{G,i}, \quad i = 1, \ldots, N_{\mathrm{eff}} \le N - 1, \tag{3.15} \]
with V_G = [v_{G,1}, ..., v_{G,N_eff}] ∈ R^{N×N_eff}. The vector Y and the matrix Ω are defined in equations 2.20 and 2.21, respectively, while M ∈ R^{N×N} is the idempotent centering matrix M = I_N - (1/N) 1_v 1_v^T with rank N - 1. The number of nonzero eigenvalues is denoted by N_eff < N. For rather large data sets, one may choose to reduce the computational requirements and approximate the variance σ_{e±}² by using only the most significant eigenvalues (λ_{G,i} ≫ μ/ζ) in the above expressions. In this case, N_eff denotes the number of most significant eigenvalues (see appendix A for details).

The conditional probabilities p(x | y = +1, D, log μ, log ζ, log ζ_*, H) and p(x | y = -1, D, log μ, log ζ, log ζ_*, H) are then equal to

\[ p(x \mid y = \pm 1, D, \log\mu, \log\zeta, \log\zeta_*, H) = \bigl( 2\pi (\zeta_*^{-1} + \sigma_{e\pm}^2) \bigr)^{-1/2} \exp\left( -\frac{m_{e\pm}^2}{2 (\zeta_*^{-1} + \sigma_{e\pm}^2)} \right) \tag{3.16} \]

with the subscript either + or -, respectively. By applying Bayes' rule, equation 2.1, the following class probabilities of the LS-SVM classifier are obtained:

\[ P(y \mid x, D, \log\mu, \log\zeta, \log\zeta_*, H) = \frac{P(y)\, p(x \mid y, D, \log\mu, \log\zeta, \log\zeta_*, H)}{p(x \mid D, \log\mu, \log\zeta, \log\zeta_*, H)}, \tag{3.17} \]
where the denominator p(x | D, log μ, log ζ, log ζ_*, H) = P(y = +1) p(x | y = +1, D, log μ, log ζ, log ζ_*, H) + P(y = −1) p(x | y = −1, D, log μ, log ζ, log ζ_*, H) follows from normalization. Substituting expression 3.16 for y = +1 and y = −1 into expression 3.17, a quadratic expression is obtained since
1130
T. Van Gestel, et al.
ζ_*⁻¹ + σ_{e+}² ≠ ζ_*⁻¹ + σ_{e−}². When σ_{e+}² ≃ σ_{e−}², one can define σ_e² = √(σ_{e+}² σ_{e−}²), and one obtains the linear discriminant function

    y(x) = sign[ Σ_{i=1}^N α_i K(x, x_i) − (m̂_{d+} + m̂_{d−})/2 + ((ζ_*⁻¹ + σ_e²)/(m̂_{d+} − m̂_{d−})) log( P(y = +1) / P(y = −1) ) ].    (3.18)
The second and third terms in equation 3.18 correspond to the bias term b in the LS-SVM classifier formulation, equation 2.22, where the bias term was determined to obtain an optimal least-squares approximation to the Bayes discriminant. The decision rules, equations 3.17 and 3.18, allow taking prior class probabilities into account in a more elegant way. This also allows adjusting the bias term for classification problems with different prior class probabilities in the training and test set. Due to the marginalization over w, the bias term correction is also a function of the input x, since σ_e² is a function of x. The idea of a (static) bias term correction has also been applied in Evgeniou, Pontil, Papageorgiou, and Poggio (2000) in order to improve the validation set performance. In Mukherjee et al. (1999), the probabilities p(e | y, w_MP, b_MP, μ, ζ, H) were estimated using leave-one-out cross-validation given the obtained SVM classifier, and the corresponding classifier decision was made in a similar way as in equation 3.18. A simple density estimation algorithm was used, and no gaussian assumptions were made, while no marginalization over the model parameters was performed. A bias term correction was also applied in the softmax interpretation for the SVM output (Platt, 1999) using a validation set. Given the asymptotically optimal least-squares approximation, one can approximate the class probabilities P(y = +1 | x, w, b, D, log μ, log ζ, H) = (1 + g₀(x))/2 by replacing g₀(x) with w_MPᵀ ϕ(x) + b_MP for the LS-SVM formulation. However, such an approach does not yield true probabilities that are between 0 and 1 and sum to 1. Using a softmax function (Bishop, 1995; MacKay, 1992, 1995), one obtains P(y = +1 | x, w, b, D, log μ, log ζ, H) = (1 + exp(−(wᵀϕ(x) + b)))⁻¹ and P(y = −1 | x, w, b, D, log μ, log ζ, H) = (1 + exp(+(wᵀϕ(x) + b)))⁻¹.
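The effect of the prior-dependent bias correction in equation 3.18 is easy to illustrate numerically. In the sketch below, the kernel-expansion score, the projected class centers (the paper's m̂_{d+}, m̂_{d−}), and the variance factor ζ_*⁻¹ + σ_e² are all illustrative values, not quantities computed from the paper's data:

```python
import numpy as np

def discriminant(score, m_plus, m_minus, var, p_plus):
    """Sign decision of equation 3.18. score = sum_i alpha_i K(x, x_i);
    m_plus, m_minus play the role of the projected class centers, and
    var stands for zeta_*^{-1} + sigma_e^2 (illustrative values only)."""
    bias = -(m_plus + m_minus) / 2.0
    prior = var / (m_plus - m_minus) * np.log(p_plus / (1.0 - p_plus))
    return np.sign(score + bias + prior)

# A point whose kernel expansion lands exactly between the class centers:
d_balanced = discriminant(0.0, 1.0, -1.0, 0.5, p_plus=0.5)  # on the boundary
d_skewed = discriminant(0.0, 1.0, -1.0, 0.5, p_plus=0.9)    # prior pulls it to +1
```

With equal priors the log-odds term vanishes and the point sits on the decision boundary; with P(y = +1) = 0.9 the boundary shifts toward the negative class and the same point is labeled +1.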
In order to marginalize over the model parameters in the logistic functions, one can use the approximate expressions of MacKay (1992, 1995) in combination with the expression for the moderated output of the LS-SVM regressor derived in Van Gestel et al. (2001). In the softmax interpretation for SVMs (Platt, 1999), no marginalization over the model parameters is applied, and the bias term is determined on a validation set. Finally, because of the equivalence between classification costs and prior probabilities (Duda & Hart, 1973), the results for the moderated output of the LS-SVM classifier can be extended in a straightforward way in order to take different classification costs into account.
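All of the level 1 quantities above (θ(x), U_G, D_G, V_G) derive from the eigenvalue problem of equation 3.15 for the centered kernel matrix. A minimal NumPy sketch of that decomposition; the RBF kernel and the random data are illustrative choices, not taken from the paper:

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Omega with Omega_ij = K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / sigma**2)

def centered_kernel_eig(Omega):
    """Eigenpairs of M Omega M (equation 3.15), with M = I_N - (1/N) 1 1^T."""
    N = Omega.shape[0]
    M = np.eye(N) - np.ones((N, N)) / N
    lam, V = np.linalg.eigh(M @ Omega @ M)      # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]              # sort descending
    keep = lam > 1e-10                          # the N_eff nonzero eigenvalues
    return lam[keep], V[:, keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
lam, V = centered_kernel_eig(rbf_kernel(X, sigma=1.0))
```

Since M has rank N − 1 and the RBF kernel matrix is positive semidefinite, at most N − 1 eigenvalues are nonzero, matching the bound N_eff ≤ N − 1 stated with equation 3.15.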
For large-scale data sets, the computation of the eigenvalue decomposition (see equation 3.15) may require long computations, and one may choose to compute only the largest eigenvalues and corresponding eigenvectors using an expectation-maximization approach (Rosipal & Girolami, 2001). This will result in an increased variance, as explained in appendix A. An alternative approach is to use the "cheap and cheerful" approach described in MacKay (1995).

4 Inference of the Hyperparameters μ and ζ (Level 2)
Bayes' rule is applied on the second level of inference to infer the most likely μ_MP and ζ_MP values from the given data D. The differences with the expressions obtained in MacKay (1995) are due to the fact that no regularization is applied on the bias term b and that all practical expressions are obtained in the dual space by applying the Mercer condition. Up to a centering, these expressions are similar to the expressions obtained with GP for regression. By combining the conditions for optimality, the minimization problem in μ and ζ is reformulated into a scalar minimization problem in γ = ζ/μ.

4.1 Inference of μ and ζ. In the second level of inference, Bayes' rule is applied to infer the most likely μ and ζ values from the data:
    p(log μ, log ζ | D, H) = p(D | log μ, log ζ, H) p(log μ, log ζ | H) / p(D | H) ∝ p(D | log μ, log ζ, H).    (4.1)
Because the hyperparameters μ and ζ are scale parameters (Gull, 1988), we take a uniform distribution in log μ and log ζ for the prior p(log μ, log ζ | H) = p(log μ | H) p(log ζ | H) in equation 4.1. The evidence p(D | H) is again a normalizing constant, which will be needed in level 3. The probability p(D | log μ, log ζ, H) is equal to the evidence in equation 3.1 of the previous level. Substituting equations 3.2, 3.4, and 3.6 into 4.1, we obtain

    p(log μ, log ζ | D, H) ∝ ( √(μ^{n_f} ζ^N) exp(−J₂(w, b)) ) / ( √(det H) exp(−½ gᵀHg) ) ∝ √( μ^{n_f} ζ^N / det H ) exp(−J₂(w_MP, b_MP)),

where J₂(w, b) = J₂(w_MP, b_MP) + ½ gᵀHg with g = [w − w_MP; b − b_MP]. The expression for det H is given in appendix B and is equal to det H = Nζ μ^{n_f−N_eff} ∏_{i=1}^{N_eff} (μ + ζλ_{G,i}), where the N_eff eigenvalues λ_{G,i} are the nonzero eigenvalues
of MΩM. Taking the negative logarithm of p(log μ, log ζ | D, H), the optimal parameters μ_MP and ζ_MP are found as the solution to the minimization problem:

    min_{μ,ζ} J₃(μ, ζ) = μ E_W(w_MP) + ζ E_D(w_MP, b_MP) + ½ Σ_{i=1}^{N_eff} log(μ + ζλ_{G,i}) − (N_eff/2) log μ − ((N − 1)/2) log ζ.    (4.2)
In appendix B it is also shown that the level 1 cost function evaluated in w_MP and b_MP can be written as μE_W(w_MP) + ζE_D(w_MP, b_MP) = ½ Yᵀ M (μ⁻¹ MΩM + ζ⁻¹ I_N)⁻¹ M Y. The cost function J₃ from equation 4.2 can then be written as

    min_{μ,ζ} J₃(μ, ζ) = ½ Yᵀ M (μ⁻¹ MΩM + ζ⁻¹ I_N)⁻¹ M Y + ½ log det( μ⁻¹ MΩM + ζ⁻¹ I_N ) − ½ log ζ⁻¹,    (4.3)
where the last term is due to the extra bias term b in the LS-SVM formulation. Neglecting the centering matrix M, the first two terms in equation 4.3 correspond to the level 2 cost function used in GP (Gibbs, 1997; Rasmussen, 1996; Williams, 1998). Hence, the use of the unregularized bias term b in the SVM and LS-SVM formulation results in a centering matrix M in the obtained expressions compared to GP. The eigenvalues λ_{G,i} of the centered Gram matrix are also used in kernel PCA (Schölkopf et al., 1998) and can also be used to infer improved error bounds for SVM classifiers (Schölkopf et al., 1999). In the Bayesian framework, the capacity is controlled by the prior. The effective number of parameters (Bishop, 1995; MacKay, 1995) is equal to γ_eff = Σ_i λ_{i,u}/λ_{i,r}, where λ_{i,u} and λ_{i,r} are the eigenvalues of the Hessians of the unregularized cost function (J_u = ζE_D) and the regularized cost function (J_r = μE_W + ζE_D), respectively. For the LS-SVM, the effective number of parameters is equal to
    γ_eff = 1 + Σ_{i=1}^{N_eff} ζ_MP λ_{G,i} / (μ_MP + ζ_MP λ_{G,i}) = 1 + Σ_{i=1}^{N_eff} γ_MP λ_{G,i} / (1 + γ_MP λ_{G,i}),    (4.4)

with γ = ζ/μ. The term +1 is obtained because no regularization on the bias term b is applied. Notice that since N_eff ≤ N − 1, the effective number of parameters γ_eff can never exceed the number of given training data points, γ_eff ≤ N, although we may choose a kernel function K with possibly n_f → ∞ degrees of freedom in the feature space.
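Equation 4.4 is a one-line computation once the eigenvalues λ_{G,i} are known. A small sketch with illustrative eigenvalues, checking the limiting behavior and the bound γ_eff ≤ N_eff + 1 ≤ N:

```python
import numpy as np

def gamma_eff(lam, gamma):
    """Effective number of parameters, equation 4.4:
    gamma_eff = 1 + sum_i gamma*lam_i / (1 + gamma*lam_i)."""
    return 1.0 + np.sum(gamma * lam / (1.0 + gamma * lam))

lam = np.array([5.0, 2.0, 0.5, 0.01])   # N_eff = 4 nonzero eigenvalues (illustrative)
g_strong = gamma_eff(lam, 1e-6)         # strong regularization: only the bias remains
g_weak = gamma_eff(lam, 1e6)            # weak regularization: -> N_eff + 1
g_mid = gamma_eff(lam, 1.0)
```

As γ → 0 only the unregularized bias term survives (γ_eff → 1), and as γ → ∞ every nonzero eigendirection counts as a full parameter (γ_eff → N_eff + 1).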
The gradient of the cost function J₃(μ, ζ) is (MacKay, 1992):

    ∂J₃/∂μ = E_W(w_MP) + ½ Σ_{i=1}^{N_eff} 1/(μ + ζλ_{G,i}) − N_eff/(2μ),    (4.5)
    ∂J₃/∂ζ = E_D(w_MP, b_MP) + ½ Σ_{i=1}^{N_eff} λ_{G,i}/(μ + ζλ_{G,i}) − (N − 1)/(2ζ).    (4.6)
Putting the partial derivatives 4.5 and 4.6 equal to zero, we obtain the following relations in the optimum of the level 2 cost function: 2μ_MP E_W(w_MP) = γ_eff − 1 and 2ζ_MP E_D(w_MP, b_MP) = N − γ_eff. The last equality can be viewed as the Bayesian estimate ζ_MP⁻¹ = Σ_{i=1}^N e_i² / (N − γ_eff) of the variance of the noise e_i. While this yields an implicit expression for the optimal ζ_MP for the regression formulation, this may not be equal to the variance ζ_*, since the targets ±1 do not necessarily correspond to the projected class centers. Therefore, we will use the estimate ζ_*⁻¹ = (N − γ_eff)⁻¹ ( Σ_{i∈I₊} e²_{i,+} + Σ_{i∈I₋} e²_{i,−} ) in the remainder of this article. Combining both relations, we obtain that for the optimal μ_MP, ζ_MP, and γ_MP = ζ_MP/μ_MP:

    2μ_MP [ E_W(w_MP) + γ_MP E_D(w_MP, b_MP) ] = N − 1.    (4.7)
4.2 A Scalar Optimization Problem in γ = ζ/μ. We reformulate the optimization problem, equation 4.2, in μ and ζ into a scalar optimization problem in γ = ζ/μ. To do so, we first replace that optimization problem by an optimization problem in μ and γ. We can use the fact that E_W(w_MP) and E_D(w_MP, b_MP) in the optimum of equation 2.14 depend only on γ. Since equation 4.7 also holds in the optimum, the search for the optimum can be restricted to this curve in the (μ, γ) space. By elimination of μ from equation 4.7, the following minimization problem is obtained in a straightforward way:

    min_γ J₄(γ) = Σ_{i=1}^{N−1} log( λ_{G,i} + 1/γ ) + (N − 1) log[ E_W(w_MP) + γ E_D(w_MP, b_MP) ],    (4.8)
with λ_{G,i} = 0 for i > N_eff. The derivative ∂J₄/∂γ is obtained in a similar way as ∂J₃/∂μ:

    ∂J₄/∂γ = − Σ_{i=1}^{N−1} 1/(γ + λ_{G,i}γ²) + (N − 1) E_D(w_MP, b_MP) / ( E_W(w_MP) + γ E_D(w_MP, b_MP) ).    (4.9)
Due to the second logarithmic term, this cost function is not convex, and it is
useful to start from different initial values for γ. The condition for optimality (∂J₄/∂γ = 0) is

    γ_MP = ( (N − γ_eff) / (γ_eff − 1) ) · ( E_W(w_MP) / E_D(w_MP, b_MP) ).    (4.10)
We also need the expressions for E_D and E_W in equations 4.8 and 4.9. It is explained in appendix B that these terms can be expressed in terms of the output vector Y and the eigenvalue decomposition of the centered kernel matrix MΩM:

    E_D(w_MP, b_MP) = (1/(2γ²)) Yᵀ M V_G (D_G + γ⁻¹ I_{N_eff})⁻² V_Gᵀ M Y,    (4.11)
    E_W(w_MP) = ½ Yᵀ M V_G D_G (D_G + γ⁻¹ I_{N_eff})⁻² V_Gᵀ M Y,    (4.12)
    E_W(w_MP) + γ E_D(w_MP, b_MP) = ½ Yᵀ M V_G (D_G + γ⁻¹ I_{N_eff})⁻¹ V_Gᵀ M Y.    (4.13)
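Once the eigenvalues λ_{G,i} and the projections V_Gᵀ M Y are available, J₄(γ) of equation 4.8 costs only a few vector operations per evaluation. A coarse grid-search sketch over log γ (the eigenvalues and targets below are illustrative; a proper scalar optimizer could replace the grid):

```python
import numpy as np

def J4(log_gamma, lam, p, N):
    """Level 2 cost J_4(gamma) of equation 4.8, with E_W and E_D evaluated
    via equations 4.11-4.12; lam holds the nonzero eigenvalues lambda_{G,i}
    and p = V_G^T M Y (illustrative inputs below)."""
    gamma = np.exp(log_gamma)
    denom = (lam + 1.0 / gamma) ** 2
    ED = 0.5 / gamma**2 * np.sum(p**2 / denom)
    EW = 0.5 * np.sum(lam * p**2 / denom)
    # the first sum of equation 4.8 runs to N-1, with lambda_{G,i} = 0 for i > N_eff
    n_zero = (N - 1) - len(lam)
    return (np.sum(np.log(lam + 1.0 / gamma)) + n_zero * np.log(1.0 / gamma)
            + (N - 1) * np.log(EW + gamma * ED))

rng = np.random.default_rng(1)
lam = np.array([4.0, 1.0, 0.3, 0.05, 0.01])   # here N_eff = N - 1 = 5
p = rng.normal(size=5)
grid = np.linspace(-6.0, 6.0, 241)
J = np.array([J4(g, lam, p, N=6) for g in grid])
gamma_MP = float(np.exp(grid[np.argmin(J)]))
```

The grid step in log γ bounds the relative accuracy of γ_MP; since the cost function is not convex (see below), a global scan of this kind is a reasonable starting point before any local refinement.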
When the eigenvalue decomposition, equation 3.15, is calculated, the optimization, equation 4.8, involves only vector products that can be evaluated very quickly. Although the eigenvalues λ_{G,i} have to be calculated only once, their calculation in the eigenvalue problem, equation 3.15, becomes computationally expensive for large data sets. In this case, one can choose to calculate only the largest eigenvalues in equation 3.15 using an expectation-maximization approach (Rosipal & Girolami, 2001), while the linear system, equation 2.19, can be solved using the Hestenes-Stiefel conjugate gradient algorithm (Suykens, 2000). The obtained α and b can also be used to derive the alternative expressions E_D = (1/(2γ²)) Σ_{i=1}^N α_i² and E_W = ½ Σ_{i=1}^N α_i ( y_i − α_i/γ − b_MP ) instead of using equations 4.11 and 4.12.

5 Bayesian Model Comparison (Level 3)
After determination of the hyperparameters μ_MP and ζ_MP on the second level of inference, we still have to select a suitable model H. For SVMs, different models correspond to different kernel functions K, for example, a linear kernel or an RBF kernel with tuning parameter σ. We describe how to rank different models H_j (j = 1, 2, ..., corresponding to, e.g., RBF kernels with different tuning parameters σ_j) in the Bayesian evidence framework (MacKay, 1999). By applying Bayes' rule on the third level, we obtain the posterior for the model H_j:

    p(H_j | D) ∝ p(D | H_j) p(H_j).    (5.1)
At this level, no evidence or normalizing constant is used since it is computationally infeasible to compare all possible models H_j. The prior p(H_j) over all possible models is assumed to be uniform here. Hence, equation 5.1 becomes p(H_j | D) ∝ p(D | H_j). The likelihood p(D | H_j) corresponds to the evidence (see equation 4.1) of the previous level. A separable gaussian prior p(log μ_MP, log ζ_MP | H_j) with error bars σ_{log μ} and σ_{log ζ} is assumed for all models H_j. To estimate the posterior analytically, it is assumed (MacKay, 1999) that the evidence p(log μ, log ζ | D, H_j) can be very well approximated by using a separable gaussian with error bars σ_{log μ|D} and σ_{log ζ|D}. As in section 4, the posterior p(D | H_j) then becomes (MacKay, 1995, 1999)

    p(D | H_j) ∝ p(D | log μ_MP, log ζ_MP, H_j) · ( σ_{log μ|D} σ_{log ζ|D} ) / ( σ_{log μ} σ_{log ζ} ).    (5.2)
Ranking of models according to model quality p(D | H_j) is thus based on the goodness of fit p(D | log μ_MP, log ζ_MP, H_j) from the previous level and the Occam factor (σ_{log μ|D} σ_{log ζ|D})/(σ_{log μ} σ_{log ζ}) (Gull, 1988; MacKay, 1995, 1999). Following a similar reasoning as in MacKay (1999), the error bars σ_{log μ|D} and σ_{log ζ|D} can be approximated as follows: σ²_{log μ|D} ≃ 2/(γ_eff − 1) and σ²_{log ζ|D} ≃ 2/(N − γ_eff). Using equations 4.1 and 4.7 in 5.2 and neglecting all constants yields

    p(D | H_j) ∝ √( μ_MP^{N_eff} ζ_MP^{N−1} / ( (γ_eff − 1)(N − γ_eff) ∏_{i=1}^{N_eff} (μ_MP + ζ_MP λ_{G,i}) ) ).    (5.3)
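Given μ_MP, ζ_MP, and the eigenvalue spectrum of each candidate kernel, equation 5.3 reduces to a scalar score per model. A sketch of ranking two hypothetical candidate kernels by their (log) evidence; all numbers are illustrative and the hyperparameters are simply fixed rather than inferred:

```python
import numpy as np

def log_evidence(mu, zeta, lam, N):
    """Logarithm of equation 5.3 (up to an additive constant); lam are the
    nonzero eigenvalues lambda_{G,i} of the candidate model's M Omega M."""
    gamma = zeta / mu
    geff = 1.0 + np.sum(gamma * lam / (1.0 + gamma * lam))  # equation 4.4
    return 0.5 * (len(lam) * np.log(mu) + (N - 1) * np.log(zeta)
                  - np.log(geff - 1.0) - np.log(N - geff)
                  - np.sum(np.log(mu + zeta * lam)))

N = 50
models = {"sigma=0.5": np.linspace(3.0, 0.1, 10),    # hypothetical spectra
          "sigma=2.0": np.linspace(1.0, 0.05, 10)}
scores = {name: log_evidence(1.0, 2.0, lam, N) for name, lam in models.items()}
best = max(scores, key=scores.get)
```

Working in the log domain avoids overflow in the product over eigenvalues; only the ordering of the scores matters for model selection.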
6 Design and Application of the LS-SVM Classifier
In this section, the theory of the previous sections is used to design the LS-SVM classifier in the Bayesian evidence framework. The obtained classifier is then used to assign class labels and class probabilities to new inputs x by using the probabilistic interpretation of the LS-SVM classifier.

6.1 Design of the LS-SVM Classifier in the Evidence Framework. The design of the LS-SVM classifier consists of the following steps:

1. The inputs are normalized to zero mean and unit variance (Bishop, 1995). The normalized training data are denoted by D = {(x_i, y_i)}_{i=1}^N, with x_i the normalized inputs and y_i ∈ {−1, 1} the corresponding class label.

2. Select the model H_j by choosing a kernel type K_j (possibly with a kernel parameter, e.g., σ_j for an RBF kernel). For this model H_j, the optimal hyperparameters μ_MP and ζ_MP are estimated on the second
1136
T. Van Gestel, et al.
level of inference. This is done as follows. (a) Estimate the N_eff important eigenvalues (and eigenvectors) of the eigenvalue problem, equation 3.15, to obtain D_G (and V_G). (b) Solve the scalar optimization problem, equation 4.8, in γ = ζ/μ with cost function 4.8 and gradient 4.9. (c) Use the optimal γ_MP to calculate μ_MP from equation 4.7, while ζ_MP = μ_MP γ_MP. Calculate the effective number of parameters γ_eff from equation 4.4.

3. Calculate the model evidence p(D | H_j) from equation 5.3.

4. For a kernel K_j with tuning parameters, refine the tuning parameters. For example, for the RBF kernel with tuning parameter σ_j, refine σ_j such that a higher model evidence p(D | H_j) is obtained. This can be done by maximizing the model evidence with respect to σ_j by evaluating the model evidence for the refined kernel parameter starting from step 2a.

5. Select the model H_j with maximal model evidence p(D | H_j). Go to step 2, unless the best model has been selected.

For a kernel function without tuning parameters, like the linear kernel or a polynomial kernel with (already) fixed degree d, steps 2 and 4 are trivial, since no tuning parameter of the kernel has to be chosen in step 2 and no refining of the tuning parameter is needed in step 4. The model evidence obtained at step 4 can then be used to rank the different kernel types and select the most appropriate kernel function.

6.2 Decision Making with the LS-SVM Classifier. The designed LS-SVM classifier H_j is now used to calculate class probabilities. By combination of these class probabilities with Bayesian decision theory (Duda & Hart, 1973), class labels are assigned in an optimal way. The classification is done in the following steps:
1. Normalize the input in the same way as the training data D. The normalized input vector is denoted by x.

2. Assuming that the parameters α, b_MP, μ_MP, ζ_MP, γ_MP, γ_eff, D_G, U_G, N_eff are available from the design of the model H_j, one can calculate m_{e+}, m_{e−}, σ_{e+}², and σ_{e−}² from equations 3.9 and 3.11, respectively. Compute ζ_* from ζ_*⁻¹ = (N − γ_eff)⁻¹ ( Σ_{i∈I₊} e²_{i,+} + Σ_{i∈I₋} e²_{i,−} ).

3. Calculate p(x | y = +1, D, log μ_MP, log ζ_MP, log ζ_*, H) and p(x | y = −1, D, log μ_MP, log ζ_MP, log ζ_*, H) from equation 3.16.

4. Calculate P(y | x, D, H_j) from equation 3.17 using the prior class probabilities P(y = +1) and P(y = −1). When these prior class probabilities are not available, compute P(y = +1) = N₊/N and P(y = −1) = N₋/N.

5. Assign the class label to the class with maximal posterior P(y | x, D, H_j).
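The level 1 training step underlying this procedure solves a linear system in the dual space. Equation 2.19 is referenced but not reproduced in this excerpt, so the sketch below assumes the standard LS-SVM dual KKT system [0, 1ᵀ; 1, Ω + I/γ][b; α] = [0; y], with a plain sign decision in place of the moderated output; the data and the fixed hyperparameters are synthetic and purely illustrative:

```python
import numpy as np

def lssvm_train(X, y, gamma, sigma):
    """Solve the LS-SVM dual linear system (assumed standard form of the
    KKT system, equation 2.19): [0, 1^T; 1, Omega + I/gamma][b; alpha] = [0; y]."""
    N = len(y)
    sq = np.sum(X**2, axis=1)
    Omega = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / sigma**2)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                      # b_MP, alpha

def lssvm_classify(Xnew, Xtr, alpha, b, sigma):
    """Sign decision (equation 2.22), without the moderated-output correction."""
    d2 = (np.sum(Xnew**2, 1)[:, None] + np.sum(Xtr**2, 1)[None, :]
          - 2.0 * Xnew @ Xtr.T)
    return np.sign(np.exp(-d2 / sigma**2) @ alpha + b)

# Two well-separated gaussian clouds, fixed hyperparameters for brevity:
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.3, (20, 2)), rng.normal(2.0, 0.3, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
b, alpha = lssvm_train(X, y, gamma=10.0, sigma=1.0)
acc = np.mean(lssvm_classify(X, X, alpha, b, sigma=1.0) == y)
```

In the full design procedure, γ and σ would of course be inferred on levels 2 and 3 rather than fixed by hand.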
Figure 2: Contour plot of the posterior class probability P(y = +1 | x, D, H) for the rsy data set. The training data are marked with + and × for class y = +1 and y = −1, respectively.
7 Examples
The synthetic binary classification benchmark data set from Ripley (1996) is used to illustrate the theory of this article. Randomized test set performances of the Bayesian LS-SVM are reported on 10 binary classification data sets.

7.1 Design of the Bayesian LS-SVM: A Case Study. We illustrate the design of the LS-SVM classifier within the evidence framework on the synthetic data set (rsy) from Ripley (1996). The data set consists of a training and test set of N = 250 and N_test = 1000 data points, respectively. There are two inputs (n = 2), and each class is an equal mixture of two normal distributions with the same covariance matrices. Both classes have the same prior probability P(y = +1) = P(y = −1) = 1/2. The training data are visualized in Figure 2, with class +1 and class −1 depicted by + and ×, respectively. In a first step, both inputs x^(1) and x^(2) were normalized to zero mean and unit variance (Bishop, 1995). For the kernel function K of the model H, an RBF kernel with parameter σ was chosen. Assuming a flat prior on the value of log σ, the optimal σ_MP was selected by maximizing p(D | H_j) = p(D | log σ_j), given by equation 5.3. The maximum likelihood is obtained for σ_MP = 1.3110. This yields a test set performance of 90.6% for both LS-SVM classifiers. We also trained a gaussian process for classification with the Flexible Bayesian Modeling toolbox (Neal, 1997). A logistic model with constant term and RBF kernel in the covariance function yielded an average test set performance of 89.9%, which is not significantly different from the LS-SVM result given the standard deviation from Table 2. This table is discussed further in the next section. In the logistic model, the parameters are directly optimized with respect to the output probability using sampling techniques. The LS-SVM classifier formulation assumes a gaussian distribution on the errors between the projected feature vectors and the targets (or class centers), which allows deriving analytic expressions on the three levels of inference.

Figure 3: The posterior class probability P(y = +1 | x, D, H) as a function of the inputs x^(1) and x^(2) for the rsy data set.

The evolution of the posterior class probabilities P(y = +1 | x, D, H) is plotted in Figure 3 for x^(1) ∈ [−1.5, 1.5] and x^(2) ∈ [−0.5, 1.5]. The corresponding contour plot is given in Figure 2, together with the location of the training points. Notice how the uncertainty on the class labels increases as the new input x is farther away from the training data: the value z_MP decreases while the variance σ_z² increases when moving away from the training data.

We also intentionally unbalanced the test set by defining a new test set from the original set: the negatively and positively labeled instances of
the original set are repeated three times and once in the new set, respectively. This corresponds to prior class probabilities P(y = −1) = 0.75 and P(y = +1) = 0.25. Not taking these class probabilities into account, a test set accuracy of 90.9% is obtained, while one achieves a classification performance of 92.5% when the prior class probabilities are taken into account.

7.2 Empirical Results on Binary Classification Data Sets. The test set classification performance of the Bayesian (Bay) LS-SVM classifier with RBF kernel was assessed on 10 publicly available binary classification data sets. We compare the results with LS-SVM and SVM classification and GP regression (LS-SVM without bias term), where the hyperparameter and kernel parameter are tuned by 10-fold cross-validation (CV10). The BUPA Liver Disorders (bld), the Statlog German Credit (gcr), the Statlog Heart Disease (hea), the Johns Hopkins University Ionosphere (ion), the Pima Indians Diabetes (pid), the Sonar (snr), and the Wisconsin Breast Cancer (wbc) data sets were retrieved from the UCI benchmark repository (Blake & Merz, 1998). The synthetic data set (rsy) and the Leptograpsus crabs (cra) data set are described in Ripley (1996). The Titanic data (tit) was obtained from Delve. Each data set was split up into a training (2/3) and test set (1/3), except for the rsy data set, where we used N = 250 and N_test = 1000. Each data set was randomized 10 times in order to reduce possible sensitivities in the test set performances to the choice of training and test set. For each randomization, the design procedure from section 6.1 was used to estimate μ and ζ from the training data for the Bayesian LS-SVM, while selecting σ from a candidate set S = [σ_1, σ_2, ..., σ_{N_s}] using model comparison. The classification decision (LS-SVM BayM) is made by the Bayesian decision rule, equation 3.17, using the moderated output, and is compared with the classifier, equation 2.22, which is denoted by (LS-SVM Bay).
A 10-fold cross-validation (LS-SVM, SVM, and GP CV10) procedure was used to select the parameters³ γ or C and σ yielding the best CV10 performance from the set S and an additional set [γ_1, γ_2, ..., γ_{N_γ}]. The same sets were used for each algorithm. In a second step, more refined sets were defined for each algorithm⁴ in order to select the optimal parameters. The classification decisions were obtained from equation 2.4 with the corresponding w and b determined in the dual space for each algorithm. We also designed the GP regressor within the evidence framework for a GP with RBF kernel (GP Bay) and for a GP with RBF kernel and an additional bias term b in the kernel function (GPb Bay). In Table 1, we report the average test set performance and sample standard deviation on ten randomizations for each data set (De Groot, 1986).

³ Notice that the parameter C of the SVM plays a similar role as the parameter γ of the LS-SVM.
⁴ We used the Matlab SVM toolbox (Cawley, 2000), while the GP CV10 was obtained from the linear system, equation 2.23.
Table 1: Comparison of the 10 Times Randomized Test Set Performances of LS-SVMs, GPs, and SVM.

         n     N  Ntest  Ntot  LS-SVM     LS-SVM     LS-SVM     SVM        GP         GPb        GP
                               (BayM)     (Bay)      (CV10)     (CV10)     (Bay)      (Bay)      (CV10)
  bld    6   230    115   345  69.4(2.9)  69.4(3.1)  69.4(3.4)  69.2(3.5)  69.2(2.7)  68.9(3.3)  69.7(4.0)
  cra    6   133     67   200  96.7(1.5)  96.7(1.5)  96.9(1.6)  95.1(3.2)  96.4(2.5)  94.8(3.2)  96.9(2.4)
  gcr   20   666    334  1000  73.1(3.8)  73.5(3.9)  75.6(1.8)  74.9(1.7)  76.2(1.4)  75.9(1.7)  75.4(2.0)
  hea   13   180     90   270  83.6(5.1)  83.2(5.2)  84.3(5.3)  83.4(4.4)  83.1(5.5)  83.7(4.9)  84.1(5.2)
  ion   33   234    117   351  95.6(0.9)  96.2(1.0)  95.6(2.0)  95.4(1.7)  91.0(2.3)  94.4(1.9)  92.4(2.4)
  pid    8   512    256   768  77.3(3.1)  77.5(2.8)  77.3(3.0)  76.9(2.9)  77.6(2.9)  77.5(2.7)  77.2(3.0)
  rsy    2   250   1000  1250  90.2(0.7)  90.2(0.6)  89.6(1.1)  89.7(0.8)  90.2(0.7)  90.1(0.8)  89.9(0.8)
  snr   60   138     70   208  76.7(5.6)  78.0(5.2)  77.9(4.2)  76.3(5.3)  78.6(4.9)  75.7(6.1)  76.6(7.2)
  tit    3  1467    734  2201  78.8(1.1)  78.7(1.1)  78.7(1.1)  78.7(1.1)  78.5(1.0)  77.2(1.9)  78.7(1.2)
  wbc    9   455    228   683  95.9(0.6)  95.7(0.5)  96.2(0.7)  96.2(0.8)  95.8(0.7)  93.7(2.0)  96.5(0.7)

  Average performance          83.7       83.9       84.1       83.6       83.7       83.2       83.7
  Average ranks                 2.3        2.5        2.5        3.8        3.2        4.2        2.6
  Probability of a sign test   1.000      0.754      1.000      0.344      0.754      0.344      0.508

Notes: Both CV10 and the Bayesian (Bay) framework were used to design the LS-SVMs. For the Bayesian LS-SVM, the class label was assigned using the moderated output (BayM). An RBF kernel was used for all models. The model GPb has an extra bias term in the kernel function. The average performance, average rank, and probability of equal medians using the sign test taken over all domains are reported in the last three rows. Best performances are underlined and denoted in boldface, performances not significantly different at the 5% level are denoted in boldface, and performances significantly different at the 1% level are emphasized. No significant differences are observed between the different algorithms.
The best average test set performance was underlined and denoted in boldface for each data set. Boldface type is used to tabulate performances that are not significantly different at the 5% level from the top performance using a two-tailed paired t-test. Statistically significant underperformances at the 1% level are emphasized. Other performances are tabulated using normal type. Since the observations are not independent, we remark that the t-test is used here only as a heuristic approach to show that the average accuracies on the 10 randomizations can be considered to be different. Ranks are assigned to each algorithm starting from 1 for the best average performance. Averaging over all data sets, a Wilcoxon signed rank test of equality of medians is carried out on both average performance (AP) and average ranks (AR). Finally, the significance probability of a sign test (PST) is reported comparing each algorithm to the algorithm with best performance (LS-SVM CV10). These results are denoted in the same way as the performances on each individual data set. No significant differences are obtained between the different algorithms. Comparing SVM CV10 with GP and LS-SVM CV10, it is observed that similar results are obtained with all three algorithms, which means that the loss of sparseness does not result in a degradation of the generalization performance on these data sets. It is also observed that the LS-SVM and
GP designed within the evidence framework yield consistently comparable results when compared with CV10, which indicates that the gaussian assumptions of the evidence framework hold well for the natural domains at hand. Estimating the bias term b in the GP kernel function by Bayesian inference on level 3 yields comparable but different results from the LS-SVM formulation, where the bias term b is obtained on the first level. Finally, it is observed that assigning the class label from the moderated output, equation 3.17, also yields comparable results with respect to the classifier 2.22, but the latter formulation does yield an analytic expression to adjust the bias term for different prior class probabilities, which is useful, for example, in the case of unbalanced training and test sets or in the case of different classification costs.
8 Conclusion
In this article, a Bayesian framework has been related to the LS-SVM classifier formulation. This least-squares formulation was obtained by modifying the SVM formulation and implicitly corresponds to a regression problem with binary targets. The LS-SVM formulation is also related to kernel Fisher discriminant analysis. Without the bias term in the LS-SVM formulation, the dual space formulation is equivalent to GPs for regression. The least-squares regression approach to classification allows deriving analytic expressions on the different levels of inference. On the first level, the model parameters are obtained from solving a linear Karush-Kuhn-Tucker system in the dual space. The regularization hyperparameters are obtained from a scalar optimization problem on the second level, while the kernel parameter is determined on the third level by model comparison. Starting from the LS-SVM feature space formulation, the analytic expressions obtained in the dual space are similar to the expressions obtained for GPs. The use of an unregularized bias term in the LS-SVM formulation results in a zero-mean training error and an implicit centering of the Gram matrix in the feature space, also used in kernel PCA. The corresponding eigenvalues can be used to obtain improved bounds for SVMs. Within the evidence framework, these eigenvalues are used to control the capacity by the regularization term. Class probabilities are obtained within the defined probabilistic framework by marginalizing over the model parameters and hyperparameters. By combination of the posterior class probabilities with an appropriate decision rule, class labels can be assigned in an optimal way. Comparing LS-SVM, SVM classification, and GP regression with binary targets on 10 normalized public domain data sets, no significant differences in performance are observed.
The gaussian assumptions made in the LS-SVM formulation and the related Bayesian framework allow obtaining analytical expressions on all levels of inference using reliable numerical techniques and algorithms.
Appendix A: Derivations Level 1 Inference
In the expression for the variances σ_{e+}² and σ_{e−}², the upper left n_f × n_f block of the covariance matrix Q = H⁻¹ is needed. Therefore, we first calculate the inverse of the block Hessian H. Using Υ = [ϕ(x₁), ..., ϕ(x_N)], with Ω = ΥᵀΥ, the expressions for the block matrices in the Hessian, equation 3.7, are H₁₁ = μI_{n_f} + ζΥΥᵀ, H₁₂ = ζΥ1_v, and H₂₂ = Nζ. Calculating the inverse of the block matrix, the inverse Hessian is obtained as follows:

    H⁻¹ = ( [ I_{n_f}  X ] [ H₁₁ − H₁₂H₂₂⁻¹H₁₂ᵀ   0  ] [ I_{n_f}  0 ] )⁻¹
          ( [ 0        1 ] [ 0                   H₂₂ ] [ Xᵀ       1 ] )                                               (A.1)

        = [ (μI_{n_f} + ζG)⁻¹                  −(μI_{n_f} + ζG)⁻¹ H₁₂ H₂₂⁻¹                      ]
          [ −H₂₂⁻¹ H₁₂ᵀ (μI_{n_f} + ζG)⁻¹      H₂₂⁻¹ + H₂₂⁻¹ H₁₂ᵀ (μI_{n_f} + ζG)⁻¹ H₁₂ H₂₂⁻¹  ],                     (A.2)

with G = ΥMΥᵀ, X = H₁₂H₂₂⁻¹, and where M = I_N − (1/N) 1_v 1_vᵀ is the idempotent centering matrix with rank N − 1. Observe that the above derivation results in a centered Gram matrix G, as is also done in kernel PCA (Schölkopf et al., 1998). The eigenvalues of the centered Gram matrix can be used to derive improved bounds for SVM classifiers (Schölkopf et al., 1999). In the Bayesian framework of this article, the eigenvalues of the centered Gram matrix are also used on levels 2 and 3 of Bayesian inference to determine the amount of weight decay and select the kernel parameter, respectively.

The inverse (μI_{n_f} + ζG)⁻¹ will be calculated using the eigenvalue decomposition of the symmetric matrix G = Gᵀ = P D_{G,f} Pᵀ = P₁ D_G P₁ᵀ, with P = [P₁ P₂] a unitary matrix and with the diagonal matrix D_G = diag([λ_{G,1}, ..., λ_{G,N_eff}]) containing the N_eff nonzero diagonal elements of the full-size diagonal matrix D_{G,f} ∈ ℝ^{n_f×n_f}. The matrix P₁ corresponds to the eigenspace of the nonzero eigenvalues, and the null space is denoted by P₂. There are maximally N − 1 eigenvalues λ_{G,i} > 0, and their corresponding eigenvectors ν_{G,i} are a linear combination of ΥM: ν_{G,i} = c_{G,i} ΥM v_{G,i}, with c_{G,i} a normalization constant such that ν_{G,i}ᵀ ν_{G,i} = 1. The eigenvalue problem we need to solve is the following:
    ΥMΥᵀ (ΥM v_{G,i}) = λ_{G,i} (ΥM v_{G,i}).    (A.3)
Multiplication of equation A.3 to the left with MU T and applying the Mercer condition yields MVMVMvG,i D lG,i MVMvG,i , which is a generalized eigenvalue problem of dimension N. An eigenvector vG,i corresponding to a nonzero eigenvalue lG,i is also a solution of MVMvG,i D lG,i vG,i ,
(A.4)
since M \Omega M \Omega M v_{G,i} = \lambda_{G,i} M \Omega M v_{G,i} \neq 0. By applying the normality condition of \nu_{G,i}, which corresponds to c_{G,i} = 1 / \sqrt{v_{G,i}^T M \Omega M v_{G,i}}, one finally obtains P_1 = [\nu_{G,1} \ldots \nu_{G,N_{eff}}] = \Upsilon M U_G, where \nu_{G,i} = \frac{1}{\sqrt{v_{G,i}^T M \Omega M v_{G,i}}} \Upsilon M v_{G,i} and U_G(:, i) = \frac{1}{\sqrt{v_{G,i}^T M \Omega M v_{G,i}}} v_{G,i} = \lambda_{G,i}^{-1/2} v_{G,i}, i = 1, \ldots, N_{eff}. The remaining n_f - N_{eff} dimensional orthonormal null space P_2 of G cannot be explicitly calculated, but using the fact that [P_1 \; P_2] is a unitary matrix, we have P_2 P_2^T = I_{n_f} - P_1 P_1^T. Observe that this is different from Kwok (1999, 2000), where the space P_2 is neglected. This yields

(\mu I_{n_f} + \zeta G)^{-1} = P_1 \big( (\mu I_{N_{eff}} + \zeta D_G)^{-1} - \mu^{-1} I_{N_{eff}} \big) P_1^T + \mu^{-1} I_{n_f}.

By defining \theta(x) = \Upsilon^T \varphi(x) and applying the Mercer condition in equation 3.10, one finally obtains expression 3.11. For large N, the calculation of all eigenvalues \lambda_{G,i} and corresponding eigenvectors \nu_{G,i}, i = 1, \ldots, N, may require long computations. One may expect that little error is introduced by setting to zero the N - r_G smallest eigenvalues, \mu \gg \zeta \lambda_{G,i}, of G = G^T \geq 0. This corresponds to an optimal rank-r_G approximation of (\mu I_{n_f} + \zeta G)^{-1}. Indeed, calling G_R the rank-r_G approximation of G, we minimize \| (\mu I_{n_f} + \zeta G)^{-1} - (\mu I_{n_f} + \zeta G_R)^{-1} \|_F over G_R. Using the Sherman-Morrison-Woodbury formula (Golub & Van Loan, 1989), this becomes \| (\mu I_{n_f} + \zeta G)^{-1} - (\mu I_{n_f} + \zeta G_R)^{-1} \|_F = \| \frac{\zeta}{\mu} (\mu I_{n_f} + \zeta G)^{-1} G - \frac{\zeta}{\mu} (\mu I_{n_f} + \zeta G_R)^{-1} G_R \|_F. The optimal rank-r_G approximation of (\mu I_{n_f} + \zeta G)^{-1} G is obtained by setting its smallest eigenvalues to zero. Using the eigenvalue decomposition of G, these eigenvalues are \lambda_{G,i} / (\mu + \zeta \lambda_{G,i}). The smallest eigenvalues of (\mu I_{n_f} + \zeta G)^{-1} G correspond to the smallest eigenvalues of G. Hence, the optimal rank-r_G approximation is obtained by setting the smallest N - r_G eigenvalues to zero. Also notice that \sigma_z^2 is increased by setting \lambda_{G,i} equal to zero; a decrease of the variance would introduce a false amount of certainty in the output.
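As a small numeric sketch of this rank truncation (a toy data set and an RBF kernel chosen by us for illustration; all names below are ours, not the article's), the eigenvalues of the centered kernel matrix M \Omega M drive the spectrum of (\mu I + \zeta G)^{-1} G, and discarding eigenvalues with \mu \gg \zeta \lambda changes that spectrum little:

```python
import numpy as np

# Illustrative sketch: eigenvalues of the centered kernel matrix M*Omega*M
# coincide with the nonzero eigenvalues lambda_{G,i} of G.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # N = 20 toy inputs

N = X.shape[0]
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-sq / 2.0)             # RBF kernel matrix (assumed kernel choice)

M = np.eye(N) - np.ones((N, N)) / N   # centering matrix, rank N - 1
MOM = M @ Omega @ M

lam = np.linalg.eigvalsh(MOM)         # eigenvalues, ascending order
mu, zeta = 1.0, 10.0

# Eigenvalues of (mu*I + zeta*G)^{-1} G are lambda / (mu + zeta*lambda);
# zeroing the smallest lambda gives the optimal rank-r approximation.
full = lam / (mu + zeta * lam)
r = 5
kept = np.where(lam >= lam[-r], lam, 0.0)
trunc = kept / (mu + zeta * kept)
err = np.abs(full - trunc).max()
print(err)
```

The printed error equals the largest discarded value \lambda / (\mu + \zeta \lambda), which is small exactly when the discarded eigenvalues satisfy \mu \gg \zeta \lambda.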
Appendix B: Derivations for Level 2 and 3 Inference

First, an expression for det(H) is given using the eigenvalues of G. By block-diagonalizing equation 3.7, det(H) is not changed (see, e.g., Horn & Johnson, 1991). From equation A.1, we obtain det H = N \zeta \det(\mu I_{n_f} + \zeta G). The determinant is the product of the eigenvalues; this yields det H = N \zeta \, \mu^{n_f - N_{eff}} \prod_{i=1}^{N_{eff}} (\mu + \zeta \lambda_{G,i}). Due to the regularization term \mu E_W, the Hessian is regular; the inverse exists, and we can write det H^{-1} = 1 / \det H. Using equation 2.30, the simulated error e_i = y_i - (w_{MP}^T \varphi(x_i) + b_{MP}) can also be written as e_i = y_i - \hat{m}_Y - w_{MP}^T (\varphi(x_i) - \hat{m}_\Upsilon), with \hat{m}_Y = \sum_{i=1}^N y_i / N and \hat{m}_\Upsilon = \sum_{i=1}^N \varphi(x_i) / N. Since w_{MP} = (\Upsilon M \Upsilon^T + \gamma^{-1} I_{n_f})^{-1} \Upsilon M Y, the error term E_D(w_{MP}, b_{MP}) is equal to

E_D(w_{MP}, b_{MP}) = \frac{1}{2} (Y - \hat{m}_Y 1_v)^T \big( I_N - M \Upsilon^T (\Upsilon M \Upsilon^T + \gamma^{-1} I_{n_f})^{-1} \Upsilon M \big)^2 (Y - \hat{m}_Y 1_v)
                    = \frac{1}{2 \gamma^2} Y^T M V_G (D_G + \gamma^{-1} I_{N_{eff}})^{-2} V_G^T M Y,   (B.1)
where we used the eigenvalue decomposition of G = \Upsilon M \Upsilon^T. In a similar way, one can obtain the expression for E_W in the dual space, starting from w_{MP} = (\Upsilon M \Upsilon^T + \gamma^{-1} I_{n_f})^{-1} \Upsilon M Y:

E_W(w_{MP}) = \frac{1}{2} Y^T M V_G D_G (D_G + \gamma^{-1} I_{N_{eff}})^{-2} V_G^T M Y.   (B.2)
The sum E_W(w_{MP}) + \gamma E_D(w_{MP}, b_{MP}) is then equal to

E_W(w_{MP}) + \gamma E_D(w_{MP}, b_{MP}) = \frac{1}{2} Y^T M V_G (D_G + \gamma^{-1} I_{N_{eff}})^{-1} V_G^T M Y
                                       = \frac{1}{2} Y^T M (M \Omega M + \gamma^{-1} I_N)^{-1} M Y,   (B.3)
which is the same expression as obtained with GP when no centering M is applied on the outputs Y and the feature vectors \Upsilon.

Acknowledgments

We thank the two anonymous reviewers for constructive comments and also thank David MacKay for helpful comments related to the second and third levels of inference. T. Van Gestel and J. A. K. Suykens are a research assistant and a postdoctoral researcher, respectively, with the Fund for Scientific Research-Flanders (FWO-Vlaanderen) at the K.U.Leuven. This work was partially supported by grants and projects from the Flemish government (Research Council KULeuven: Grants, GOA-Mefisto 666; FWO-Vlaanderen: Grants, res. proj. G.0240.99, G.0256.97, and comm. (ICCoS and ANMMM); AWI: Bil. Int. Coll.; IWT: STWW Eureka SINOPSYS, IMPACT); from the Belgian federal government (Interuniv. Attr. Poles: IUAP-IV/02, IV/24; Program Dur. Dev.); and from the European Community (TMR Netw. (Alapedes, Niconet); Science: ERNSI). The scientific responsibility is assumed by its authors.

References

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available on-line: www.ics.uci.edu/~mlearn/MLRepository.html.
Brown, P. J. (1977). Centering and scaling in ridge regression. Technometrics, 19, 35–36.
Cawley, G. C. (2000). MATLAB support vector machine toolbox (v0.54b). Norwich, Norfolk, U.K.: University of East Anglia, School of Information Systems. Available on-line: http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
De Groot, M. H. (1986). Probability and statistics (2nd ed.). Reading, MA: Addison-Wesley.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Evgeniou, T., Pontil, M., Papageorgiou, C., & Poggio, T. (2000). Image representations for object detection using kernel classifiers. In Proc. Fourth Asian Conference on Computer Vision (ACCV 2000) (pp. 687–692). Taipei, Taiwan.
Gibbs, M. N. (1997). Bayesian gaussian processes for regression and classification. Unpublished doctoral dissertation, University of Cambridge.
Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore, MD: Johns Hopkins University Press.
Gull, S. F. (1988). Bayesian inductive inference and maximum entropy. In G. J. Erickson & R. Smith (Eds.), Maximum-entropy and Bayesian methods in science and engineering (Vol. 1, pp. 73–74). Norwell, MA: Kluwer.
Horn, R. A., & Johnson, C. R. (1991). Topics in matrix analysis. Cambridge: Cambridge University Press.
Kwok, J. T. (1999). Moderating the outputs of support vector machine classifiers. IEEE Trans. on Neural Networks, 10, 1018–1031.
Kwok, J. T. (2000). The evidence framework applied to support vector machines. IEEE Trans. on Neural Networks, 11, 1162–1173.
MacKay, D. J. C. (1992). The evidence framework applied to classification networks. Neural Computation, 4(5), 698–714.
MacKay, D. J. C. (1995). Probable networks and plausible predictions—A review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469–505.
MacKay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11(5), 1035–1068.
Mika, S., Rätsch, G., & Müller, K.-R. (2001). A mathematical programming approach to the Kernel Fisher algorithm. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 591–597). Cambridge, MA: MIT Press.
Mukherjee, S., Tamayo, P., Mesirov, J. P., Slonim, D., Verri, A., & Poggio, T. (1999). Support vector machine classification of microarray data (CBCL Paper 182/AI Memo 1676). Cambridge, MA: MIT.
Neal, R. M. (1997). Monte Carlo implementation of gaussian process models for Bayesian regression and classification (Tech. Rep. No. CRG-TR-97-2). Toronto: Department of Computer Science, University of Toronto.
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers. Cambridge, MA: MIT Press.
Rasmussen, C. (1996). Evaluation of gaussian processes and other methods for nonlinear regression. Unpublished doctoral dissertation, University of Toronto, Canada.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rosipal, R., & Girolami, M. (2001). An expectation-maximization approach to nonlinear component analysis. Neural Computation, 13(3), 505–510.
Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual variables. In Proc. of the 15th Int. Conf. on Machine Learning (ICML-98) (pp. 515–521). Madison, WI.
Schölkopf, B., Shawe-Taylor, J., Smola, A., & Williamson, R. C. (1999). Kernel-dependent support vector error bounds. In Proc. of the 9th Int. Conf. on Artificial Neural Networks (ICANN-99) (pp. 304–309). Edinburgh, UK.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Shashua, A. (1999). On the equivalence between the support vector machine for classification and sparsified Fisher's linear discriminant. Neural Processing Letters, 9, 129–139.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Sollich, P. (2000). Probabilistic methods for support vector machines. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Suykens, J. A. K. (2000). Least squares support vector machines for classification and nonlinear modelling. Neural Network World, 10, 29–48.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.
Suykens, J. A. K., & Vandewalle, J. (2000). Recurrent least squares support vector machines. IEEE Transactions on Circuits and Systems-I, 47, 1109–1114.
Suykens, J. A. K., Vandewalle, J., & De Moor, B. (2001). Optimal control by least squares support vector machines. Neural Networks, 14, 23–35.
Van Gestel, T., Suykens, J. A. K., Baestaens, D.-E., Lambrechts, A., Lanckriet, G., Vandaele, B., De Moor, B., & Vandewalle, J. (2001). Predicting financial time series using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, 12, 809–812.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Williams, C. K. I. (1998). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan (Ed.), Learning and inference in graphical models. Norwell, MA: Kluwer Academic Press.
Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342–1351.

Received July 6, 2000; accepted September 12, 2001.
LETTER
Communicated by Sara Solla
Neural Network Pruning with Tukey-Kramer Multiple Comparison Procedure

Donald E. Duckro
[email protected]
Dennis W. Quinn
dennis.quinn@afit.edu
Samuel J. Gardner III
Air Force Institute of Technology, Department of Mathematics and Statistics, Wright-Patterson Air Force Base, Ohio 45433, U.S.A.
Reducing a neural network's complexity improves its ability to generalize to future examples. Like an overfitted regression function, neural networks may miss their target because of the excessive degrees of freedom stored up in unnecessary parameters. Over the past decade, work on pruning networks has produced nonstatistical algorithms like Skeletonization, Optimal Brain Damage, and Optimal Brain Surgeon as methods to remove connections with the least salience. The method proposed here uses the bootstrap algorithm to estimate the distribution of the model parameter saliences. Statistical multiple comparison procedures are then used to make pruning decisions. We show this method compares well with Optimal Brain Surgeon in terms of ability to prune and the resulting network performance.
1 Introduction
Large neural networks are easy to fit to a data set but may be less than adequate in their ability to predict future examples. The more complex the neural network model, the easier it is to essentially "memorize" the data (i.e., obtain a near-perfect fit). This is similar to the situation in an overspecified linear statistical model, and the associated problems (larger variances, poor performance) are observed for both classes of models. In order to find the best set of parameters and variables to use in a given model, some model selection procedure must often be used. In the neural network literature, these procedures are often called pruning algorithms. Popular algorithms include Skeletonization, Optimal Brain Damage, and Optimal Brain Surgeon. In this article, we propose a novel method, based on statistical inference procedures, to perform neural network pruning. Section 2 gives the necessary background for an understanding of neural networks
Neural Computation 14, 1149–1168 (2002) © 2002 Massachusetts Institute of Technology
and network pruning and the statistical procedures we incorporate into our method. Section 3 presents the Tukey-Kramer multiple comparison pruning algorithm. Section 4 presents examples and simulation results comparing our proposed method with Optimal Brain Surgeon. Many of the terms used in the neural network community may be unfamiliar to the statistician. Wherever necessary, the corresponding statistical nomenclature is included (emphasized in parentheses) in the text.

2 Background

2.1 Neural Networks Overview. Neural networks (NNs) represent a wide class of flexible nonlinear models. This article focuses on feedforward neural networks as predictors (regression, discrimination), but the results may be applied to several types of NNs. A comprehensive introduction to NNs is given by Bishop (1995). Equations 2.1 through 2.3 illustrate the general form of a feedforward NN's output:
y_k = f_k(x_1, x_2, \ldots, x_d), \quad k = 1, \ldots, O,   (2.1)
where

f_k(x_1, x_2, \ldots, x_d) = f_o \Big( w_{0k}^{(2)} + \sum_{j=1}^{l} w_{jk}^{(2)} a_j \Big)   (2.2)

and

a_j = f_h \Big( w_{0j}^{(1)} + \sum_{i=1}^{d} w_{ij}^{(1)} x_i \Big).   (2.3)
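Equations 2.1 through 2.3 can be sketched directly in code (a minimal illustration with logistic activations in both layers, the choice this article adopts; the network sizes and random weights below are ours):

```python
import numpy as np

def logistic(u):
    # f_o(u) = f_h(u) = 1 / (1 + e^{-u}), the activation used in this article
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W1, b1, W2, b2):
    """Feedforward NN of equations 2.1-2.3.
    x: (d,) inputs; W1: (l, d) hidden weights w^(1); b1: (l,) hidden biases;
    W2: (O, l) output weights w^(2); b2: (O,) output biases."""
    a = logistic(b1 + W1 @ x)      # hidden activations a_j (eq. 2.3)
    return logistic(b2 + W2 @ a)   # outputs y_k (eqs. 2.1-2.2)

# Toy network: d = 3 inputs, l = 2 hidden nodes, O = 1 output
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)
y = forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)
print(y)   # a value in (0, 1)
```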
The inputs (independent variables), x_i, and the outputs (dependent variables), y_k, are connected together by two activation functions, f_o and f_h. These functions are often nonlinear and monotonic. Throughout this article, we will use f_o(u) = f_h(u) = (1 + e^{-u})^{-1}. The w's are known as weights (parameters). The subscripts on the activation functions indicate whether they are used in the hidden (h) or the output (o) layer of the network. In the hidden layer of the network, there are l hidden nodes, and for each of the O output variables, there is an output node f_k. Using this terminology, one can picture the NN as a diagram of connected inputs, hidden nodes, and outputs. An example of a network diagram for a feedforward NN with one output, three hidden nodes, and multiple inputs is given in Figure 1 (see section 2.5). From a training set X = \{(y_i^T, x_i^T) = (y_1, y_2, \ldots, y_O, x_1, x_2, \ldots, x_d)_i, i = 1, \ldots, N\}, the weights w_{ij}^{(m)} are learned (estimated) using an iterative algorithm
designed to minimize an error function, E. For a regression-prediction problem, the typical error function is the sum of squared errors (SSE), given by

E(w) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^T (y_i - \hat{y}_i),   (2.4)
where w is the vector of the weights used in the NN and \hat{y}_i = (\hat{y}_{1,i}, \ldots, \hat{y}_{O,i})^T are the output-layer predictions for the ith exemplar (training sample point) with the estimates for the weights substituted in (i.e., \hat{y}_{k,i} = f_k(x_i; w)). The SSE is a summation of the squared deviations of each observation of the training set from the predicted output. A gradient-descent method is used to find the vector \hat{w} that minimizes the least-squares criterion. This method is derived from the Taylor series expansion of E(w) around the minimizing vector w_{opt}, given by

E(w) \approx E(w_{opt}) + (w - w_{opt})^T \Big[ \frac{\partial E(w)}{\partial w} \Big]_{w = w_{opt}} + \frac{1}{2} (w - w_{opt})^T H (w - w_{opt}),   (2.5)

where H is the Hessian matrix of E(w), H_{ij} = \big[ \frac{\partial^2}{\partial w_i \partial w_j} E(w) \big]_{w = w_{opt}}. (Note that for simplicity, we may refer to the weights with a single subscript only.) Given an initial starting value for w, one can iterate with the Widrow-Hoff gradient-descent method (Widrow & Hoff, 1960),

w \leftarrow w - \eta \frac{\partial E}{\partial w},   (2.6)
where \eta is a predetermined or adaptive learning rate. This method is also known as backpropagation. The iterations continue until a stopping criterion is satisfied (e.g., maximum number of iterations, mean square error target, change in mean squared error target). NN model simplification involves removing superfluous weights so as to achieve better generalization (predictive ability). White's (1989) development of statistical inference applied to network architecture provides a means to test which weights are superfluous. Although he did not specifically reference Wald (1943), White's hypothesis testing is based on the chi-square distribution that Wald used for tests involving several parameters. Pruning (parameter selection) algorithms have been developed to trim large, quick-learning networks into sufficient minimal networks. Devoid of heuristic information to dictate the order of parameter elimination, pruning algorithms employ simple strategies to select parameters for elimination. The simplest strategy is the elimination of the smallest weights. However,
the significance of a weight to an NN is not the size of the weight but its contribution to the ability of the NN to match the data. This is analogous to multiple regression (see Neter, Kutner, Nachtsheim, & Wasserman, 1996), in which a model parameter remains in a model or is eliminated from the model based on the performance of the full model (which contains the parameter in question) compared to the reduced model (which is identical to the full model except that the parameter in question has been set to zero). Pruning decisions are often based on a salience measure: the effect the pruning decision has on the training error. Formally, the salience of a single weight w_p is defined as

L_p = \min_{\mathrm{s.t.}\, w_p = 0} [E(w)] - \min [E(w)].   (2.7)

When equation 2.5 has been minimized and the best estimate of the weights has been found, the gradient term is negligible. Substituting the Taylor series approximation into equation 2.7 gives

L_p \approx \frac{1}{2} \min_{\mathrm{s.t.}\, w_p = 0} [\Delta w^T H \Delta w],   (2.8)
where \Delta w = (w - w_{opt}). This yields L_p = \frac{1}{2} H_{pp} w_{opt,p}^2. Optimal Brain Damage (OBD; Le Cun, Denker, & Solla, 1990) uses this measure of salience to decide which weight should be eliminated. Specifically, the NN is trained, and each of the individual saliences is computed. The weight that has the smallest salience is set to zero, and the NN is retrained with that constraint. This process is repeated (retaining all constraints found through the pruning process) until a stopping criterion is met. The OBD procedure offers variations to this algorithm, including the deletion of several low-saliency parameters at each iteration, but the protocol is not well defined from a statistical perspective. The assumption that the low-salience parameters are irrelevant is consistent with White's hypotheses, which in turn are consistent with the Wald test. Optimal Brain Surgeon (OBS; Hassibi & Stork, 1993) is a refinement of OBD, allowing for a full Hessian matrix H. Retraining is computationally burdensome for OBD, whereas OBS prunes weights using a constrained optimization approach and avoids retraining. If one sets w_p = 0 for a given p, then the optimum \Delta w that satisfies equation 2.8 is given by
\Delta w = -w_{opt,p} \frac{1}{[H^{-1}]_{pp}} H^{-1} e_p,   (2.9)
where e_p is the unit vector in the weight space corresponding to weight w_p. The corresponding salience is then

L_p = \frac{1}{2} \frac{w_{opt,p}^2}{[H^{-1}]_{pp}}.   (2.10)
OBS uses this measure of salience to make the single-step pruning decision. However, in contrast with OBD, the weight update formula of OBS allows for quick pruning because retraining is not necessary. One simply uses the weight update formula and recalculates H^{-1} in each pruning cycle. The single-step pruning based on the lowest-saliency parameter is consistent with White's hypotheses of irrelevance. Hassibi and Stork (1993) suggest that several weights may be deleted simultaneously, and Hassibi, Stork, and Wolff (1994) consider several extensions of OBS. However, neither Hassibi and Stork (1993) nor Hassibi et al. (1994) provide a detailed weight update formula for the deletion of more than one weight.

2.2 Tukey Multiple Comparison Procedure. To prune the nonlinear neural net model of equations 2.1 through 2.3, we now consider the linear statistical model for the saliences
L_{ij} = \mu_i + e_{ij}, \quad i = 1, \ldots, P, \; j = 1, \ldots, n_i,   (2.11)
where the random variable L_i represents the observed salience of the ith weight, the \mu_i are constant parameters representing the expected value of the salience, and the e_{ij} are random error terms, or noise, in the jth observation of each salience. There are P weights and n_i samples of the salience L_i. The Tukey multiple comparison procedure considers the set of all pairwise comparisons under the null hypothesis that the saliences are the same and the alternative hypothesis that the saliences are different: H_0: \mu_i = \mu_j versus H_a: \mu_i \neq \mu_j, where i, j = 1, 2, \ldots, P with i \neq j. This procedure relies on the studentized range distribution, where the studentized range is the ratio of two dispersion measures: the range and the standard deviation. This statistic allows us to quantify the relative differences in salience means. Let L_1, \ldots, L_P be independently distributed N(\mu, \sigma^2), and suppose the variance estimate s^2 is based on \nu degrees of freedom independent of the L_i such that s^2 / \sigma^2 is chi-square distributed, \chi^2(\nu). Then the studentized range with parameters (P, \nu) is defined as
q = \frac{\max_i L_i - \min_i L_i}{s} \sim q(P; \nu).   (2.12)
For multiple observations of the salience of a connection, a salience mean \bar{L}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} L_{ij} is calculated, and the standard deviation of the means'
difference for the pair (i, j) is denoted by s(\bar{L}_i - \bar{L}_j). If the e_{ij} are independent and identically distributed (i.i.d.) N(0, \sigma^2), then an appropriate test statistic for each pairwise comparison of salience means is

q^* = \frac{\sqrt{2}(\bar{L}_i - \bar{L}_j)}{s(\bar{L}_i - \bar{L}_j)} = \frac{\sqrt{2}(\bar{L}_i - \bar{L}_j)}{\sqrt{s^2 \big( \frac{1}{n_i} + \frac{1}{n_j} \big)}},   (2.13)

where s^2 = \frac{1}{\nu} \sum_{i=1}^{P} \sum_{j=1}^{n_i} (L_{ij} - \bar{L}_i)^2. For nonequal sample sizes, \nu = \sum_{i=1}^{P} n_i - P; for equal sample sizes n_i = B, it follows that \nu = BP - P. In equation 2.13, the (\bar{L}_i - \bar{L}_j) term does not cancel because it represents an arithmetic quantity in the numerator, while in the denominator it is the argument of the function s. Using the test statistic q^* of equation 2.13, the null hypothesis H_0: \mu_i = \mu_j is rejected when |q^*| > q_{1-\alpha}(P; \nu), and we conclude that the salience means are different, with a family confidence coefficient of at least 1 - \alpha for all pairwise comparisons. The significance level \alpha is sometimes considered the level of risk of rejecting a null hypothesis when in fact it is true. The family significance level for the pairwise comparison test is exactly \alpha when all the sample sizes n_i are equal. When the sample sizes are not equal, the procedure is sometimes called the Tukey-Kramer procedure (Tukey, 1991; Neter et al., 1996). In this case, the family significance level is less than \alpha, and all decisions are more conservative. The above test assumes that the mean estimates \bar{L}_i are independent. In order to account for correlated sample means (as will be observed in section 2.5), Kramer (1957) and Brown (1984) developed multiple range tests and confidence intervals for dependent means. The estimated variances and covariances of the means, denoted c_{ii} s^2 and c_{ij} s^2, are used to adjust the critical value for the test in a way that provides simultaneous confidence intervals of the form
(2.14)
where Z2ij D ( cii s2 ¡ 2cij s2 C cjj s2 ) / 2, have condence level at least 1 ¡ a. A method for grouping “equivalent” means is to determine which group of ordered sample means differs by no more than Zij q1¡a ( PI º) . Mean estimates of each connection’s salience are not readily available from network training. Section 2.3 introduces a procedure to facilitate their estimation. 2.3 Bootstrap Training. Estimates for the salience means (LN i ), variances
(c_{ii} s^2), and covariances (c_{ij} s^2), which appear in equation 2.14, require multiple observations of each connection's salience. Efron (1982) has shown that the bootstrap algorithm finds a distribution that can be used as an approximation of the sampling distribution for a given statistic. Let U(X) be a random variable of interest, such as a salience, where X = \{x_1, x_2, \ldots, x_N\} denotes the entire i.i.d. sample from distribution F, and N is the size of
the training set. On the basis of having observed X, we wish to estimate some aspect of U's sampling distribution, E_F \, g(U). A three-part algorithm is used:

1. Fit the empirical cumulative distribution function (CDF) of F, \hat{F}: mass 1/N at x_i, i = 1, 2, \ldots, N.

2. Draw a bootstrap sample from \hat{F}, X^* = \{x_1^*, x_2^*, \ldots, x_N^*\} \overset{iid}{\sim} \hat{F}, and calculate g(U^*) = g(U(X^*)).

3. Independently repeat step 2 a large number of times (B), obtaining bootstrap replications g(U^{*1}), g(U^{*2}), \ldots, g(U^{*B}), and calculate \bar{g}(U^*) = \sum_{i=1}^{B} g(U^{*i}) / B, the bootstrap estimate of E_F \, g(U).

Efron's bootstrap algorithm has been applied to neural networks in the past to build confidence intervals for the network output. In this article, we use the algorithm to construct mean estimates of the network's saliences. Our goal is to provide a method for grouping the least significant salience with similarly small saliences and eliminating their corresponding weights from the model. The determination of "similar" and "small" requires an estimate of the means and covariances of the sampling distribution of the saliences. Using the bootstrap procedure to estimate the means and covariances of saliences is what we term bootstrap training. To accelerate this procedure, it was discovered that a savings in the number of calculations needed for bootstrap training could be achieved by first training the NN with the original training set and using the estimated weight parameters as the initial estimate for training on each bootstrap sample. This also serves another purpose in that it helps to overcome problems with parameter identifiability in the NN. A neural network such as the one presented in Figure 1 suffers from the property that there are multiple "equivalent" networks that will give the same results: any permutation of the hidden-layer nodes provides an equivalent network. The focus of this research is not so much to estimate the specific parameters in the NN, but rather to estimate the complexity of the network needed to provide good generalization. By using the initial training set to find a good starting point for each bootstrap training iteration, we keep the location of the weight parameters and their saliences in the same neighborhood for each iteration, avoiding the identifiability problem.

2.4 Stopping Criteria.
Because both the NN training process and the pruning process are iterative, it is important to determine an appropriate
stopping criterion. For the training portion, a maximum number of iterations or a minimum change in E(w) is used. For pruning, one could consider several rules for stopping, including pruning to a predetermined complexity, a maximum number of pruning iterations, or an error rule. The error rule can search for the minimum validation error of the test set (Hassibi et al., 1994), or it can stop pruning when the additional error is no longer much smaller than the present total error (Hassibi & Stork, 1993). In this article, we consider an error rule that stops the pruning process when an "F-statistic" criterion has been met. The F-statistic is frequently used to test the appropriateness of model complexity. For instance, Neter et al. (1996, pp. 120, 230, 268, 353, 911) use the F-statistic to test the appropriateness of model complexity for linear models. In these tests, SS_r is the SSE for the reduced model, SS_f is the SSE for the full model, P_f is the number of parameters in the full model, and k is the number of parameters being considered for elimination from the full model. The number of degrees of freedom for the full model is df_f = N - P_f, and the number of degrees of freedom for the reduced model is df_r = N - (P_f - k). To compare the reduced model with the full model, the null hypothesis is that the reduced model explains the data as well as the full model. The F-statistic for this test is

F^* = \frac{(SS_r - SS_f) / (df_r - df_f)}{SS_f / df_f} = \frac{(SS_r - SS_f) / k}{SS_f / (N - P_f)}.   (2.15)

A large difference in the error sums of squares results in F^* being greater than a critical value; in this case, we reject the null hypothesis and conclude that the reduced model does not explain the data as well as the full model does. Otherwise, for small F^*, we accept the reduced model as being statistically equivalent to the full model. This is analogous to White's (1989) development of the irrelevant input and hidden unit hypotheses and his use of the Wald test to dismiss weights close to zero. In our pruning, the goal is to obtain the simplest possible NN, so we prune until the pruned model is no different from using the mean of the data as the model. At that point, we reject the most recent pruning, revert to the previous NN, and stop. For this approach, the full model is the most recent pruned model and the reduced model is the model that uses the mean of the data. Consequently, the F-statistic of equation 2.15 uses SS_f = SSE, the usual error sum of squares of equation 2.4, and SS_r = SSTO, the usual total sum of squares about the mean, calculated similarly to equation 2.4 except that the mean, \bar{y}, replaces the prediction, \hat{y}_i. The degrees of freedom are df_f = N - P_f and df_r = N - 1. Therefore, the F-statistic is

F^* = \frac{(SSTO - SSE) / (P_f - 1)}{SSE / (N - P_f)} = \frac{SSR / (P_f - 1)}{SSE / (N - P_f)} = \frac{MSR}{MSE},   (2.16)
where SSR is the regression sum of squares. Consequently, the F-statistic reduces to the usual ratio of the regression mean square and the mean square error. For the range of degrees of freedom we are using, the critical value F = 3 generally satisfies a significance level of \alpha = 0.05. If F^* < 3, we stop pruning and revert to the previous NN because our latest pruned model is no different from the model that uses the mean of the data to predict. This method provides a consistent error rule to stop pruning, as opposed to an unstructured rule of thumb, and its implementation supports the comparison of different pruning algorithms.

2.5 Monk's Problems. The Monk's problems (Thrun, Wnek, Michalski, Mitchell, & Cheng, 1991) are excellent benchmark Boolean tests for pruning algorithms. Articles that have used the Monk's problems as benchmarks include Hadzikadic and Bohren (1997), Hassibi and Stork (1993), Hassibi et al. (1994), Le Cun et al. (1990), Setiono (1997), van de Laar and Heskes (1999), and Young and Downs (1998). The Monk's problems concern the classification of robots exhibiting six different features. Each feature has two, three, or four possibilities, as follows:
Head shape ∈ {round, square, octagon}
Body shape ∈ {round, square, octagon}
Is smiling ∈ {yes, no}
Is holding ∈ {sword, balloon, flag}
Coat color ∈ {red, yellow, green, blue}
With necktie ∈ {yes, no}
These features can describe 432 different robots. The three problems involve training a neural network to discern certain distinctions of the robots:

1. Head shape and body shape are the same, or coat color is red.
2. Two of the six features have the first value: round, yes, sword, red.
3. Holding a sword and coat color is red, or coat color is not blue and body shape is not octagon.

For the Monk's problems, the neural network has 17 binary inputs, one for each of the feature options. The choice of the number of hidden nodes in the network is arbitrary. One expects a network with a larger number of hidden nodes to learn faster, but also that the subsequent model would be a poor generalizer. The pruned network illustrated in Figure 1, with three hidden nodes and one output, y, began with 58 connections (weights) between the input layer, the three hidden nodes, and the one output node. The 58 connections include four biases, which can be interpreted as connections to
Donald E. Duckro, Dennis W. Quinn, and Samuel J. Gardner III
Figure 1: Neural network for Monk’s problem 1 after three-fourths of the connection weights have been pruned. Before pruning, there were 58 connections in the network. The pruned network contains only 15 of the original 58 weights. The network nodes use logistic sigmoid activation functions.
an external unit always in the +1 state, as shown in Figure 1. This network estimates the posterior probability that a robot meets the criterion for a given Monk's problem, and we classify it as such if y > 0.5. To illustrate the bootstrap training procedure, we train this network to perform the first Monk's problem. A random sample of size 124 was selected from the complete set of robots as a training set. The NN was trained using these 124 points, and initial weight estimates were obtained. Bootstrap training was performed with B = 128 bootstrap replications (resampling from the 124 points). Figure 2 is the histogram of the logarithm of the saliences for the bias (intercept) weight in the output layer. The bootstrap salience distribution is unimodal with some skewness, similar to the universal shape of all saliences for the input-to-hidden weights of a two-layer network found by Gorodkin, Hansen, Lautrup, and Solla (1997). The experimental results presented in Figure 2 compare well with the theoretical predictions presented in Figures 1 and 2 of Gorodkin et al. (1997). In the case of bootstrap training, a new resampled training set creates each sample for the salience distribution, whereas Gorodkin et al. collect observations at each step of one training set. The distribution of bootstrap resamplings of each salience is also consistent with the near-normality expectation of Efron (1982) and supports the use of multiple comparison procedures. Figure 3 was produced using 128 bootstrap replications of two of the saliences of a fully connected network. In this figure, we see that the bootstrap saliences of $w_{11}^{(1)}$ and $w_{21}^{(1)}$ appear to be correlated. By the end of pruning, only one of the weights corresponding to these correlated saliences survives,
Neural Network Pruning
Figure 2: Bootstrap distribution of 128 samplings of the logarithm of the salience $L_0^{(2)}$ for $w_0^{(2)}$, the bias to the output node. The bar chart counts the number of samplings of the logarithm of the bias salience that occur over the range of values observed.
as Figure 1 indicates. The information conveyed by these weights remains important, but it is redundant to have both connections corresponding to these weights in the final network. In general, salience correlations can appear between connections sharing common nodes and inputs, as well as between connections further apart, such as $w_{22}^{(1)}$ and $w_{63}^{(1)}$, or $w_{51}^{(1)}$ and $w_{21}^{(2)}$ (the connection of input 5 to hidden node 1 correlated with the connection of hidden node 2 and the output). The correlation is supported by the observation of Hassibi and Stork (1993) that every problem they investigated had a strongly nondiagonal Hessian. In determining which weight saliences are close to zero and close to each other, one must also take their correlation into account.

3 Tukey-Kramer Multiple Comparison Pruning
The common theme among many pruning algorithms is to remove weights that have the “least salience.” This is done in an iterative algorithm, and as seen in OBS and OBD, weights may be removed one at a time until the
Figure 3: A scatter plot of the 128 samplings of the bootstrap saliences $L_{11}^{(1)}$ and $L_{21}^{(1)}$ for the weights between input 1 and hidden node 1, $w_{11}^{(1)}$, and between input 2 and hidden node 1, $w_{21}^{(1)}$. The plot suggests a correlation between saliences that needs to be accounted for when making a statistical inference.
stopping criterion for pruning has been reached. It is typical that OBS and OBD will remove several weights. This introduces two questions: How does one determine whether a small salience is truly "small"? And is there a way to justify the elimination of several weights at one time, saving pruning iterations? We believe we answer both of these questions in this article. We propose that by using bootstrap training on an NN, we can estimate the distribution of the weight saliences and, in batch, eliminate those weights whose saliences are simultaneously found to be close to zero and not differing significantly. Bootstrap training allows us to compute mean salience estimates,

$$\bar{L}_p = \frac{1}{B}\sum_{b=1}^{B} L_p^{*(b)},$$

and salience covariance estimates,

$$\widehat{\operatorname{cov}}[\bar{L}_p, \bar{L}_q] = \frac{1}{B-1}\sum_{b=1}^{B}\left(L_p^{*(b)} - \bar{L}_p\right)\left(L_q^{*(b)} - \bar{L}_q\right).$$
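Given the bootstrap saliences stored row-wise (one row per replication, one column per weight), these estimates are straightforward to compute. A sketch with numpy, not the authors' code, using simulated stand-in saliences:

```python
import numpy as np

# Sketch: given a B x P matrix of bootstrap saliences L[b, p] (one row per
# bootstrap replication, one column per weight), compute the mean salience
# estimates and their covariance estimates as in the formulas above.
rng = np.random.default_rng(0)
B, P = 128, 5                       # replications, number of weights
L = rng.lognormal(size=(B, P))      # stand-in for bootstrap saliences

L_bar = L.mean(axis=0)                           # mean saliences, shape (P,)
cov = (L - L_bar).T @ (L - L_bar) / (B - 1)      # salience covariances, (P, P)

# np.cov implements the same (B - 1)-normalized estimator:
assert np.allclose(cov, np.cov(L, rowvar=False))
```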
One can then use these estimates in the Tukey-Kramer multiple comparison procedure to form confidence intervals for the mean saliences, grouping together the smallest saliences that simultaneously do not differ significantly from each other. The weights corresponding to these saliences are then set to zero, and the NN is retrained with that constraint. We may continue this process to determine once again which weights in the new model affect the error insignificantly. The advantage over other pruning methods is that this procedure decides to eliminate several weights at one time, perhaps reducing the number of pruning cycles required. In addition, the incorporation of statistical inference preserves any weight of small magnitude whose salience differs significantly from the least salience or from zero. As we stated in section 2, this is consistent with the way parameters are removed from multiple regression models. We term this new approach Tukey-Kramer multiple comparison pruning (TKMCP).

4 Examples

4.1 Monk's Problem 1. Referring to the first Monk's problem in section 2.5, ideally we would like to select a network structure and complexity adequate for discerning the desirable characteristics: in this case, determining whether a robot has the same head and body shape or is wearing a red coat. To illustrate the capability of TKMCP, a random training sample of 124 robots is drawn from the 432 legal examples. The remaining robots are used to estimate prediction accuracy. Bootstrap training is performed with B = 32 on this random training sample, using a bootstrap sample size of 124. Means and covariances of the weight saliences are estimated from the bootstrap distribution. After 18 pruning cycles of TKMCP, the number of neural network connections was reduced to roughly a quarter, from 58 to 15 weights. In six of the cycles, TKMCP found an insignificant difference at α = 0.05 between the least mean salience (the weight subject to pruning) and up to seven other saliences in the cycle.
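One such batch-elimination step, grouping the least mean salience with those not significantly larger, might be sketched as follows. The critical value q and the handling of the difference variance are simplified assumptions for illustration, not the paper's exact Tukey-Kramer procedure:

```python
import numpy as np

def tkmcp_prune_set(L_bar, cov, q_crit):
    """Simplified sketch of one TKMCP batch-elimination step.

    L_bar  : (P,) mean saliences
    cov    : (P, P) salience covariances (from bootstrap replications)
    q_crit : critical value for the studentized range, treated here as a
             user-supplied constant (an assumption for this sketch).
    Returns indices of weights whose mean salience does not differ
    significantly from the least mean salience -> prune them together.
    """
    p0 = int(np.argmin(L_bar))          # weight with the least salience
    prune = [p0]
    for p in range(len(L_bar)):
        if p == p0:
            continue
        # variance of the difference of two correlated means
        var_diff = cov[p, p] + cov[p0, p0] - 2.0 * cov[p, p0]
        halfwidth = (q_crit / np.sqrt(2.0)) * np.sqrt(max(var_diff, 0.0))
        if L_bar[p] - L_bar[p0] <= halfwidth:   # not significantly larger
            prune.append(p)
    return sorted(prune)

L_bar = np.array([0.10, 0.12, 5.0, 4.0])
cov = 0.01 * np.eye(4)
print(tkmcp_prune_set(L_bar, cov, q_crit=3.0))  # -> [0, 1]
```

Accounting for the off-diagonal covariance terms is what lets the procedure respect the salience correlations discussed in section 2.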
By zeroing the weights with corresponding "small" saliences (not significantly different), 25 pruning cycles were avoided compared to the OBS procedure as presented by Hassibi and Stork (1993) and Hassibi et al. (1994). Looking at Figure 1, note that the pruned network retains connections related directly to the definition of Monk's problem 1. Head and torso shape and the color red are the only inputs relevant to deciding whether the robot answers the description of a robot with the same head as torso or wearing the color red. The results nearly mirror the network pruned by OBS (Hassibi & Stork, 1993). Retraining the network with the 43 weights eliminated produces a classification error of 8.33% on the remaining robots, similar to what we achieved using OBS without multiple comparison
Table 1: Average Number of Weights Eliminated in the First Iteration of MCP for Three Different-Size Neural Networks.

Inputs   Hidden   Weights   Ave Pruned
17       3        58        5.45
17       5        96        7.79
17       10       191       60.74
pruning. While TKMCP was as successful as OBS in trimming the network, the computational time required to collect enough bootstrap resamples for mean weight saliency estimation was much larger for MCP than for OBS. The floating-point operations (flops) for MCP were 40.1 gigaflops compared to 2.75 gigaflops for OBS. This order-of-magnitude difference may in part be attributed to the relatively small size of the initial network. (The computational burden for OBS comes primarily from the need to compute the inverse Hessian matrix, which has squared dimension equal to the number of weights in the model.) Also, reducing the number of bootstrap resampling iterations would benefit TKMCP's computation time, at the expense of perhaps slightly poorer pruning decisions at each pruning cycle. To demonstrate the capability of TKMCP in a more favorable arrangement, consider an initial network of 10 hidden nodes. With 191 weights in the network, the training set must be expanded from 124 exemplars to something greater than the number of weights. In this case, 248 exemplars were drawn for the training set. After 32 repetitions of bootstrap training (i.e., B = 32), TKMCP performed a single subset elimination on those weights whose saliences were not significantly different from the least salience at α = 0.05. In 119 trials, TKMCP eliminated 30 to 90 weights, with approximately 61 weights on average in a single iteration. Table 1 shows a comparison of the average number of weights pruned on the first iteration of TKMCP for various network sizes. In competition with TKMCP (with the 10 hidden node network), OBS was permitted to eliminate the same number of weights for each of the 119 trials. In this situation, TKMCP equaled OBS accuracy in all but two of the trials. Meanwhile, TKMCP required 13% fewer flops than OBS (an average of 113 gigaflops for TKMCP versus 131 gigaflops for OBS).

4.2 Simulation Study. A simulation comparing OBS and TKMCP on networks of varying size and complexity was conducted.
Feedforward NNs with varying numbers of inputs and hidden nodes and one output are examined. Because OBS is well grounded in NN pruning theory, the goal of this study is to demonstrate that TKMCP produces networks that are at least as good as those of OBS in terms of predictive ability and that TKMCP prunes to a suitable complexity compared to OBS.
Table 2: F Test Results for the Input (IN), Hidden Node (HN), and Algorithm (A) Factors and Interactions for the Pruning Ratio Response.

Factor     F*       p
IN         174.2    < .0001
HN         6.6      .0122
HN*IN      11.5     .0011
A          8.2      .0056
IN*A       2.3      .1585
HN*A       0.3      .5656
IN*HN*A    0.7      .3942
Notes: F* for each of the factors is compared with the critical value F = 7 for the family significance level of α = 0.05 ($\alpha_i$ = .01). The p-value indicates the probability of being wrong in accepting the effect as relevant to the model. The main effects of input and algorithm and the interaction of hidden node with input meet the criterion as factors to model the pruning ratio.
The design for this study is a 2³ factorial design, with the design variables being the number of inputs (IN = 4 or 16), the number of hidden nodes (HN = 3 or 9), and the pruning algorithm (A = TKMCP or OBS). To create a truth model from which objective comparisons can be made, a real feedforward NN with 4 or 16 inputs, 3 or 9 hidden nodes, and 1 output is constructed. The weights $w = (w_{ij}^{(m)})$ are generated by sampling independently from a N(0, 1) distribution, and then 25% of the input-to-hidden weights are randomly set to zero. This creates a sparser network, creating a need for pruning in the estimation step. A training sample $T = \{t_1, \ldots, t_N\}$ is then generated by independently sampling an input $x_i$ from a d-variate (d = 4 or 16) normal distribution, $x_i \sim N_d(0, I)$, and calculating $t_i = (y(x_i, w), x_i)$. Training samples of size N = 256 are used. We then train an NN model with the same number of inputs and hidden nodes using T as the training set and prune with either OBS or TKMCP to come up with a reduced network. The measure of quality used in this study is the ratio of the total number of weights in the true model to the number of weights found by the pruning algorithm (PR). The measure of predictive ability of the final network is the MSE for the test data set. Upon completion of an analysis of variance (ANOVA) for the pruning ratio response, we compile the results of the F test in Table 2. The response is influenced by the network architecture as well as by the OBS and TKMCP algorithms. The initial size of the network (e.g., the HN*IN interaction) is a significant factor affecting algorithm performance. Both main effects (IN and HN) in Figures 4 and 5 would remain in the model to support the interaction effect. Table 2 clearly shows an effect of the algorithms on the pruning ratio response. Figures 6 and 7 show a weak interaction between the algorithms and the other two factors, which is consistent with the ANOVA.
Figure 4: A plot of the pruning ratio (PR) for the two levels of model input (IN). The straight line connects the mean response of PR for IN levels 4 and 16. IN at levels 4 and 16 has an effect on PR response, as is apparent from the changing mean. The plot contains the data from both levels of hidden units and the OBS and TKMCP algorithms.
Figure 5: The effect of hidden node (HN) at levels 3 and 9 on the pruning ratio response is less pronounced than that of IN in Figure 4. This is consistent with its lower F* and higher p-value in Table 2. The plot contains data from both levels of input and the OBS and TKMCP algorithms.
Figure 6: The interaction of inputs at levels 4 and 16 with the OBS and TKMCP algorithms is shown by the slightly nonparallel track of the respective pruning ratio means at all levels of hidden nodes. The F* and p-value do not support the inclusion of this interaction.
Figure 7: The interaction of hidden nodes at levels 3 and 9 with the OBS and TKMCP algorithms is shown by the slightly nonparallel track of the respective pruning ratio means at all levels of inputs. The F* and p-value do not support the inclusion of this interaction.
Table 3: Comparison of TKMCP and OBS Performance.

                         Average Training MSE     Average Test MSE
Hidden Nodes   Inputs    TKMCP      OBS           TKMCP      OBS
3              4         .0027      .0020         .0086      .0094
3              16        .0026      .0024         .0077      .0079
9              4         .0028      .0018         .0095      .0109
9              16        .0295      .0215         .0716      .0787
Notes: With 10 trials for each network architecture in the experiment, the averages of the mean square error of the final pruned networks suggest an insignificant difference between OBS and TKMCP in final network performance. OBS appears to perform better on the training set but worse on the test set of exemplars, which suggests that TKMCP improves the pruned network's ability to generalize to future examples.
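The truth-model and training-sample construction described in section 4.2 can be sketched as follows. The network shapes follow the paper's design, but the logistic hidden activation is taken from the paper's Figure 1 description, and the helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_truth_model(d, h):
    """Sample a 'true' d-input, h-hidden, 1-output network: N(0, 1) weights,
    with 25% of the input-to-hidden weights randomly set to zero (sparsity)."""
    W1 = rng.standard_normal((h, d))        # input-to-hidden weights
    W1[rng.random((h, d)) < 0.25] = 0.0     # zero ~25% of them
    w2 = rng.standard_normal(h)             # hidden-to-output weights
    return W1, w2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, w2):
    """y(x, w) with logistic sigmoid hidden units (as in the paper's Figure 1)."""
    return sigmoid(W1 @ x) @ w2

d, h, N = 4, 3, 256
W1, w2 = make_truth_model(d, h)
X = rng.standard_normal((N, d))                  # x_i ~ N_d(0, I)
T = [(forward(x, W1, w2), x) for x in X]         # t_i = (y(x_i, w), x_i)
assert len(T) == N
```

A model network of the same architecture would then be trained on T and pruned with OBS or TKMCP, and the pruning ratio and test MSE recorded.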
The aggregate of Figures 6 and 7 would place the mean response of the pruning ratio slightly higher for TKMCP, suggesting somewhat more aggressive pruning, particularly for the larger networks. In terms of drawing a statistical conclusion from the associated ANOVA tables, the results for the pruning ratio meet the traditional homoscedasticity requirement, that is, the desirable condition of equal error variances for all cases. In contrast, if the error had varied as a function of the connection within the network, then we would have had heteroscedasticity, and the homoscedasticity requirement would have been violated. Final network performance can be measured by a number of error rules. In keeping with the theme of section 2.4, we consider the MSE of the final network using the training and test data sets. The training data should measure the network's ability to fit the present data, and the test data should reflect its ability to generalize to future data. The results in Table 3 indicate that we have achieved our goal: although TKMCP produces worse performance than OBS on the training set, it produces a better generalizing network than OBS.

5 Conclusion
Neural networks are popular and useful flexible nonlinear models used extensively in engineering applications. Statistical understanding of these models is growing, and we believe this article sheds light on the issues and identifies potential directions for future work in model selection and pruning procedures and practice. Tukey-Kramer multiple comparison pruning has been shown to be a viable pruning algorithm when compared to Optimal Brain Surgeon, adding distinctive statistical inference for multiple weight elimination and the stopping criterion. Future work in this area should continue to refine algorithms with statistical advantages and
computational efficiency, incorporating appropriate distributions of the entire collection of neural network weights and saliences to augment the use of mean estimates.

Acknowledgments
D. D. currently works for the Defense Resources Management Institute, Naval Postgraduate School, Monterey, CA, and S. G. now works for the Air Force Research Laboratory Human Effectiveness Division, Wright-Patterson AFB, OH. The views expressed in this article are those of the authors and do not reflect the official policy or position of the U.S. Air Force, Department of Defense, or U.S. government. We thank the referees for making valuable suggestions that improved the article. We also thank Tom Reid and Tony White for their inputs during the final revisions of this article.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Brown, L. D. (1984). A note on the Tukey-Kramer procedure for pairwise comparisons of correlated means. In T. J. Santner & A. C. Tamhane (Eds.), Design of experiments: Ranking and selection (pp. 1–6). New York: Marcel Dekker.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics.
Gorodkin, J., Hansen, L. K., Lautrup, B., & Solla, S. A. (1997). Universal distribution of saliencies for pruning in layered neural networks. International Journal of Neural Systems, 8, 489–498.
Hadzikadic, M., & Bohren, B. F. (1997). Learning to predict: INC2.5. IEEE Transactions on Knowledge and Data Engineering, 9(1), 168–173.
Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann.
Hassibi, B., Stork, D. G., & Wolff, G. (1994). Optimal Brain Surgeon: Extensions and performance comparisons. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 263–270). San Mateo, CA: Morgan Kaufmann.
Kramer, C. Y. (1957). Extension of multiple range tests to group correlated adjusted means. Biometrics, 13, 13–18.
Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal Brain Damage. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear statistical models. Chicago: Irwin.
Setiono, R. (1997). A penalty function approach for pruning feedforward neural networks. Neural Computation, 9, 185–204.
Thrun, S. B., Wnek, J., Michalski, R. S., Mitchell, T., & Cheng, J. (1991). The Monk's problems: A performance comparison of different algorithms (Tech. Rep. No. CMU-CS-91-197). Pittsburgh: Carnegie Mellon University, Computer Science Department.
Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.
van de Laar, P., & Heskes, T. (1999). Pruning using parameter and neuronal metrics. Neural Computation, 11, 977–993.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54, 426–482.
White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425–464.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention Record, 4, 94–104.
Young, S., & Downs, T. (1998). CARVE: A constructive algorithm for real-valued examples. IEEE Transactions on Neural Networks, 9(6), 1180–1190.

Received August 23, 2000; accepted September 6, 2001.
LETTER
Communicated by Andrew Brown
Products of Gaussians and Probabilistic Minor Component Analysis C. K. I. Williams
[email protected] Division of Informatics, University of Edinburgh, Edinburgh EH1 2QL, U.K. F. V. Agakov
[email protected]
System Engineering Research Group, Chair of Manufacturing Technology, Friedrich-Alexander-University Erlangen-Nuremberg, 91058 Erlangen, Germany
Recently, Hinton introduced the products of experts architecture for density estimation, where individual expert probabilities are multiplied and renormalized. We consider products of gaussian “pancakes” equally elongated in all directions except one and prove that the maximum likelihood solution for the model gives rise to a minor component analysis solution. We also discuss the covariance structure of sums and products of gaussian pancakes or one-factor probabilistic principal component analysis models. 1 Introduction
Recently, Hinton (1999) introduced a new product of experts (PoE) model for combining expert probabilities, where the probability $p(x \mid \theta)$ is computed as a normalized multiplication of the probabilities $p_i(x \mid \theta_i)$ of individual experts with parameters $\theta_i$:

$$p(x \mid \theta) = \frac{\prod_{i=1}^{m} p_i(x \mid \theta_i)}{\int \prod_{i=1}^{m} p_i(x' \mid \theta_i)\, dx'}. \tag{1.1}$$
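For gaussian experts, the renormalized product in equation 1.1 is again gaussian, with precisions that add. A small numerical sketch of two 1-D experts (ours, not from the paper):

```python
import numpy as np

# Product of two 1-D gaussian experts: precisions add, and the renormalized
# product is again gaussian (a numerical sketch of equation 1.1).
mu = np.array([0.0, 2.0])        # expert means
var = np.array([1.0, 0.5])       # expert variances

prec = 1.0 / var
var_prod = 1.0 / prec.sum()                   # combined variance
mu_prod = var_prod * (prec * mu).sum()        # combined mean

# Check against a brute-force renormalization on a fine grid.
x = np.linspace(-10.0, 10.0, 20001)
unnorm = (np.exp(-0.5 * (x - mu[0])**2 / var[0])
          * np.exp(-0.5 * (x - mu[1])**2 / var[1]))
w = unnorm / unnorm.sum()                     # discrete renormalization
mean_num = (x * w).sum()
var_num = ((x - mean_num)**2 * w).sum()
assert abs(mean_num - mu_prod) < 1e-6 and abs(var_num - var_prod) < 1e-6
```

Note that the product's variance (1/3 here) is smaller than either expert's, which is the sense in which each expert sharpens the others.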
Under the PoE paradigm, experts may assign high probabilities to irrelevant regions of the data space as long as these probabilities are small under the other experts (Hinton, 1999). Here, we consider a product of constrained gaussians in the d-dimensional data space, whose probability contours resemble d-dimensional gaussian "pancakes" (GP), contracted in one dimension and equally elongated in the other d − 1 dimensions. We refer to the model as a product of gaussian pancakes (PoGP) and show that it provides a probabilistic technique for minor component analysis (MCA). This MCA solution contrasts with

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1169–1182 (2002)
probabilistic PCA (PPCA) (Tipping & Bishop, 1999; see also Roweis, 1998), which is a probabilistic method for PCA based on a factor analysis model with isotropic noise. The key difference is that PPCA is a model for the covariance matrix of the data, while PoGP is a model for the inverse covariance matrix of the data. In section 2, we discuss products of gaussians. In section 3, we consider products of gaussian pancakes, discuss analytic solutions for the maximum likelihood estimators of the parameters, and provide experimental evidence that the analytic solution is correct. Section 4 discusses the covariance structure of sums and products of gaussian pancakes and one-factor PPCA models.

2 Products of Gaussians
If each expert in equation 1.1 is a gaussian, $p_i(x \mid \theta_i) \sim N(\mu_i, C_i)$, the resulting distribution of the product may be expressed as

$$p(x \mid \theta) \propto \exp\left\{-\frac{1}{2}\sum_{i=1}^{m}(x - \mu_i)^T C_i^{-1}(x - \mu_i)\right\}.$$

By completing the quadratic term in the exponent, it may easily be shown that $p(x \mid \theta) \sim N(\mu_\Sigma, C_\Sigma)$, where

$$C_\Sigma^{-1} = \sum_{i=1}^{m} C_i^{-1} \in \mathbb{R}^{d \times d}, \qquad \mu_\Sigma = C_\Sigma \sum_{i=1}^{m} C_i^{-1}\mu_i \in \mathbb{R}^{d}. \tag{2.1}$$
To simplify the following derivations, we will assume that $p_i(x \mid \theta_i) \sim N(0, C_i)$ and thus that $p(x \mid \theta) \sim N(0, C_\Sigma)$; the case $\mu_\Sigma \neq 0$ can be obtained by translation of the coordinate system. The log likelihood for independently and identically distributed (i.i.d.) data under the PoG is then expressed as

$$L(C_\Sigma) = -\frac{N}{2} d \ln 2\pi + \frac{N}{2} \ln\left|C_\Sigma^{-1}\right| - \frac{N}{2}\operatorname{tr}\left[C_\Sigma^{-1} S\right]. \tag{2.2}$$
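Equation 2.1 can be verified numerically. A sketch with two random 2-D experts (all values illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two gaussian experts in d = 2 with random SPD covariances and random means.
d, m = 2, 2
Cs, mus = [], []
for _ in range(m):
    A = rng.standard_normal((d, d))
    Cs.append(A @ A.T + d * np.eye(d))      # symmetric positive definite
    mus.append(rng.standard_normal(d))

# Equation 2.1: the precision of the product is the sum of expert precisions,
# and the product mean is the precision-weighted combination of expert means.
C_sum_inv = sum(np.linalg.inv(C) for C in Cs)
C_sum = np.linalg.inv(C_sum_inv)
mu_sum = C_sum @ sum(np.linalg.inv(C) @ mu for C, mu in zip(Cs, mus))

# The product's exponent, -(1/2) sum_i (x - mu_i)^T C_i^{-1} (x - mu_i), must
# peak at mu_sum: its gradient there, sum_i C_i^{-1}(mu_sum - mu_i), is zero.
grad = sum(np.linalg.inv(C) @ (mu_sum - mu) for C, mu in zip(Cs, mus))
assert np.allclose(grad, 0.0, atol=1e-10)
```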
Here N is the number of sample points, and S is the sample covariance matrix with the assumed zero mean, that is,

$$S = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T, \tag{2.3}$$

where $x_i$ denotes the ith data point.

3 Products of Gaussian Pancakes

In this section, we describe the covariance structure of a GP expert and of a product of GP experts and discuss the maximum likelihood (ML) solution for the parameters of the PoGP model.
Figure 1: Probability contour of a gaussian pancake in R³.

3.1 Covariance Structure of a GP Expert. Consider a d-dimensional gaussian whose probability contours are contracted in direction $\hat{w}$ and equally elongated in directions $v_1, \ldots, v_{d-1}$ (see Figure 1). Its inverse covariance may be written as

$$C^{-1} = \sum_{i=1}^{d-1} v_i v_i^T \beta_0 + \hat{w}\hat{w}^T \beta_{\hat{w}} \in \mathbb{R}^{d \times d}, \tag{3.1}$$

where $v_1, \ldots, v_{d-1}, \hat{w}$ form a d × d matrix of normalized eigenvectors of the covariance C. $\beta_0 = \sigma_0^{-2}$ and $\beta_{\hat{w}} = \sigma_{\hat{w}}^{-2}$ define the inverse variances in the directions of elongation and contraction, respectively, so that $\sigma_0^2 \geq \sigma_{\hat{w}}^2$. Expression 3.1 can be rewritten in a more compact form as

$$C^{-1} = \beta_0 I_d + (\beta_{\hat{w}} - \beta_0)\hat{w}\hat{w}^T = \beta_0 I_d + ww^T, \tag{3.2}$$

where $w = \hat{w}\sqrt{\beta_{\hat{w}} - \beta_0}$ and $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix. Notice that according to the constraint considerations $\beta_0 < \beta_{\hat{w}}$, so all elements of w are real valued. We see that the covariance C of the data of a GP expert is uniquely determined by the weight vector w, collinear with the direction of contraction,
and the variance in the direction of elongation, $\sigma_0^2 = \beta_0^{-1}$. We can further notice the similarity of equation 3.2 to the expression for the covariance of the data of a one-factor probabilistic principal component analysis model, $C = \sigma^2 I_d + ww^T$ (Tipping & Bishop, 1999), where $\sigma^2$ is the variance of the factor-independent spherical gaussian noise. The only difference is that for the constrained gaussian model it is the inverse covariance matrix, rather than the covariance matrix, that has the structure of a rank-1 update to a multiple of $I_d$.

3.2 Covariance of the PoGP Model. We now consider a product of m GP experts, each contracted in a single dimension. We refer to the model as a (1, m) PoGP, where 1 represents the number of directions of contraction of each expert. We also assume that all experts have identical means. From equations 2.1 and 3.1, the inverse covariance of the resulting (1, m) PoGP model is expressed as

$$C_\Sigma^{-1} = \sum_{i=1}^{m} C_i^{-1} = \beta_\Sigma I_d + WW^T \in \mathbb{R}^{d \times d}, \tag{3.3}$$

where the columns of $W \in \mathbb{R}^{d \times m}$ correspond to the weight vectors of the m PoGP experts, and $\beta_\Sigma = \sum_{i=1}^{m} \beta_0^{(i)} > 0$. Comparing equation 3.3 with the m-factor PPCA, we can make the conjecture that, in contrast with the PPCA model, where the ML weights correspond to principal components of the data covariance (Tipping & Bishop, 1999), the weights W of the PoGP model define a projection onto the m minor eigenvectors of the sample covariance in the visible d-dimensional space, while the distortion term $\beta_\Sigma I_d$ explains larger variations.¹ The proof of this conjecture is discussed in the appendix. For the covariance described by equation 3.3, we find that

$$p(x) \propto \exp\left\{-\frac{1}{2}\, x^T \left(\beta_\Sigma I_d + \sum_{i=1}^{m} a_i \hat{w}_i \hat{w}_i^T\right) x\right\}, \tag{3.4}$$
where $\hat{w}_i$ is a unit vector in the direction of $w_i$ and $a_i = |w_i|^2$. This distribution can be given a maximum entropy interpretation. From Cover and Thomas (1991, equation 11.4), the maximum entropy distribution obeying the constraints $E[f_i(x)] = r_i$, $i = 1, \ldots, c$, is of the form $p(x) \propto \exp\{-\sum_{i=1}^{c} \lambda_i f_i(x)\}$. Hence, we see that equation 3.4 can be interpreted as a maximum entropy distribution with constraints on $E[(\hat{w}_i^T x)^2]$, $i = 1, \ldots, m$, and on $E[x^T x]$.

¹ Because equation 3.3 has the form of a factor analysis decomposition, but for the inverse covariance matrix, we sometimes refer to PoGP as the rotcaf model.
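The rank-m structure of equation 3.3 is easy to set up numerically. A sketch (ours, with illustrative numbers): the model variance is smallest along the contracted directions (the columns of W) and equals $1/\beta_\Sigma$ in every direction orthogonal to all of them.

```python
import numpy as np

rng = np.random.default_rng(3)

# PoGP inverse covariance, equation 3.3: C_S^{-1} = beta_S I_d + W W^T,
# where each column of W is one expert's (scaled) direction of contraction.
d, m = 5, 2
beta_S = 1.5
W = rng.standard_normal((d, m))

C_inv = beta_S * np.eye(d) + W @ W.T
C = np.linalg.inv(C_inv)

# In any direction u orthogonal to the span of W, W^T u = 0, so the model
# variance u^T C u collapses to the isotropic value 1 / beta_S.
u = rng.standard_normal(d)
u -= W @ np.linalg.lstsq(W, u, rcond=None)[0]   # project u off the span of W
u /= np.linalg.norm(u)
assert np.isclose(u @ C @ u, 1.0 / beta_S)
```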
3.3 Maximum-Likelihood Solution for PoGP. In the appendix, it is shown that the likelihood 2.2 is maximized when

$$W_{ML} = U_m\left(\Lambda^{-1} - \beta_\Sigma^{ML} I_m\right)^{1/2} R^T, \qquad \beta_\Sigma^{ML} = \frac{d - m}{\sum_{i=m+1}^{d} \lambda_i}, \tag{3.5}$$

where $U_m$ is the d × m matrix of the m minor components of the sample covariance S, $\Lambda$ is the m × m matrix of the corresponding eigenvalues, and R is an arbitrary m × m orthogonal rotation matrix. As in PPCA (Tipping & Bishop, 1999), the distortion term accounts for the variance in the directions lying outside the space spanned by $W_{ML}$. Thus, the maximum likelihood solution for the weights of the (1, m) PoGP model corresponds to m scaled and rotated minor eigenvectors of the sample covariance S and leads to a probabilistic model of minor component analysis. As in the PPCA model, the number of experts m is assumed to be lower than the dimension of the data space d.

3.4 Experimental Confirmation. In order to confirm the analytic results, we have performed experiments comparing the derived ML parameters of the PoGP model with the convergence points of scaled conjugate gradient (SCG) optimization (Bishop, 1995; Møller, 1993) performed on the log likelihood 2.2. We considered different data sets in data spaces of dimension d, varying from 2 to 25, with m = 1, . . . , d − 1 constrained gaussian experts. For each choice of d and m, we looked at three types of sample covariance matrices, resulting in different types of solutions for the weights $W \in \mathbb{R}^{d \times m}$. In the first case, S was set to have s ≤ d − m identical largest eigenvalues, and we expected all expert weights to be retained in W. In the second case, S was set up in such a way that d − m < s < d, so that the variance in some directions could be explained by the noise term and some columns of W could be equal to zero. Finally, we considered the degenerate case where the sample covariance was a multiple of the identity matrix $I_d$, and we expected W = 0 (see equations 3.5 and 3.2). For all of the considered cases, we performed 30 runs of scaled conjugate gradient optimization of the likelihood, started at different points of the parameter space $\{W, \beta_\Sigma\}$. To ensure that $\beta_\Sigma$ was nonnegative, we parameterized it as $\beta_\Sigma = e^x$, $x \in \mathbb{R}$.
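The maximum likelihood solution 3.5 can also be checked directly against a sample covariance. A numerical sketch (ours): the fitted PoGP model reproduces the m smallest sample eigenvalues exactly and replaces each of the d − m larger ones with their average, $1/\beta_\Sigma^{ML}$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Check equation 3.5: fit a (1, m) PoGP to a sample covariance S using the
# m minor eigenvectors U_m, their eigenvalues Lambda, and
# beta_ML = (d - m) / (sum of the remaining, larger eigenvalues).
d, m, N = 6, 2, 5000
X = rng.standard_normal((N, d)) * np.array([3.0, 2.5, 2.0, 1.5, 0.3, 0.2])
S = X.T @ X / N

lam, U = np.linalg.eigh(S)            # eigenvalues in ascending order
U_m, lam_m = U[:, :m], lam[:m]        # the m minor components
beta_ML = (d - m) / lam[m:].sum()
W_ML = U_m @ np.diag(np.sqrt(1.0 / lam_m - beta_ML))   # take R = I_m

C_inv = beta_ML * np.eye(d) + W_ML @ W_ML.T            # equation 3.3
C = np.linalg.inv(C_inv)

# Model covariance: minor directions keep their sample variances lam_m;
# all other directions get the common value 1 / beta_ML (their average).
model_eigs = np.linalg.eigvalsh(C)
assert np.allclose(model_eigs[:m], lam_m)
assert np.allclose(model_eigs[m:], 1.0 / beta_ML)
```

Note that the square root in $W_{ML}$ requires $1/\lambda_i \geq \beta_\Sigma^{ML}$ for the minor eigenvalues, which holds here because they are well separated from the rest.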
For each data set, the SCG algorithm converged to approximately the same value of the log likelihood (with the largest observed standard deviation of the likelihood ~ $10^{-12}$). The largest observed absolute error between the analytic parameters $\{W^{(A)}, \beta_\Sigma^{(A)}\}$ and the SCG parameters $\{W^{(SCG)}, \beta_\Sigma^{(SCG)}\}$ satisfied

$$\max_{i,j}\left[\,\left|W^{(A)} - W^{(SCG)} R\right|\,\right]_{ij} \lesssim 10^{-8}, \qquad \left|\beta_\Sigma^{(A)} - \beta_\Sigma^{(SCG)}\right| \lesssim 10^{-8},$$
where $R \in \mathbb{R}^{m \times m}$ is a rotation matrix. Moreover, each numeric solution resulted in the expected type of $W^{(SCG)}$ for each type of sample covariance. The experiments therefore confirm that the method of scaled conjugate gradients converges to a point in the parameter space that yields the same value of the log likelihood as the analytically obtained solution (up to the expected arbitrary rotation factor for the weights).

3.5 Intuitive Interpretation of the PoGP Model. An intuitive interpretation of the PoGP model is as follows: each gaussian pancake imposes an approximate linear constraint in x space, namely that x should lie close to a particular hyperplane. The conjunction of these constraints is given by the product of the gaussian pancakes. If m ≪ d, it makes sense to define the resulting gaussian distribution in terms of the constraints. However, if there are many constraints (m > d/2), it can be more efficient to describe the directions of large variability using a PPCA model rather than the directions of small variability using a PoGP model. This issue is discussed by Xu, Krzyzak, and Oja (1991) in what they call the dual subspace pattern recognition method, where both PCA and MCA models are used (although their work does not use explicit probabilistic models such as PPCA and PoGP).

4 Related Covariance Structures
We have discussed the covariance structure of a PoGP model. For real-valued densities, an alternative method for combining $m$ experts is to sum the random vectors produced by each expert, $x = \sum_{i=1}^{m} x_i$; let us call this a sum of experts (SoE) model. (Note that the SoE model refers to the sum of the random vectors, while the PoE model refers to the product of their densities. The distribution of the sum of random variables is obtained by the convolution of their densities.) For zero-mean gaussians, the covariance of the SoE is simply the sum of the individual covariances, in contrast to the PoE, where the overall inverse covariance is the sum of the individual inverse covariances. Hence, we see that the PPCA model is in fact an SoE model, where each expert is a one-factor PPCA model. This leads us to ask what the covariance structures are for a product of one-factor PPCA models and a sum of gaussian pancakes. These questions are discussed below.

4.1 Covariance of the PoPPCA Model. We consider a product of $m$ one-factor PPCA models, denoted as $(1, m)$ PoPPCA. Geometrically, the probability contours of a one-factor PPCA model in $\mathbb{R}^d$ are $d$-dimensional hyperellipsoids elongated in one dimension and contracted in the other $d - 1$ directions. The covariance matrix of 1-PPCA is expressed as

$$C = ww^T + \sigma^2 I \in \mathbb{R}^{d \times d}, \qquad (4.1)$$
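The contrast between the two combination rules can be checked numerically. The following sketch is illustrative only (not from the article; it assumes numpy and arbitrary random covariances): it builds $m$ zero-mean gaussian experts and verifies that covariances add for an SoE while precisions add for a PoE.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 4

# One random symmetric positive-definite covariance per zero-mean expert.
Cs = []
for _ in range(m):
    A = rng.normal(size=(d, d))
    Cs.append(A @ A.T + d * np.eye(d))

# Sum of experts (SoE): x = x_1 + ... + x_m, so the covariances add.
C_soe = sum(Cs)

# Product of experts (PoE): the densities multiply, so the inverse
# covariances (precisions) add instead.
C_poe = np.linalg.inv(sum(np.linalg.inv(C) for C in Cs))

# Adding an expert to a PoE can only shrink the covariance (one more
# constraint), while adding one to an SoE can only grow it.
assert np.all(np.linalg.eigvalsh(Cs[0] - C_poe) >= -1e-10)
assert np.all(np.linalg.eigvalsh(C_soe - Cs[0]) >= -1e-10)
```

The two final assertions express the Loewner-order consequence of the two rules: the PoE covariance is dominated by every individual expert's covariance, and the SoE covariance dominates it.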
Products of Gaussians and Probabilistic Minor Component Analysis
1175
where weight vector $w \in \mathbb{R}^d$ defines the direction of elongation and $\sigma^2$ is the variance in the direction of contraction. Its inverse covariance is

$$C^{-1} = \beta I_d - \frac{\beta^2 ww^T}{1 + w^T w \beta} = \beta I_d - \beta\gamma\, ww^T, \qquad (4.2)$$
where $\beta = \sigma^{-2}$ and $\gamma = \beta / (1 + \|w\|^2 \beta)$. $\beta$ and $\gamma$ are the inverse variances in the directions of contraction and elongation, respectively. Plugging equation 4.2 into 2.1, we obtain

$$C_\Sigma^{-1} = \sum_{i=1}^{m} C_i^{-1} = \beta_\Sigma I_d - WBW^T, \qquad B = \mathrm{diag}(\beta_1\gamma_1, \beta_2\gamma_2, \ldots, \beta_m\gamma_m), \qquad (4.3)$$
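Equation 4.3 can be verified directly: the closed form must equal the sum of the individual expert precisions given by equation 4.2. A small numpy sketch (illustrative only; dimensions, seeds, and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 4, 3
ws = [rng.normal(size=d) for _ in range(m)]      # one weight vector per expert
sig2 = rng.uniform(0.5, 2.0, size=m)             # per-expert noise variances
betas = 1.0 / sig2

# Precision of each one-factor PPCA expert via equation 4.2.
precs, gammas = [], []
for w, b in zip(ws, betas):
    g = b / (1.0 + (w @ w) * b)
    gammas.append(g)
    precs.append(b * np.eye(d) - b * g * np.outer(w, w))

# Equation 4.3: the PoPPCA precision in closed form.
W = np.stack(ws, axis=1)                         # d x m weight matrix
B = np.diag([b * g for b, g in zip(betas, gammas)])
C_inv = betas.sum() * np.eye(d) - W @ B @ W.T

assert np.allclose(C_inv, sum(precs))
# Sanity check: equation 4.2 really inverts the 1-PPCA covariance w w^T + sigma^2 I.
for w, s2, P in zip(ws, sig2, precs):
    assert np.allclose(P, np.linalg.inv(np.outer(w, w) + s2 * np.eye(d)))
```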
where $W = [w^{(1)}, \ldots, w^{(m)}] \in \mathbb{R}^{d \times m}$ is the weight matrix with columns corresponding to the weights of individual experts, $\beta_\Sigma$ is the sum of the inverse noise variances for all experts, and $B$ may be thought of as a squared scaling factor on the columns of $W$. We can rewrite expression 4.3 as

$$C_\Sigma^{-1} = \beta_\Sigma I_d - \tilde{W}\tilde{W}^T \in \mathbb{R}^{d \times d}, \qquad (4.4)$$
where $\tilde{W} = WB^{1/2}$ are implicitly scaled weights.

4.2 Maximum-Likelihood Solution for PoPPCA. Our studies show that the ML solution for the $(1, m)$ PoPPCA model can be equivalent to the ML solution for $m$-factor PPCA, but only when rather restrictive conditions apply. Consider the simplified case when the noise variance $\beta^{-1}$ is the same for each one-factor PPCA model. Then equation 4.3 reduces to

$$C_\Sigma^{-1} = m\beta I_d - \beta W \Gamma W^T, \qquad \Gamma = \mathrm{diag}(\gamma_1, \gamma_2, \ldots, \gamma_m). \qquad (4.5)$$
An $m$-factor PPCA model has covariance $\sigma^2 I_d + WW^T$ and thus, by the Woodbury formula (see, e.g., Press, Teukolsky, Vetterling, & Flannery, 1992), it has inverse covariance $\beta I_d - \beta W(\sigma^2 I_m + W^T W)^{-1} W^T$. The maximum likelihood solution for an $m$-PPCA model is similar to equation 3.5, that is, $\hat{W} = U(\Lambda - \sigma^2 I_m)^{1/2} R^T$, but now $\Lambda$ is a diagonal matrix of the $m$ principal eigenvalues, and $U$ is a matrix of the corresponding eigenvectors. If we choose $R^T = I_m$, then the columns of $\hat{W}$ are orthogonal, and the inverse covariance of the maximum likelihood $m$-PPCA model has the form $\beta I_d - \beta \hat{W} \Gamma \hat{W}^T$. Comparing this to equation 4.5 (and setting $W = \hat{W}$), we see that the difference is that the first term of the right-hand side of equation 4.5 is $\beta m I_d$, while for $m$-PPCA it is $\beta I_d$.
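The Woodbury identity used above is easy to confirm numerically; a sketch (not from the article, with arbitrary sizes and seed):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, sigma2 = 6, 2, 0.5
beta = 1.0 / sigma2
W = rng.normal(size=(d, m))

C = sigma2 * np.eye(d) + W @ W.T        # m-factor PPCA covariance
# Woodbury: invert the d x d matrix through an m x m inverse only.
C_inv = beta * np.eye(d) - beta * W @ np.linalg.inv(sigma2 * np.eye(m) + W.T @ W) @ W.T

assert np.allclose(C_inv, np.linalg.inv(C))
```

The point of the identity is the cost: for $m \ll d$, only an $m \times m$ matrix is ever inverted explicitly.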
1176
C. K. I. Williams and F. V. Agakov
In section 3.4 and appendix C.3 of Agakov (2000), it is shown that (for $m \ge 2$) we obtain the $m$-factor PPCA solution when $\bar\lambda \le \lambda_i$

1) PoPPCA [$(m, k)$ PoPPCA] by taking the product of a single $m$-PPCA expert with $k - 1$ "large" spherical gaussians, with all experts having identical means.²

4.3 Sums of Gaussian Pancakes. A gaussian pancake has $C^{-1} = \beta_0 I_d + ww^T$. By analogous arguments to those above, we find that a sum of gaussian pancakes has covariance $C_{SGP} = \sigma^2 I_d - \tilde{W}\tilde{W}^T$, where $\tilde{W}$ is a rescaled $W$ with a somewhat different definition than in equation 4.4. Analogous to section 4.2, we would expect that an ML solution for the sum of gaussian pancakes would give an MCA solution when the sample covariance is near spherical, but that the solution would not have a simple relationship to the eigendecomposition of $S$ when this is not the case.

5 Discussion
We have considered the product of $m$ gaussian pancakes. The analytic derivations for the optimal model parameters have confirmed the initial hypothesis that the PoGP gives rise to minor component analysis. We have also confirmed by experiment that the analytic solutions correspond to the ones obtained by applying optimization methods to the log likelihood.

² Recently Marks and Movellan have shown that the product of $m$ one-factor factor analysis models is equivalent to an $m$-factor factor analyzer (T. K. Marks and J. R. Movellan, Diffusion Networks, Products of Experts, and Factor Analysis. Proceedings of the 3rd International Conference on Independent Component Analysis and Blind Source Separation, 2001).
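That experimental confirmation can be reproduced in a few lines. The sketch below is illustrative only (our code, not the authors'): it builds the analytic PoGP solution from a sample covariance, assuming the ML formulas of equations A.13 and A.19 with $R = I$ and $r = m$, and checks that the model behaves as an MCA model.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, N = 5, 2, 1000

# Sample covariance of some correlated zero-mean data.
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))
S = X.T @ X / N

lam, U = np.linalg.eigh(S)                   # eigenvalues in ascending order
beta_sigma = (d - m) / lam[m:].sum()         # ML noise precision (equation A.19, r = m)
ell = np.sqrt(1.0 / lam[:m] - beta_sigma)    # expert scaling factors (equation A.12)
W = U[:, :m] * ell                           # retained (minor) weights, R = I

# PoGP model covariance: the pancakes' inverse covariances add up.
C = np.linalg.inv(beta_sigma * np.eye(d) + W @ W.T)

# The model reproduces the sample covariance exactly along the m minor directions ...
for i in range(m):
    assert np.allclose(C @ U[:, i], lam[i] * U[:, i])
# ... and explains each discarded direction by the average discarded variance.
assert np.allclose(C @ U[:, -1], lam[m:].mean() * U[:, -1])
```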
We have shown that $(1, m)$ PoGP can be viewed as a probabilistic MCA model. MCA can be used, for example, for signal extraction in digital signal processing (Oja, 1992), dimensionality reduction, and data visualization. Extraction of the minor component is also used in the Pisarenko harmonic decomposition method for detecting sinusoids in white noise (see, e.g., Proakis & Manolakis, 1992, p. 911). Formulating minor component analysis as a probabilistic model simplifies comparison of the technique with other dimensionality-reduction procedures, permits extending MCA to a mixture of MCA models (which will be modeled as a mixture of products of gaussian pancakes), permits using PoGP in classification tasks (if each PoGP model defines a class-conditional density), and leads to a number of other advantages over nonprobabilistic MCA models (see the discussion of advantages of PPCA over PCA in Tipping & Bishop, 1999).

In section 4, we discussed the relationship of sums and products of gaussian models and showed that the sum of gaussian pancakes and $(1, m)$ PoPPCA models have rather low representational power compared to PPCA or PoGP. In this article, we have considered sums and products of gaussians with gaussian pancake or PPCA structure. It is possible to apply the sum and product operations to models with other covariance structures; for example, Williams and Felderhof (2001) consider sums and products of tree-structured gaussians and study their relationship to AR and MA processes.

Appendix A: ML Solutions for PoGP
In this appendix, we derive the ML equations 3.5. In section A.1, we derive the conditions that must hold for a stationary point of the likelihood. In section A.2, these stationarity conditions are expressed in terms of the eigenvectors of $S$. In section A.3, it is shown that to maximize the likelihood, the $m$ minor eigenvectors must be retained. The derivations in this appendix are the analogs for the PoGP model of those in appendix A of Tipping and Bishop (1999) for the PPCA model.

A.1 Derivation of ML Equations. From equation 2.2, we can specify ML conditions for a parameter $\theta$ of a gaussian:

$$\frac{\partial L}{\partial \theta} = \frac{N}{2}\left[\frac{\partial \ln |C_\Sigma^{-1}|}{\partial \theta} - \frac{\partial\, \mathrm{tr}(C_\Sigma^{-1} S)}{\partial \theta}\right] = 0. \qquad (A.1)$$
In our case $\theta = \{W, \xi\}$, where $W$ is the weight matrix and $\xi$ is the log of the distortion term $\beta_\Sigma$. Using the rules for matrix differentiation given in Magnus and Neudecker (1999), we obtain

$$\frac{\partial L}{\partial W} = N(C_\Sigma W - SW), \qquad \frac{\partial L}{\partial \xi} = \beta_\Sigma N\, \mathrm{tr}(C_\Sigma - S). \qquad (A.2)$$
(See Williams & Agakov, 2001, for further details.) Dropping the constant factors, we obtain ML equations for $W$ and $\xi$:

$$C_\Sigma W - SW = 0, \qquad \mathrm{tr}(C_\Sigma - S) = 0. \qquad (A.3)$$
Note that in order to find the maximum-likelihood solution for the weight matrix $W$ and term $\beta_\Sigma$, both equations in equation A.3 should hold simultaneously.

A.2 Stationary Points of the Log Likelihood. There are three classes of solutions to the equation $C_\Sigma W - SW = 0$:

$$W = 0; \qquad S = C_\Sigma; \qquad SW = C_\Sigma W, \quad W \neq 0, \quad S \neq C_\Sigma. \qquad (A.4)$$

The first of these, $W = 0$, is uninteresting and corresponds to a minimum of the log likelihood. In the second case, the model covariance is equal to the sample covariance, and $C_\Sigma$ is exact. In the third case, while $SW = C_\Sigma W$, $S \neq C_\Sigma$, and the model covariance is said to be approximate. By analogy with Tipping and Bishop (1999), we will consider the singular value decomposition of the weight matrix and establish dependencies between left singular vectors of $W = ULR^T$ and eigenvectors of the sample covariance $S$. $U = [u_1, u_2, \ldots, u_m] \in \mathbb{R}^{d \times m}$ is a matrix of left singular vectors of $W$ with columns constituting an orthonormal basis, $L = \mathrm{diag}(l_1, l_2, \ldots, l_m) \in \mathbb{R}^{m \times m}$ is a diagonal matrix of the singular values of $W$, and $R \in \mathbb{R}^{m \times m}$ defines an arbitrary rigid rotation of $W$.

A.2.1 Exact Model Covariance. Considering nonsingularity of $S$ and $C_\Sigma$, we find $C_\Sigma = S \Rightarrow C_\Sigma^{-1} = S^{-1}$. As $C_\Sigma^{-1} = \beta_\Sigma I_d + WW^T$, we obtain
$$WW^T = UL^2U^T = S^{-1} - \beta_\Sigma I_d. \qquad (A.5)$$
This has the known solution $W = U_m(\Lambda^{-1} - \beta_\Sigma I_m)^{1/2} R^T$, where $U_m$ is the matrix of the $m$ eigenvectors of $S$ with the smallest eigenvalues and $\Lambda$ is the corresponding diagonal matrix of the eigenvalues. The sample covariance must be such that the largest $d - m$ eigenvalues are all equal to $\beta_\Sigma^{-1}$; the other $m$ eigenvalues are matched explicitly.

A.2.2 Approximate Model Covariance. Applying the matrix inversion lemma (see, e.g., Press et al., 1992) to expression 3.3 for the inverse covariance of the PoGP gives

$$C_\Sigma = \beta_\Sigma^{-1}\left(I_d - W(\beta_\Sigma I_m + W^T W)^{-1} W^T\right) \in \mathbb{R}^{d \times d}. \qquad (A.6)$$
Solution of equation A.3 for the approximate model covariance results in

$$C_\Sigma W = SW \;\Rightarrow\; C_\Sigma UL = SUL. \qquad (A.7)$$
Substituting equation A.6 for $C_\Sigma$, we obtain

$$C_\Sigma UL = (\beta_\Sigma^{-1} I_d - \beta_\Sigma^{-1} W(\beta_\Sigma I_m + W^T W)^{-1} W^T)\, UL$$
$$= (\beta_\Sigma^{-1} I_d - \beta_\Sigma^{-1} ULR^T(\beta_\Sigma I_m + RL^2R^T)^{-1} RLU^T)\, UL$$
$$= U(\beta_\Sigma^{-1} I_m - \beta_\Sigma^{-1} LR^T(\beta_\Sigma I_m + RL^2R^T)^{-1} RL)\, L$$
$$= U(\beta_\Sigma^{-1} I_m - \beta_\Sigma^{-1}(\beta_\Sigma L^{-2} + I_m)^{-1})\, L. \qquad (A.8)$$
Thus,

$$SUL = U(\beta_\Sigma^{-1} I_m - \beta_\Sigma^{-1}(\beta_\Sigma L^{-2} + I_m)^{-1})\, L. \qquad (A.9)$$
Notice that the term $\beta_\Sigma^{-1} I_m - \beta_\Sigma^{-1}(\beta_\Sigma L^{-2} + I_m)^{-1}$ on the right-hand side of equation A.9 is a diagonal matrix (i.e., just a scaling factor of $U$). Equation A.9 defines the matrix form of the eigenvector equation, with both sides postmultiplied by the diagonal matrix $L$. If $l_i \neq 0$, then equation A.9 implies that

$$C_\Sigma u_i = S u_i = \lambda_i u_i, \qquad (A.10)$$

$$\lambda_i = \beta_\Sigma^{-1}\left(1 - (\beta_\Sigma l_i^{-2} + 1)^{-1}\right), \qquad (A.11)$$
where $u_i$ is an eigenvector of $S$ and $\lambda_i$ is its corresponding eigenvalue. The scaling factor $l_i$ of the $i$th retained expert can be expressed as

$$l_i = (\lambda_i^{-1} - \beta_\Sigma)^{1/2} \;\Rightarrow\; \lambda_i \le \frac{1}{\beta_\Sigma}. \qquad (A.12)$$
This result resembles the solution for the scaling factors in the PPCA case (cf. Tipping & Bishop, 1999). However, in contrast to the PPCA solution, where $l_i = (\lambda_i - \sigma^2)^{1/2}$, we notice that $\lambda_i$ and $\sigma^2$ are replaced by their inverses. Obviously, if $l_i = 0$, then $u_i$ is arbitrary. If $l_i = 0$, we say that the direction corresponding to $u_i$ is discarded, that is, the variance in that direction is explained merely by noise. Otherwise, we say that $u_i$ is retained. All potential solutions of $W$ may then be expressed as

$$W = U_m(D - \beta_\Sigma I_m)^{1/2} R^T, \qquad (A.13)$$

where $R \in \mathbb{R}^{m \times m}$ is an arbitrary rotation matrix, $U_m = [u_1\, u_2 \ldots u_m] \in \mathbb{R}^{d \times m}$ is a matrix whose columns correspond to $m$ eigenvectors of $S$, and $D = \mathrm{diag}(d_1, d_2, \ldots, d_m) \in \mathbb{R}^{m \times m}$ such that

$$d_i = \begin{cases} \lambda_i^{-1} & \text{if } u_i \text{ is retained;} \\ \beta_\Sigma & \text{if } u_i \text{ is discarded.} \end{cases} \qquad (A.14)$$
A.3 Properties of the Optimal Solution. Below we show that for the PoGP model, the discarded eigenvalues must be adjacent within the sorted spectrum of $S$ and that for the maximum likelihood solution, the smallest eigenvalues should be retained. Let $r \le m$ be the number of eigenvectors of $S$ retained in $W_{ML}$. Since the SVD representation of a matrix is invariant under simultaneous permutations in the order of the left and right singular vectors (e.g., Golub & Van Loan, 1997), we can assume without any loss of generality that the first eigenvectors $u_1, \ldots, u_r$ are retained, and the rest of the eigenvectors $u_{r+1}, \ldots, u_m$ are discarded. In order to investigate the nature of eigenvectors retained in $W_{ML}$, we will express the log likelihood, equation 2.2, of a gaussian through eigenvectors and eigenvalues of its covariance matrix. From the expression for the PoGP weights, A.13, and the form of the model's inverse covariance, 3.3, $C_\Sigma^{-1}$ can be expressed as follows:

$$C_\Sigma^{-1} = \sum_{i=1}^{r} u_i u_i^T \lambda_i^{-1} + \sum_{i=r+1}^{d} u_i u_i^T \beta_\Sigma. \qquad (A.15)$$
Since determinants and traces can be expressed as products and summations of eigenvalues, respectively, we see that

$$\ln |C_\Sigma^{-1}| = \ln\left(\beta_\Sigma^{(d-r)} \prod_{i=1}^{r} \lambda_i^{-1}\right) = -\sum_{i=1}^{r} \ln(\lambda_i) + (d - r)\ln \beta_\Sigma, \qquad (A.16)$$

$$\mathrm{tr}(C_\Sigma^{-1} S) = r + \beta_\Sigma \sum_{i=r+1}^{d} \lambda_i. \qquad (A.17)$$
By substituting these expressions into the form of $L$ for a gaussian, equation 2.2, we get

$$L(W_{ML}) = -\frac{N}{2}\left[d\ln(2\pi) + \sum_{i=1}^{r} \ln(\lambda_i) - (d - r)\ln \beta_\Sigma + r + \beta_\Sigma \sum_{i=r+1}^{d} \lambda_i\right]. \qquad (A.18)$$
Differentiating equation A.18 with respect to $\beta_\Sigma$ and equating the result to zero results in

$$\beta_\Sigma^{ML} = \frac{d - r}{\sum_{i=r+1}^{d} \lambda_i}. \qquad (A.19)$$
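Equation A.19 can be confirmed by maximizing the log likelihood of equation A.18 over $\beta_\Sigma$ directly; a sketch with an arbitrary fixed eigenvalue spectrum (illustrative values of ours, not from the article):

```python
import numpy as np

d, r, N = 6, 2, 1.0
lam = np.array([0.2, 0.5, 1.0, 1.5, 2.0, 3.0])   # eigenvalues of S, ascending

def log_lik(beta):
    # Equation A.18, retaining the r smallest eigenvalues.
    return -N / 2 * (d * np.log(2 * np.pi) + np.log(lam[:r]).sum()
                     - (d - r) * np.log(beta) + r + beta * lam[r:].sum())

beta_ml = (d - r) / lam[r:].sum()                # equation A.19: 4 / 7.5
grid = np.linspace(0.01, 5.0, 2000)
best = grid[np.argmax([log_lik(b) for b in grid])]
assert abs(best - beta_ml) < 0.01                # grid maximizer agrees with A.19
```

The likelihood is concave in $\beta_\Sigma$ (its second derivative is $-\tfrac{N}{2}(d-r)/\beta_\Sigma^2 < 0$), so the grid maximum is the unique stationary point of equation A.19.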
We see that assuming nonzero noise, this solution makes sense only when $d > m \ge r$, that is, the dimension of the input space should exceed the number of experts. Then we obtain

$$L(W_{ML}, \beta_\Sigma^{ML}) = -\frac{N}{2}\left[\sum_{i=1}^{r} \ln(\lambda_i) + (d - r)\ln \frac{\sum_{i=r+1}^{d} \lambda_i}{d - r} + r + (d - r)\right] + c \qquad (A.20)$$

$$= -\frac{N}{2}\left[(d - r)\ln \frac{\sum_{j=r+1}^{d} \lambda_j}{d - r} - \sum_{j=r+1}^{d} \ln(\lambda_j)\right] + c', \qquad (A.21)$$
where $c, c'$ are constants, and we have used the fact that $\sum_{i=1}^{d} \ln(\lambda_i) = \ln |S|$ is a constant. Let $A$ denote the term in the square brackets of equation A.21. Clearly $L$ is maximized with respect to the $\lambda$ terms when $A$ is minimized. We will now investigate the value of $L(W_{ML}, \beta_\Sigma^{ML})$ as different eigenvalues are discarded or retained, so as to identify the maximum likelihood solution. The expression for $A$ is identical (up to unimportant scaling by $(d - r)$) with expression A.5 in Tipping and Bishop (1999). Thus, their conclusion that the retained eigenvalues should be adjacent within the spectrum of sorted eigenvalues of the sample covariance $S$ is also valid in our case.³ This result, together with the constraint on the retained eigenvalues, equation A.12, and the noise parameter, A.19, yields

$$\forall i = 1 \ldots r, \qquad \lambda_i \le \frac{\sum_{j=r+1}^{d} \lambda_j}{d - r}; \qquad (A.22)$$
i.e., only the smallest eigenvalues should be retained. It can be shown (see appendix A.3.2 in Williams & Agakov, 2001) that the log likelihood is maximized when $r = m$.

³ An alternative derivation of this result can be found in appendix B.4 of Agakov (2000).

Acknowledgments. Much of the work on this article was carried out as part of the M.Sc. project of F. A. at the Division of Informatics, University of Edinburgh. C. W. thanks Sam Roweis, Geoff Hinton, and Zoubin Ghahramani for helpful conversations on the rotcaf model during visits to the Gatsby Computational Neuroscience Unit. We also thank the two anonymous referees for their comments, which helped to improve the article. F. A. gratefully acknowledges the support of the Royal Dutch Shell
Group of Companies for his M.Sc. studies in Edinburgh through a Centenary Scholarship.

References

Agakov, F. (2000). Investigations of gaussian products-of-experts models. Unpublished master's thesis, University of Edinburgh. Available on-line: http://www.dai.ed.ac.uk/homes/felixa/all.ps.gz.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Golub, G. H., & Van Loan, C. F. (1997). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
Hinton, G. E. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN 99) (pp. 1–6).
Magnus, J. R., & Neudecker, H. (1999). Matrix differential calculus with applications in statistics and econometrics (2nd ed.). New York: Wiley.
Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525–533.
Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks, 5, 927–935.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Proakis, J. G., & Manolakis, D. G. (1992). Digital signal processing: Principles, algorithms and applications. New York: Macmillan.
Roweis, S. (1998). EM algorithms for PCA and SPCA. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 626–632). Cambridge, MA: MIT Press.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. J. Roy. Statistical Society B, 61(3), 611–622.
Williams, C. K. I., & Agakov, F. V. (2001). Products of gaussians and probabilistic minor components analysis (Tech. Rep. No. EDI-INF-RR-0043). Edinburgh: Division of Informatics, University of Edinburgh.
Available on-line: http://www.informatics.ed.ac.uk/publications/report/0043.html.
Williams, C. K. I., & Felderhof, S. N. (2001). Products and sums of tree-structured gaussian processes. In Proceedings of the ICSC Symposium on Soft Computing 2001 (SOCO 2001).
Xu, L., Krzyzak, A., & Oja, E. (1991). Neural nets for dual subspace pattern recognition method. International Journal of Neural Systems, 2(3), 169–184.

Received April 20, 2001; accepted September 17, 2001.
LETTER
Communicated by Steven Nowlan
Optimization of the Kernel Functions in a Probabilistic Neural Network Analyzing the Local Pattern Distribution

I. Galleske
gingo@zipi..upm.es
J. Castellanos
jcastellanos@.upm.es
Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Spain
This article proposes a procedure for the automatic determination of the elements of the covariance matrix of the gaussian kernel function of probabilistic neural networks. Two matrices, a rotation matrix and a matrix of variances, can be calculated by analyzing the local environment of each training pattern. Their combination forms the covariance matrix of each training pattern. This automation has two advantages: First, it frees the neural network designer from specifying the complete covariance matrix, and second, it results in a network with better generalization ability than the original model. A variation of the famous two-spiral problem and real-world examples from the UCI Machine Learning Repository show not only a classification rate better than that of the original probabilistic neural network but also that this model can outperform other well-known classification techniques.
1 Introduction
Applications of neural network theory can be found in many areas: pattern recognition, function approximation, control, classification, and others. This article focuses on the classification problem. The neural network, that is, the classification problem solver, tries to find out to which of several possible classes an unknown pattern should be assigned. Probabilistic neural networks (PNN), which are a network realization of the Bayes decision theory, are used for this purpose. This theory tries to separate the complete multidimensional input space into different regions, each belonging to one possible class. Classification then consists in deciding in which region a pattern is located. The probability density underlying each class can be found using kernel estimators (Parzen, 1962; Specht, 1990). These kernel estimators can be calculated in a one-pass learning process, resulting in a rapid training process. This PNN type has a second advantage over other types; a new training pattern can be added to an already trained network without

Neural Computation 14, 1183–1194 (2002) © 2002 Massachusetts Institute of Technology
retraining it from scratch. These advantages make this network type a good candidate for real-time problems (Chen & You, 1992). Specht (1988) proposes many different kernel types in his publications and uses the multidimensional gaussian functions as kernel estimators. These kernel estimators imply the construction of multidimensional gaussian functions. Specht states that the variance of the gaussian functions, called a smoothing parameter, does not play an important role (Specht, 1988). Other researchers say that the correct value is of importance; a wrong choice of this parameter could result in misclassification errors (Specht, 1991; Yang & Chen, 1998). This value is the only variable in the model of the PNN and therefore should not be guessed; a method to construct a network from the pattern set (Berthold & Diamond, 1998) and ways to calculate these matrices (Musavi, Chan, Hummels, & Kalantri, 1993; Galleske & Castellanos, 1997) have been developed. This article presents a new recursive algorithm to calculate this necessary covariance matrix of the kernel functions. Calculating the parameters needs additional computer power and results in a longer training process. Nevertheless, the learning process is still much faster than many other approaches to finding a solution to a classification problem. The disadvantage of the longer training time is well compensated in two ways. First, the developer of the probabilistic neural network is freed from guessing the covariance matrix and takes advantage of an automatic calculation. Second, and perhaps even more important, a better generalization ability, with better performance on patterns not seen before, is achieved. This article begins with an introduction to the probabilistic neural network, a network realization of the Bayes decision theory. This includes its architecture and the corresponding learning procedure. This information leads to an explanation of the new, modified version of the PNN.
We explain in detail how the two matrices, the rotation matrix and the matrix of variances, are calculated in various dimensions. Finally, we give the results of some experiments, which demonstrate the better generalization ability of the model we propose in this article.

2 Theoretical Foundations of Probabilistic Neural Networks
The general PNN consists of three layers of neurons (see Figure 1). The $n$ input units represent the $n$-dimensional problem space. The next layer of units (pattern layer) consists of $p$ units, $p$ being the number of training patterns of all classes. These two layers are completely connected, and the weights leading to unit $p_i$ are determined by its position in the $n$-dimensional input space. The one-pass learning of the PNN consists of setting the weights leading to a pattern unit to the coordinates of the pattern. The pattern units feed their output without exception only into that class unit of the next layer that represents the class they belong to. These connections
Figure 1: Architecture of the standard probabilistic neural network.
cannot be learned and have their weights fixed to the reciprocal of the number of units of the corresponding class. The example in Figure 1 shows a two-class classifier; if there were more classes, the network would have more units on the class layer. Among all units of the class layer, the one with the highest activation is chosen (in a pure connectionist model, a decision network can perform this task). The input, that is, the new pattern, is classified to the class that this unit represents. All output functions of the network are equal to the activation, $o(i) = a(i)$, for all units. The class units have the summation function $a(h) = \sum_{i=k}^{l} o(i)$, where the pattern units $k$ to $l$ belong to class $h$. The activation functions of the units in the pattern layer represent the most important concept of the PNN, because they finally define the complete density function of the underlying problem. Here, the use of the multidimensional gaussian function for each unit is legitimated, as proved in the fundamental article by Parzen (1962). Using this activation function for all pattern units, the complete PNN calculates for each class $h$ the following function,

$$a(h) = \sum_{i=k}^{l} o(i) = \sum_{i=k}^{l} \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(X - X_i)^T \Sigma_i^{-1} (X - X_i)\right), \qquad (2.1)$$
where patterns $k$ to $l$ belong to class $h$, $X$ is the pattern to classify, $\Sigma_i$ is the covariance matrix of pattern $i$, and $X_i$ is the position of training
pattern $i$. In other words, around each training pattern, a multidimensional gaussian function is located, and their sum adds up to an estimation of the probability density for a given classification problem. The general problem to solve is to find the best estimation of the gaussian matrix $\Sigma_i$ for all training patterns $i$. The original PNN uses the same matrix $\Sigma$ for all training patterns, leading to an architecture that is very fast to train. Although the training time of this architecture is short, it has the disadvantage that the generalization ability is not optimized. For enhancing the generalization ability, an algorithm is proposed here that calculates for each pattern $i$ its own matrix $\Sigma_i$, subdividing it into four matrices. In the following, $i$ will be omitted for notational simplicity. Generally, the multidimensional gaussian functions have a constant potential surface (CPS; points of the same probability) that is a hyperellipsoid. The form of this CPS is determined by the parameter $\Sigma$. The idea proposed in this article is to give the ellipsoid the freedom to rotate about any angle with the aim of making it as big as possible to achieve a better network generalization ability. To realize this generic rotation, the covariance matrix $\Sigma$ has to be subdivided into four matrices: $\Sigma = R^T S^{-1} S^{-1} R$. $S$ is a diagonal matrix, with its nonzero elements indicating the lengths of the principal axes of the ellipsoid. $R$ is a simple rotation matrix, which rotates the ellipsoid to the desired position. The next section explains in detail how the unknown rotation matrix $R$ and the matrix of the principal axes $S$ are calculated.
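As a concrete illustration of equation 2.1 and the layer structure above, here is a minimal PNN with a shared spherical covariance $\sigma^2 I$ (a simplified sketch of the standard model, not the adaptive per-pattern covariances this article proposes; function and variable names are ours):

```python
import numpy as np

def pnn_classify(x, patterns, labels, sigma2=0.1):
    """Standard PNN: one spherical gaussian kernel per training pattern
    (equation 2.1 with Sigma_i = sigma2 * I), averaged per class unit."""
    n = patterns.shape[1]
    norm = (2 * np.pi * sigma2) ** (n / 2)       # (2 pi)^{n/2} |Sigma|^{1/2}
    d2 = ((patterns - x) ** 2).sum(axis=1)       # squared distances to x
    kern = np.exp(-0.5 * d2 / sigma2) / norm     # pattern-unit activations
    classes = np.unique(labels)
    # Class units weight their pattern units by 1 / (number of units in class).
    scores = [kern[labels == c].mean() for c in classes]
    return classes[int(np.argmax(scores))]

# Toy two-class problem.
patterns = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
labels = np.array([0, 0, 1, 1])
assert pnn_classify(np.array([0.1, 0.0]), patterns, labels) == 0
assert pnn_classify(np.array([0.9, 1.0]), patterns, labels) == 1
```

The model of this article replaces the shared `sigma2 * I` with a per-pattern $\Sigma_i = R^T S^{-1} S^{-1} R$ estimated as described in the next section.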
3 Constructing the Covariance Matrix $\Sigma$
In the described model, the two matrices $R$ and $S$ for each training pattern of all classes have to be estimated from the training sample. For a classification problem in an $n$-dimensional space, the following variables have to be calculated: the $n$ variances $\sigma_1, \ldots, \sigma_n$, which are the constituents of the variance matrix $S$, and the rotation matrix $R$. We present a recursive formula for estimating the unknown parameters of this model in a general $n$-dimensional input space; first, $\sigma_n$ and the last row vector of $R$ have to be calculated. With the knowledge of $\sigma_n$, the calculation of $\sigma_{n-1}$, and with the knowledge of the last vector of $R$, the second-last row vector can be estimated, and so forth, until $\sigma_n, \sigma_{n-1}, \ldots, \sigma_2$ are known and $\sigma_1$ can be calculated, as well as the first row vector of $R$. The method for finding the unknown parameters is constructive and has to be executed for all training patterns of all classes. For the description of the algorithm, it is sufficient to make a conjunction of classes: class A contains all of the patterns under observation; all the other units of all other classes will be joined into a general class B. As a notational convenience, $X_i^A$ denotes in the following training pattern $X_i$, which is the $i$th training pattern of class A.
Figure 2: In $X_i^A$, a new coordinate system $(\vec{v}_1, \vec{v}_2, \vec{v}_3)$ is constructed, and $X_k^B$ is projected "elliptically" to P.
For each training pattern of class A, $X_i^A$, it is necessary to find the closest training pattern of the opposite class (denoted $X_j^B$). As a distance measure, there is no reason not to use simple Euclidean distance. The vector $\vec{v}_n$, which connects these training patterns, pointing from $X_i^A$ to $X_j^B$, provides one parameter, $\sigma_n$, which is the smallest of all principal axes. The variances of the training patterns of different classes should not overlap too much, and to prevent the overlap of a certain part of the probability mass, $\sigma_n$ is set to a certain fraction ($p > 1$) of the length of the connection vector: $a_n = |\vec{v}_n|$, $\sigma_n = \frac{1}{p}|\vec{v}_n|$. So the first parameter of the ellipsoid, the smallest principal axis, is defined. From $\vec{v}_n$, we make use not only of the length but also of its direction. We take $\vec{v}_n$ as the first constituent of the rotation matrix $R$. After normalization to unit length, it will be the last row vector of $R$. Afterward, the second principal axis $a_{n-1}$ is found using the following procedure: The general function of a hyperellipsoid in an $n$-dimensional space is

$$\frac{x_1^2}{a_1^2} + \frac{x_2^2}{a_2^2} + \cdots + \frac{x_i^2}{a_i^2} + \cdots + \frac{x_n^2}{a_n^2} = 1. \qquad (3.1)$$
$a_{n-1}$ is found by setting $a_1 = a_2 = \cdots = a_{n-1}$. This simplification is possible because the principal axes are not yet located in the $(n-1)$-dimensional subspace that is perpendicular to $a_n$ and therefore can be rotated around any angle. The calculation of $a_{n-2}$ with the known variables $a_n, a_{n-1}$ is possible, setting $a_{n-2} = a_{n-3} = \cdots = a_1$. This procedure continues until all $a_n$ to $a_2$ are known and the calculation of the last parameter $a_1$, the last and largest principal axis, is possible. Generally, to calculate a certain $a_i$, $1 < i < n$, the variables $a_{i+1}$ to $a_n$ are known, and the variables $a_1$ to $a_i$ are set equal. The three-dimensional case serves as an illustration. In Figure 2, $\vec{v}_3$ ($a_3 = |\vec{v}_3|$) was calculated using the Euclidean distance and points from $X_i^A$ to $X_j^B$. $\vec{v}_3$ can be interpreted as the $x_3$-axis of a new coordinate system originated
in $X_i^A$, where the $x_2$- and $x_1$-axes are located on the plane P but not defined completely, and they can be rotated freely. To calculate $\vec{v}_2$, the $x_2$-axis, we set $b = a_2 = a_1$ in formula 3.1 (representing the free rotation on P) and resolve

$$b = \left[\frac{a_3^2 (x_1^2 + x_2^2)}{a_3^2 - x_3^2}\right]^{1/2}$$

for all $X_k^B$ transformed into the new coordinate system with
$x_3 < a_3$. $b$ then gives the length of the new $x_2$-axis, where $X_k^B$ is projected "elliptically" to P. In other words, an ellipse is constructed through $X_j^B$ and $X_k^B$ (see Figure 2). This method allows estimating $a_i$, but as a distance measure, the Euclidean distance is used only for calculating $a_n$. For all other $a_i$, the following recurrent formula has to be used (see the appendix for its derivation):

$$a_i = \begin{cases} \left[\dfrac{\left(\prod_{j=i+1}^{n} a_j^2\right) \sum_{j=1}^{i} x_j^2}{\prod_{j=i+1}^{n} a_j^2 - \sum_{k=i+1}^{n} x_k^2 \prod_{j=i+1,\, j \neq k}^{n} a_j^2}\right]^{1/2} & \text{if } \sum_{j=i+1}^{n} x_j^2 / a_j^2 < 1; \\ \infty & \text{else,} \end{cases} \qquad (3.2)$$
where $i$ starts from $n - 1$ and decrements to 1 to calculate all the principal axes of the ellipsoid. The variances are again proportional to the lengths of the semi-axes of the hyperellipsoid: $\sigma_1 = \frac{1}{p}a_1$, $\sigma_2 = \frac{1}{p}a_2$, \ldots, $\sigma_n = \frac{1}{p}a_n$ ($p > 1$, for example, $p = 2$). The connection vectors $\vec{v}_i$ between $X_i^A$ and the $X_j^B$ found using the above distance measure are used to construct the rotation matrix $R$, projecting $\vec{v}_i$ "elliptically" to the subspace that is perpendicular to the known vectors $\vec{v}_{i+1}, \ldots, \vec{v}_n$. The last vector $\vec{v}_1$ again has to be chosen in such a way that the resulting matrix $R$ has a positive determinant. The normalized vectors $\vec{v}_i$ constitute the row vectors of the matrix $R$. This terminates the procedure, because all unknown parameters that have to be estimated ($\sigma_1, \ldots, \sigma_n$, which construct the matrix $S$, and the matrix $R$) are known.

3.1 Computational Aspects. The central idea of the proposed approach was based on the partition of the covariance matrix $\Sigma$ into four matrices, $\Sigma = R^T S^{-1} S^{-1} R$, and the separate calculation of the elements $R$ and $S$. But looking at formula 2.1, which relates the probability with the gaussian function in a multidimensional space,

$$f(X) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - X_p)^T \Sigma^{-1} (X - X_p)\right), \qquad (3.3)$$
we can see that the covariance matrix $\Sigma$, which was estimated previously, is not needed explicitly. What is of importance for calculating the probability distribution is the determinant $|\Sigma|$ and $\Sigma^{-1}$, its inverse matrix.
To calculate $|\Sigma| = |R^T S^{-1} S^{-1} R|$, the matrices $R^T$ and $S^{-1}$ have to be calculated from the matrices $R$ and $S$, which are known because they were estimated as explained in the previous section. $S$, the matrix of the variances, is a diagonal matrix with elements not equal to zero only on its diagonal. Its inversion is very simple: take the reciprocal of the nonzero elements. Calculating $|\Sigma|$ reduces to a multiplication of $R$, $S^{-1}$, and $R^T$, all of which can be easily calculated. The other component needed for calculating the gaussian function is the inverse covariance matrix $\Sigma^{-1} = (R^T S^{-1} S^{-1} R)^{-1}$. It is calculated without inverting $\Sigma$ directly, making use of the equality $R^{-1} = R^T$, which holds for the rotation matrices as they are applied here:
$$\Sigma^{-1} = (R^T S^{-1} S^{-1} R)^{-1} = R^{-1}(S^{-1})^{-1}(S^{-1})^{-1}(R^T)^{-1} = R^T S S R. \qquad (3.4)$$
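Equality 3.4 and the determinant shortcut are easy to confirm numerically; a sketch with a hypothetical 3-D rotation and axis lengths (the values are ours, not from the article):

```python
import numpy as np

# Hypothetical 3-D example: R rotates about the z-axis; S holds the axis values.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
S = np.diag([2.0, 1.0, 0.5])
S_inv = np.diag(1.0 / np.diag(S))       # diagonal matrices invert elementwise

Sigma = R.T @ S_inv @ S_inv @ R         # Sigma = R^T S^-1 S^-1 R

# Equation 3.4: since R^-1 = R^T, Sigma is inverted with no matrix inversion.
Sigma_inv = R.T @ S @ S @ R
assert np.allclose(Sigma_inv, np.linalg.inv(Sigma))

# |Sigma| follows from the diagonal of S alone, because |R| = 1.
assert np.isclose(np.linalg.det(Sigma), np.prod(np.diag(S_inv) ** 2))
```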
4 Experimental Results
The method described was implemented to compare the generalization ability of this model with the original PNN and the support vector machine (SVM) as an example of a classifier with similar complexity. A variation of the famous two-spiral problem, known in the area of backpropagation training and originally proposed by Alexis P. Wieland in a post to the connectionist mailing list in August 1988, and two examples taken from the UCI Machine Learning Repository (Blake & Merz, 1998), the BUPA liver disorder database and the Pima Indian diabetes diagnosis, were used as benchmark tests.

4.1 Two Spirals. The original task, as proposed by Wieland, is to classify points in the (x-y)-plane into one of two classes. The classes have the form of a spiral, making it difficult to be learned by a standard backpropagation network that uses global basis functions. The PNN seemed to offer a better solution because it uses localized functions. In this example, the training patterns formerly located in the two-dimensional x, y-plane now expand to three-dimensional space. The two spirals are not an extension of the x, y-plane into the z-direction, but build similar spirals in the x, z- and y, z-planes. This classification problem, even less appropriate for a linear classifier, has to be solved by a PNN in the standard configuration, the SVM, and the model proposed here. Whereas in the original experiment with the two spirals in two dimensions, the network had to learn 194 training patterns, now in three dimensions, the number of patterns exploded to 1200 examples, 600 for each class. All models were trained with a subset of the 1200 patterns that constitute the spirals. We used reduction steps of 5%, 10%, 15%, 20%, and 25%. The training set was obtained from the complete example by admitting each pattern with a probability of 0.95 (0.9, 0.85, 0.8, 0.75, respectively). The reason that we chose this method of selecting the training pattern set for
I. Galleske and J. Castellanos
Table 1: Absolute Classification Errors in 10 Experiments of the Two-Spiral Problem.

Reduction              5%      10%     15%     20%     25%
PNN (variance 1.0)     456.0   468.6   469.3   477.4   484.3
PNN (variance 0.1)     28.9    58.7    77.9    109.7   132.6
PNN (variance 0.01)    17.6    30.5    43.3    61.6    73.5
SVM                    7.2     15.3    26.1    37.9    51.1
Proposed model         1.2     2.2     3.7     8.8     12.3
our test case is that we will get a random distribution, with holes in the spirals that could be small (deleting only one training pattern) and perhaps easily generalized, but also big holes of two, three, or even more missing patterns in the larger reduction experiments, where the networks would surely have more problems completing these missing patterns. The test set consisted of the complete 1200 patterns. The summary of this experiment is shown in Table 1. From the first three rows, one can observe that the generalization ability of the PNN is better the smaller the variance is. The reason for this classification error lies in the high pattern density at the center of the spiral, which occupies and overwrites the low-density region of the outer arms. Although the SVM performs much better (fourth row), reducing the classification error considerably, the model proposed here takes this property into account, leading to very few misclassifications. Another observation is that the quality of the models is independent of the reduction steps.

4.2 BUPA Liver Disorder. So as not to depend solely on the results of the preceding synthetic experiment, we also applied the different classification networks to real-world problems taken from the UCI Machine Learning Repository (Blake & Merz, 1998). The first example is the BUPA liver disorder database. The available 345 instances were divided into a training set of 150 randomly chosen patterns and a test set of 345 patterns. Table 2 summarizes the results over the 10 experiments with different training sets. The first five columns give the absolute numbers of misclassifications of the original PNN with different variances of the covariance matrix of this model. To find the best results for the SVM, various kernel functions were compared; radial basis functions were finally used because their generalization ability was superior to that of the other kernels.
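The random-admission scheme used to build the spiral training sets (section 4.1) can be sketched as follows (a minimal Python illustration; the function name and seeding are ours, not the authors'):

```python
import random

def admit_training_subset(patterns, keep_prob, seed=0):
    # Admit each pattern independently with probability keep_prob
    # (0.95, 0.9, 0.85, 0.8, 0.75 for the 5%-25% reduction steps).
    rng = random.Random(seed)
    return [p for p in patterns if rng.random() < keep_prob]
```

Because each pattern is admitted independently, the resulting gaps are randomly distributed: mostly single missing patterns, with occasional larger holes at the stronger reduction steps.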
The last column shows the results of the model proposed in this article. Although all results are very similar, the SVM has a better performance than the standard PNN, and the model proposed in this article outperforms the other models.
Optimization of the Kernel Functions
Table 2: Absolute Classification Errors in 10 Experiments of the Models in the BUPA Liver Disorder Problem.

Model                     Errors
PNN (variance 0.01)       85.3
PNN (variance 0.1)        82.9
PNN (variance 1.0)        82.1
PNN (variance 10)         81.1
PNN (variance 20)         80.9
Support Vector Machine    78.0
Proposed model            75.9
Table 3: Absolute Classification Errors in 10 Experiments of the Different Models in the PIMA Indian Diabetes Diagnosis Problem.

Model                     Errors
PNN (variance 0.01)       212.6
PNN (variance 0.1)        213.4
PNN (variance 1.0)        213.7
PNN (variance 10)         214.0
PNN (variance 20)         216.4
Support Vector Machine    71.8
Proposed model            56.1
4.3 PIMA Indians Diabetes Diagnosis. The second real-world example also demonstrates the effectiveness of the new model. The test setup consisted of 10 experiments with 768 test patterns and 576 training patterns (75% of the test set). Table 3 shows the absolute classification error over the 10 experiments for the original PNN, the SVM with a radial basis function kernel (which had the best generalization ability), and the new model. As in the previous experiments, the proposed model generalizes better than the original PNN and the SVM.

5 Conclusion
Estimating the covariance matrix is a problem in the PNN model. This work presents a technique to overcome this limitation and automatically calculate the covariance matrix of each training pattern of the network by looking only at its local environment. To this end, the matrix S was divided into submatrices that were estimated separately, making the calculation of each one simple. One synthetic and two real-world experiments (BUPA liver disorder, PIMA Indians diabetes) were executed to compare the generalization ability of the proposed model with the original PNN and the SVM, a model of similar complexity. In all tests, the classification accuracy of the new model was better, as shown by an enhanced ability to generalize to unseen patterns. The results are very encouraging, so future work will try to reduce the network size of the PNN using some kind of clustering technique to take advantage of the calculated covariance matrix, but without losing the enhanced ability to generalize to unseen patterns.
Appendix: Derivation of the Distance Measure
For the method to calculate the distance between two training patterns in the n-dimensional problem space, two distance measures are used. For the first dimension, the simple Euclidean distance measure is used, but from the second dimension on, a special elliptical distance is applied. The derivation of this distance measure is presented here. For calculating the distance between two training patterns $X_A$ and $X_B$ in dimension i, $1 \le i < n$, the following equation holds (see section 3):

$$
\frac{x_1^2}{b^2} + \frac{x_2^2}{b^2} + \cdots + \frac{x_i^2}{b^2}
+ \frac{x_{i+1}^2}{a_{i+1}^2} + \cdots + \frac{x_{n-1}^2}{a_{n-1}^2} + \frac{x_n^2}{a_n^2} = 1. \tag{A.1}
$$
$x_1$ to $x_n$ are the coordinates of the pattern $X_B$ transformed into the coordinate system located at $X_A$. $a_n$ to $a_{i+1}$ are known variables from earlier calculations. The variables $a_1$ to $a_i$ are replaced by a single placeholder b, which has to be calculated. Transferring all summands that do not contain b to the right side of the equation, combining the left side, and extending all fractions on the right side of the equation with $\prod_{j=i+1}^{n} a_j^2$ yields

$$
\frac{x_1^2 + \cdots + x_i^2}{b^2}
= \frac{\prod_{j=i+1}^{n} a_j^2}{\prod_{j=i+1}^{n} a_j^2}
- \frac{\prod_{j=i+1}^{n} a_j^2 \, x_{i+1}^2}{\prod_{j=i+1}^{n} a_j^2 \, a_{i+1}^2}
- \cdots
- \frac{\prod_{j=i+1}^{n} a_j^2 \, x_{n-1}^2}{\prod_{j=i+1}^{n} a_j^2 \, a_{n-1}^2}
- \frac{\prod_{j=i+1}^{n} a_j^2 \, x_n^2}{\prod_{j=i+1}^{n} a_j^2 \, a_n^2}. \tag{A.2}
$$
To reduce the right side to a common denominator, the $a_k^2$ can be eliminated by leaving out the respective term in the $\prod_{j=i+1}^{n} a_j^2$ of the numerator:

$$
\frac{\sum_{j=1}^{i} x_j^2}{b^2}
= \frac{\prod_{j=i+1}^{n} a_j^2}{\prod_{j=i+1}^{n} a_j^2}
- \frac{\prod_{\substack{j=i+1 \\ j \neq i+1}}^{n} a_j^2 \, x_{i+1}^2}{\prod_{j=i+1}^{n} a_j^2}
- \cdots
- \frac{\prod_{\substack{j=i+1 \\ j \neq n-1}}^{n} a_j^2 \, x_{n-1}^2}{\prod_{j=i+1}^{n} a_j^2}
- \frac{\prod_{\substack{j=i+1 \\ j \neq n}}^{n} a_j^2 \, x_n^2}{\prod_{j=i+1}^{n} a_j^2}. \tag{A.3}
$$

Examining the numerator, we find the following: in the products $\prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2$, exactly those a's are left out that have the same index k as the $x_k$ they are multiplied with: $\prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2 \, x_k^2$. Therefore, we combine the summands in the following way:

$$
\frac{\sum_{j=1}^{i} x_j^2}{b^2}
= \frac{\prod_{j=i+1}^{n} a_j^2 - \sum_{k=i+1}^{n} \left( \prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2 \right) x_k^2}{\prod_{j=i+1}^{n} a_j^2}. \tag{A.4}
$$
Solving this equation for the unknown variable b gives the distance between the two training patterns:

$$
b = \left( \frac{\prod_{j=i+1}^{n} a_j^2 \; \sum_{j=1}^{i} x_j^2}
{\prod_{j=i+1}^{n} a_j^2 - \sum_{k=i+1}^{n} \left( \prod_{\substack{j=i+1 \\ j \neq k}}^{n} a_j^2 \right) x_k^2} \right)^{1/2}. \tag{A.5}
$$
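Equation A.5 translates directly into code. The following is an illustrative Python sketch (our own, with 0-based indexing): `x` holds the coordinates x_1..x_n and `a` holds a_1..a_n, of which only a_{i+1}..a_n are used.

```python
from math import prod, sqrt

def elliptical_b(x, a, i):
    # Solve equation A.5 for the placeholder b, given the transformed
    # coordinates x and the known semi-axes a_{i+1}..a_n (1 <= i < n).
    n = len(x)
    p = prod(a[j] ** 2 for j in range(i, n))       # prod_{j=i+1}^{n} a_j^2
    num = p * sum(x[j] ** 2 for j in range(i))     # ... times sum_{j=1}^{i} x_j^2
    den = p - sum(prod(a[j] ** 2 for j in range(i, n) if j != k) * x[k] ** 2
                  for k in range(i, n))
    return sqrt(num / den)
```

As a check, equation A.1 gives the equivalent form b^2 = (sum of x_j^2 for j <= i) / (1 - sum of x_k^2 / a_k^2 for k > i), and the two agree numerically.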
References

Berthold, M., & Diamond, J. (1998). Constructive training of probabilistic neural networks. Neurocomputing, 19, 167–183.

Blake, C., & Merz, C. (1998). UCI Repository of Machine Learning Databases. Irvine: University of California, Irvine, Department of Information and Computer Sciences. Available online: http://www.ics.uci.edu/~mlearn/MLRepository.html.

Chen, C., & You, G. (1992). ISBN recognition using a modified probabilistic neural network (PNN). Proceedings of the 11th IAPR International Conference on Pattern Recognition, 2, 419–421.

Galleske, I., & Castellanos, J. (1997). Probabilistic neural networks with rotated kernel functions. In Proceedings of the 7th International Conference on Artificial Neural Networks ICANN'97 (pp. 379–384). Lausanne, Switzerland.

Musavi, M. T., Chan, K., Hummels, D., & Kalantri, K. (1993). On the generalization ability of neural network classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), 659–663.

Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076.

Specht, D. (1988). Probabilistic neural networks for classification, mapping, or associative memory. Proceedings, IEEE International Conference on Neural Networks, 1, 525–532.

Specht, D. (1990). Probabilistic neural networks. Neural Networks, 3, 109–118.
Specht, D. (1991). Generalization accuracy of probabilistic neural networks compared with backpropagation networks. IJCNN-91-Seattle: International Joint Conference on Neural Networks, 1, 887–892.

Yang, Z., & Chen, S. (1998). Robust maximum likelihood training of heteroscedastic probabilistic neural networks. Neural Networks, 11, 739–747.

Received November 6, 2000; accepted September 27, 2001.
LETTER
Communicated by Stephen José Hanson
Methods for Binary Multidimensional Scaling

Douglas L. T. Rohde
[email protected]
School of Computer Science, Carnegie Mellon University, and the Center for the Neural Basis of Cognition, Mellon Institute, Pittsburgh, PA 15213

Multidimensional scaling (MDS) is the process of transforming a set of points in a high-dimensional space to a lower-dimensional one while preserving the relative distances between pairs of points. Although effective methods have been developed for solving a variety of MDS problems, they mainly depend on the vectors in the lower-dimensional space having real-valued components. For some applications, the training of neural networks in particular, it is preferable or necessary to obtain vectors in a discrete, binary space. Unfortunately, MDS into a low-dimensional discrete space appears to be a significantly harder problem than MDS into a continuous space. This article introduces and analyzes several methods for performing approximately optimized binary MDS.

1 Introduction
Recent approaches to artificial intelligence and machine learning have come to rely increasingly on data-driven methods that involve large vector spaces. One application of high-dimensional vectors that is particularly relevant today is in representing the contents of large collections of documents, such as all texts available on the Internet. The similarity structure in these vector spaces can be exploited to perform a variety of useful tasks, including searching, clustering, and classification (see, e.g., Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Berry, Dumais, & O'Brien, 1994). Other popular applications of vector spaces include representing the content of images (Beatty & Manjunath, 1997) and the meanings of words (Lund & Burgess, 1996; Burgess, 1998; Clouse, 1998). However, it is often inefficient, if not intractable, to perform complex analyses directly in high-dimensional vector spaces. If one could reduce the set of high-dimensional vectors to a set of vectors in a much lower-dimensional space while preserving their similarity structure, operations could be performed more efficiently on the smaller space, with the potential added benefit of improved results due to reduced noise and greater generalization. Scaling to a space with just one, two, or three dimensions also permits easy visualization of the resulting space, which can lead to a better understanding of its overall structure.

Neural Computation 14, 1195–1232 (2002) © 2002 Massachusetts Institute of Technology
In order to place the current work in a historical framework, let us briefly trace the development of modern multidimensional scaling (MDS) techniques. Most applications of MDS, particularly in the psychological domains, have been in the analysis of human similarity ratings. Thus, rather than beginning with points in a high-dimensional vector space, a more common starting point has been a matrix of pairwise comparisons of a set of items. Various types of comparisons might be used, including similarity or dissimilarity judgments, confusion probabilities, or interaction rates. One problem introduced by the use of measures of this sort is that it is not clear how best to scale these ratings so that they correspond directly to distances in the vector space. Subjects' ratings may be quite skewed and are likely to be nonmetric. Perhaps the earliest explicit and practical MDS method was that of Torgerson (1952), which grew out of the work of Richardson (1938) and Young and Householder (1938), among others. Torgerson used a one-dimensional scaling technique to convert dissimilarity ratings to target distances and then attempted to find a set of points whose pairwise Euclidean distances best matched the target distances according to mean-squared error. The initial scaling function might simply be a linear transformation or could be a nonlinear function, such as an exponential. Although this technique is quite effective, its formal requirements are too strong for many applications, and a serious drawback is that the proper scaling method is difficult to determine and may vary from one problem to the next. The next major advance was made by Shepard (1962), who suggested that rather than attempting to match scaled target distances directly, the goal of MDS should be to obtain a monotone relationship between the actual point distances and the original dissimilarities.
Thus, the dissimilarities need not be scaled, and their values are actually discarded altogether; all that is retained is their relative ordering. Torgerson's earlier approach came to be known as metric MDS and this new technique as nonmetric MDS. However, Shepard did not provide a mathematically explicit definition of what constitutes a solution. Kruskal (1964a, 1964b) further developed the method by explicitly defining a function, known as stress, relating the pairwise distances and the ranking of dissimilarities. Stress essentially involves the scaled sum-squared error between the pairwise distances and the best-fitting monotonic transformation of the original dissimilarity ratings. Iterative gradient descent was used to find the configuration with minimal stress. The basic technique developed by Shepard and Kruskal has remained the standard for most applications of MDS to psychological phenomena (Shepard, Romney, & Nerlove, 1972; Borg & Groenen, 1997). Although possibly slower, gradient descent techniques have the advantage over matrix algebra methods in that they can more easily tolerate missing or sparse data and can be used to minimize any differentiable measure of stress. Nonmetric MDS is quite effective when similarity ratings involve unknown distortions. However, relying on rank order sometimes discards information that cannot be recovered (Torgerson, 1965). This may be particularly true in cases where the structure of the data involves a number of tight clusters that are well separated (Shepard, 1966). Thus, metric methods may be more suitable for some types of data, but for the vast majority of problems of practical interest, nonmetric methods are likely to be as good, and perhaps better.

A common feature of the MDS techniques discussed thus far is that they rely on the final vector space having real-valued components. However, some applications require vectors with discrete, usually binary components. That is, the vectors should lie at the corners of a unit hypercube. An important application of this type is the development of representations for training neural networks. Increasingly, neural networks that serve as cognitive models are trained using inputs or targets derived from real data rather than artificially generated vector sets. But those data sets may involve vectors of high dimensionality, possibly in the tens or hundreds of thousands, and it would be computationally intractable to train a network using such large vectors. Furthermore, neural networks with thresholded outputs, particularly recurrent attractor networks (Pearlmutter, 1989; Plaut & Shallice, 1993), often learn better when their vector targets use binary components. It is harder for the network to drive output units accurately to intermediate levels of activation than to drive them to fully active or inactive states. Thus, it is sometimes necessary to scale a set of high-dimensional vectors to a relatively low-dimensional, binary vector space. Binary MDS (BMDS) is a much harder problem than standard MDS. In fact, it has been shown that embedding a metric distance space in a bit space with minimal distortion is NP-complete (Deza & Laurent, 1997).
Thus, there is a good chance that no polynomial-time algorithm exists to compute an optimal set of BMDS vectors. However, it may still be possible to compute efficiently an approximation to the optimal BMDS solution. This article presents several methods for performing approximately optimal BMDS. The solutions fall into two broad classes: those that perform the optimization directly in bit space and so-called hybrid methods that perform the optimization in a real-valued space before converting to a bit space.

The first hybrid method is somewhat similar to Torgerson's linear-algebraic approach to MDS. It begins by computing the singular value decomposition of the matrix of initial vectors. The right singular vectors are then converted to bits using a unary encoding, with more bits assigned to vectors having larger singular values. Two other hybrid methods are based on Shepard and Kruskal's technique for MDS. Gradient descent is performed in a real-valued vector space using the stress cost function before the vector components are converted to bits based on their signs. One of these variants is metric in that it uses the actual target values in computing the stress. The other method uses a monotonic transformation of the target values rather than the values themselves. The major problem with hybrid techniques is that important information can be lost in the discretization. A very good real-valued solution may turn into a very bad binary solution.

An alternative approach is to perform the bulk of the optimization directly in bit space. The only known prior algorithm for computing BMDS directly in bit space is that of Clouse and Cottrell (1996) and Clouse (1998). Their method began by creating a set of bit vectors for each item by thresholding values from the original high-dimensional vector. They then performed a random walk by repeatedly selecting bits at random, computing whether flipping the bit would improve the overall cost, and doing so only when beneficial. Once it was determined that no improvements could be made by flipping any one bit, the algorithm terminated. One drawback of the Clouse and Cottrell method is that it is computationally inefficient. Without good record keeping, it is costly to determine whether a bit flip is advantageous. As the number of remaining good bits diminishes, the algorithm becomes less and less efficient because many bits must be tested before any progress can be made. The alternative, exhaustively testing all bits before deciding which to flip, is not much better. Thus, the algorithm does not scale well to larger problems.

¹ Actually, it is NP-complete to decide whether a metric distance space is ℓ1-embeddable with no distortion. It follows that finding a minimally distorted embedding of a metric space into a binary space under an ℓ1 distance measure (such as Hamming distance) is NP-complete. However, the original set of distances in the BMDS problems considered here is more restricted than a metric space because they represent distances between pairs of points. Thus, the known proof may not apply. Nevertheless, it seems likely that deciding whether an ℓ2- or ℓ1-embeddable metric space is embeddable in an ℓ1-space of lower dimensionality is also an NP-complete problem.
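The random-walk procedure can be illustrated with a small sketch (our own simplified Python reconstruction, not Clouse and Cottrell's implementation; their algorithm stops when no single flip helps, whereas this sketch uses a run of unhelpful tries as a simple stopping proxy):

```python
import random

def hamming(u, v):
    # Number of bit positions on which two vectors differ.
    return sum(a != b for a, b in zip(u, v))

def total_cost(vectors, targets):
    # Sum of absolute differences between actual Hamming distances
    # and target distances over all pairs of items.
    n = len(vectors)
    return sum(abs(hamming(vectors[i], vectors[j]) - targets[i][j])
               for i in range(n) for j in range(i + 1, n))

def random_walk_bmds(vectors, targets, max_stale=1000, seed=0):
    # Repeatedly pick a random bit and keep the flip only if it
    # lowers the cost; stop after max_stale unhelpful tries in a row.
    rng = random.Random(seed)
    n, d = len(vectors), len(vectors[0])
    cost = total_cost(vectors, targets)
    stale = 0
    while stale < max_stale:
        i, k = rng.randrange(n), rng.randrange(d)
        vectors[i][k] ^= 1
        new_cost = total_cost(vectors, targets)
        if new_cost < cost:
            cost, stale = new_cost, 0
        else:
            vectors[i][k] ^= 1  # revert the unhelpful flip
            stale += 1
    return vectors, cost
```

Recomputing the full cost after every candidate flip is exactly the inefficiency described above; the record-keeping scheme of the next section avoids it by tracking the incremental cost change of each possible flip.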
The first fully binary method presented here is an improved version of Clouse and Cottrell's algorithm. By using careful record keeping, it is able to track the change in cost that would result from flipping any bit and to quickly find the bit that would result in the greatest improvement. Although there is some cost for the record keeping, it is more than made up for by the fact that the algorithm need never test a bit only to discover that flipping it would be counterproductive. A more effective, though less efficient, version of this algorithm minimizes the sum of the squared differences between actual and target distances rather than the sum of the absolute differences. The final method introduced in this study constructs the bit vectors sequentially, choosing the first bit in each vector, followed by the second bit in each vector, and so on. The algorithm then iterates several times, reassigning the bits in each dimension given the other bits previously chosen. Although simple and relatively easy to implement, this method is quite fast and effective. The next section introduces the tasks and metrics used to evaluate the BMDS methods. Each method is then described in further detail, including the details of its implementation, advantages and disadvantages, and some possible variations. Finally, the methods are evaluated in terms of performance and running time. It is hoped that this study will prove useful to researchers interested in immediate applications of binary multidimensional scaling and that it will inspire future advances in these methods.

2 Evaluation Metrics and Example Tasks
Before describing the actual BMDS algorithms, we begin by defining the scaling task more explicitly. The input is a set of N real-valued vectors of dimensionality M, representing N items. The output is a set of N bit vectors with dimensionality D. The goal is for the relative distances between the final vectors to reflect the relative distances between the initial vectors as closely as possible. To make this more concrete, we must define the functions measuring pairwise distance in the original and final spaces and a measure of how well these two sets of distances agree. There are a number of reasonable distance metrics for the original space, four of which are shown in Table 1. Euclidean distance and city-block distance are standard choices. However, they are dependent on the dimensionality, M, and the scaling of the vectors, making them somewhat inconvenient. Cosine is another reasonable choice. It is scale invariant and is confined to a fixed range, [−1, 1], which is more convenient than measures that depend on dimensionality and average value magnitude. A fourth possibility, and the one used in this study, is to base the distance measure on Pearson's correlation. Like cosine, it is scale invariant and is confined to a fixed range, [−1, 1]. Computationally, correlation and cosine are identical except that cosine is calculated using the actual vector components, while correlation is based on the differences between vector components and their mean. If components are evenly distributed between positive and negative values, their mean is usually close to zero, and cosine and correlation are quite similar. But if components are constrained to be nonnegative, cosine will be positive while correlation continues to use the full [−1, 1] range. Correlation is thus a good initial choice for many scaling problems. In practice, using correlation has led to better results than using the other three measures with several different tasks and BMDS algorithms.
In order to turn correlation into a distance measure, it is scaled by −0.5 and shifted by 0.5 so that a correlation of 1.0 becomes a distance of 0 and a correlation of −1.0 becomes a distance of 1.0. The set of all pairwise correlation distances was then scaled by a constant factor to achieve a mean value of 0.5, although this has little practical effect because the mean distance tended to be very close to 0.5 before scaling. This linearly transformed correlation will be referred to as correlation distance. Although it was not done here, one could scale the resulting values using an exponential with an exponent greater than 1 to increase the influence of larger distances or less than 1 to enhance the smaller distances.
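For a single pair of vectors, the correlation distance can be computed as follows (a minimal Python sketch; the global rescaling of all pairwise distances to a mean of 0.5 is omitted here):

```python
from math import sqrt

def correlation_distance(x, y):
    # 0.5 - 0.5 * Pearson correlation, so that r = 1 maps to
    # distance 0 and r = -1 maps to distance 1.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.5 - 0.5 * cov / sqrt(vx * vy)
```

Dropping the mean subtraction recovers the cosine-based distance in Table 1, which makes the computational similarity of the two measures explicit.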
Table 1: Some Candidate Distance Measures.

Distance Measure    Formula
City-block          $\sum_k |x_k - y_k|$
Euclidean           $\sqrt{\sum_k (x_k - y_k)^2}$
Cosine              $0.5 - 0.5 \, \dfrac{\sum_k x_k y_k}{\sqrt{\sum_k x_k^2 \, \sum_k y_k^2}}$
Correlation         $0.5 - 0.5 \, \dfrac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_k (x_k - \bar{x})^2 \, \sum_k (y_k - \bar{y})^2}}$
The simplest and most reasonable choice for the distance metric in bit space seems to be Hamming (city-block) distance. That is, the distance between two vectors is the number of bits on which they differ. Note that for bit vectors, Euclidean distance is just the square root of the Hamming distance. Likewise, if the bit vectors tend to have a roughly equal number of 1s and 0s, which seems to be the case with most of these BMDS methods in practice, Hamming distance is closely approximated by correlation distance (when scaled by D). The third function that we must specify evaluates the agreement between the correlation distances in the original space and the Hamming distances in the final binary space. It is not entirely clear what the best measure is. One obvious choice is to use Kruskal's stress (Kruskal, 1964a):

$$
\text{Metric stress} = \sqrt{\frac{\sum_{i<j} (d_{ij} - t_{ij})^2}{\sum_{i<j} d_{ij}^2}},
$$

where i and j together iterate over all pairs of items, $d_{ij}$ is the Hamming distance of i and j's bit vectors, and $t_{ij}$ is the correlation distance of i and j's initial vectors, scaled by the dimensionality of the bit vectors, D. An alternative form of this measure, and the one that Kruskal actually used, is nonmetric stress. In this case, the actual distances are not directly compared to the target distances but to the best monotonic transformation of the target distances:

$$
\text{Nonmetric stress} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}},
$$

where the $\hat{d}_{ij}$ are those values that achieve minimal stress, under the constraint that the $\hat{d}_{ij}$ have the same rank order as the corresponding $t_{ij}$. Nonmetric stress is a better measure if one is concerned only with preserving the rank-order relationship between pairwise distances. But if one is concerned with directly matching the target distances, metric stress is preferable.
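Metric stress is straightforward to compute given the two lists of pairwise distances (an illustrative Python sketch; variable names are ours):

```python
from math import sqrt

def metric_stress(d, t):
    # d: Hamming distances d_ij; t: scaled target distances t_ij;
    # both listed over the same ordering of the pairs i < j.
    num = sum((di - ti) ** 2 for di, ti in zip(d, t))
    den = sum(di ** 2 for di in d)
    return sqrt(num / den)
```

Nonmetric stress replaces the targets t with the best-fitting monotonic transformation of them, which requires an additional fitting step (for example, isotonic regression) before the same formula is applied.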
An alternative method of evaluating the final vectors is to compute Pearson's correlation between the original set of pairwise distances among vectors and those of the final vectors. This measure is referred to here as goodness and is defined mathematically as follows:

$$
\text{Goodness} = \frac{\sum_{i<j} (d_{ij} - \bar{d})(t_{ij} - \bar{t})}
{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2 \; \sum_{i<j} (t_{ij} - \bar{t})^2}}.
$$
Note that the optimal stress value is 0 and the optimal goodness value is 1; better vectors should result in lower stress but higher goodness. Goodness has the property that it is unaffected by linear transformations of the distances, so scaling and shifting the target distances has no effect on goodness. Because metric stress, nonmetric stress, and goodness do not always agree on which is the best set of vectors, all three measures are reported in the analysis in section 9. In section 9.3, the practical differences between these measures are discussed.

2.1 Example Tasks. Two BMDS tasks were used in testing the algorithms presented here. The first, known as the Exemplar task, was completely artificial. It consisted of 4000 vectors of dimensionality 1000, generated in the following way. First, 10 random bit vectors of length 50 were produced. Each of the 4000 vectors was created by taking one of the 10 exemplars, flipping each bit with 10% chance, and then doing a random projection to real-valued 1000-dimensional space. The resulting vector set has a basically simple similarity structure with a good deal of randomness superimposed. Because the vectors occupied a 50-dimensional bit space prior to the random projection, the vectors should be quite compressible. The second task, the Word task, involves 5000 vectors of length 4000 representing word meanings. The vectors were generated using a method similar to HAL (Lund & Burgess, 1996). Word co-occurrences were gathered over a large corpus of Usenet text. Raw co-occurrence counts were converted to a ratio of the conditional probability of one word occurring in the neighborhood of another to the word's overall probability of occurrence. The first 4000 values of each word's vector, reflecting its co-occurrences with the 4000 other most frequent words, were taken as the word's meaning vector. This set has a more complex similarity structure than the Exemplar set.
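The construction of the Exemplar task can be sketched as follows (an illustrative Python reconstruction; the text does not specify the projection distribution, so a Gaussian random projection is assumed, and the parameter names are ours):

```python
import random

def make_exemplar_task(n_items=4000, n_exemplars=10, bits=50,
                       out_dim=1000, flip_p=0.1, seed=0):
    rng = random.Random(seed)
    # Random bit vectors of length `bits` serve as class exemplars.
    exemplars = [[rng.randint(0, 1) for _ in range(bits)]
                 for _ in range(n_exemplars)]
    # Random projection from the bit space to the real-valued space.
    proj = [[rng.gauss(0.0, 1.0) for _ in range(out_dim)]
            for _ in range(bits)]
    items = []
    for _ in range(n_items):
        base = rng.choice(exemplars)
        # Flip each bit with probability flip_p (10% in the paper).
        noisy = [b ^ (rng.random() < flip_p) for b in base]
        items.append([sum(noisy[k] * proj[k][j] for k in range(bits))
                      for j in range(out_dim)])
    return items
```

Because the items occupy only a 50-dimensional bit space before projection, a good BMDS method should be able to compress them substantially.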
Note that these problems are considerably larger than those to which MDS is typically applied, which generally involve no more than a few hundred items. Clouse and Cottrell (1996) reported an example task involving 233 words. Because any reasonable MDS algorithm will likely have a running time that is at least O(N²), the tasks studied here are effectively several hundred times more complex. Of critical concern will be not only the ability to achieve low stress or high goodness but also the running time of the various algorithms.
3 The Singular Value Decomposition (SVD) Method
The singular value decomposition (SVD) is the foundation for the latent semantic analysis technique for document indexing and retrieval (Deerwester et al., 1990). It has also recently received considerable attention for its use in efficient clustering methods (Frieze, Kannan, & Vempala, 1998). It therefore seems natural to consider designing a BMDS algorithm using the SVD. This first method, which is based on computing the SVD of the item vector matrix, is somewhat related to the metric MDS technique of Torgerson (1952). Any real matrix, A, has a unique SVD, which consists of three matrices, U S V, whose product is the original matrix. The first of these, U, is composed of orthonormal columns known as the left singular vectors, and the last, V, is composed of orthonormal rows known as the right singular vectors. S is diagonal and contains the singular values. The singular vectors reflect principal components of A, and each pair has a corresponding value, the magnitude of which is related to the variance accounted for by the vector. If A is symmetric and positive semidefinite, the left and right singular vectors will be identical and equivalent to its eigenvectors, and the singular values will be its eigenvalues. Nonbinary multidimensional scaling can be performed using the SVD as follows. Let A be the M × N matrix whose columns are the original item vectors. The SVD is computed, and the right singular vectors are sorted by decreasing singular value. Only the first D vectors and values are retained. The new representation of item i is the vector composed of the ith value in each of the D highest right singular vectors, scaled by its corresponding singular value.

3.1 Discretization. In order to perform binary MDS, the values must be converted to bits. One could simply use the first D right singular vectors and assign a single bit to each component. But this would not be very effective because the vectors with the highest values contain most of the useful variance.
Furthermore, there are often fewer than D nonzero singular values. Thus, we may need to assign more than one bit to each right singular vector. The method found to be most effective is to assign the bits roughly in proportion to the magnitudes of the singular values. A deterministic procedure is used to accomplish this. The first bit is assigned to the vector with the largest singular value. Its value is then halved. The second bit is assigned to the vector with the singular value that is now the largest. Once two bits have been assigned to a vector, its value is set to one-third of its original value. For three bits, its value is one-fourth of the original, and so on. Assume, for example, that we had three singular vectors with values 12, 9, and 5, and we were to assign 5 bits (D = 5). The first bit goes to the first vector, and its value is reduced to 6. The second bit goes to the second vector, and its value is reduced to 4.5. The third bit goes to the first vector again, because it is once again the largest value. Its value is set to 4 (12/3).
Binary Multidimensional Scaling
Bit 4 goes to the third vector, because it is now the largest, and its value is reduced to 2.5. The remaining values are 4, 4.5, and 2.5, and the last bit goes to the second vector. In the end, the first two vectors have been assigned two bits and the last vector one. The bits are then given values using a unary encoding. If a vector has three bits assigned to it, the possible codes are 000, 001, 011, and 111. This may not seem to be the most efficient use of the bits, but it is the only method that has the appropriate similarity structure between codes. The codes are assigned to items as follows. Each right singular vector has N components corresponding to the N items. These are sorted and partitioned evenly into the same number of bins as there are unary codes. With three bits, there would be four bins. The items in the first bin would receive the bits 000, those in the second bin would receive the bits 001, and so on.

3.2 Running Time. A major problem with the SVD method is that computing the SVD is quite slow. Because the matrices are dense, computing the SVD takes Θ(N²(N + M)) time. If N is larger than M, the matrix A can be transposed and the left singular vectors used, giving a running time of Θ(M²(M + N)). However, it is no faster to run the SVD algorithm with small D than with large D. Using a 500 MHz 21264 Alpha processor, it takes 25 minutes to compute the SVD on the Exemplar task. On the Word problem, with N = 5000 and M = 4000, it runs for nearly a day.

3.3 Variants. If one knows that a certain subset of items is representative of the others, it is possible to speed up the SVD computation by using only that subset to generate the singular vectors. Although this might enable the SVD method to tackle much larger problems, it hinders performance. As we will see in section 9, the SVD method of BMDS does quite poorly to begin with. Several alternative discretization methods have been tested but were not found to be as effective.
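The proportional bit-assignment procedure can be sketched as a short function (the name is illustrative); it reproduces the worked example above, assigning two bits each to the first two vectors and one to the third:

```python
def allocate_bits(singular_values, n_bits):
    """Assign n_bits among the singular vectors roughly in proportion to
    their singular values: a vector currently holding b bits is weighted
    by its original value divided by b + 1."""
    counts = [0] * len(singular_values)
    current = list(singular_values)
    for _ in range(n_bits):
        i = max(range(len(current)), key=current.__getitem__)  # largest value
        counts[i] += 1
        current[i] = singular_values[i] / (counts[i] + 1)      # halve, third, ...
    return counts
```

For instance, `allocate_bits([12, 9, 5], 5)` yields `[2, 2, 1]`, matching the example in the text.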
4 The Metric Gradient Descent (MGD) Method
The second hybrid method uses gradient descent to optimize the item vectors in a real-valued space before discretizing them. It is similar to more traditional MDS in its use of the stress cost function and gradient descent but differs from them in that it is metric: the stress measure uses linearly transformed target distances rather than monotonically transformed targets.

4.1 Initialization. The first step of the gradient descent method is to create an initial set of N real-valued vectors of dimensionality D. The vectors could be assigned randomly, as is typically done in standard MDS. However, this tends to result in an unnecessarily long minimization process. A better approach is to use a fast method to create moderately good initial vectors. One nice way to produce good initial vectors is with a random projection.
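Such a random-projection initialization might be sketched as follows (a minimal sketch, assuming the correlation-based projection described in the next paragraph; names are illustrative):

```python
import numpy as np

def random_projection(X, D, seed=0):
    """Project the N item vectors (rows of X, dimensionality M) down to D
    components: the correlation of each item with each of D random
    gaussian basis vectors."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((D, X.shape[1]))        # D random basis vectors
    Xc = X - X.mean(axis=1, keepdims=True)          # center rows for correlation
    Bc = B - B.mean(axis=1, keepdims=True)
    Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)
    Bc /= np.linalg.norm(Bc, axis=1, keepdims=True)
    return Xc @ Bc.T                                # N x D matrix of correlations
```

The resulting components are correlations, so each lies in [−1, 1].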
Douglas L. T. Rohde
First, D random basis vectors of dimensionality M are generated. The elements of the basis vectors are drawn independently from a gaussian distribution. For each item, the correlation between its original vector, having dimensionality M, and each of the D basis vectors is computed, and these D correlations form the components of the item's initial vector in the smaller space. This random projection is reasonably fast, Θ(NMD), and preserves much of the information in the original vectors, especially if D is large. Even without the minimization phase, the random projection goes a long way toward solving the BMDS problem. However, there is still considerable room for improvement to justify the more expensive optimization process.

4.2 The Cost Function and Its Derivative. The goal of the optimization phase is to minimize the stress between the actual vector distances and the scaled target vector distances:

S = \sqrt{\frac{S^*}{T^*}} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \beta t_{ij})^2}{\sum_{i<j} d_{ij}^2}}.
Here, i and j together iterate over all pairs of items, d_{ij} is the city-block distance between i and j's vectors in the new space, t_{ij} is the correlation distance between i and j's initial vectors, and β is an adjustable scaling factor. At the start of each step of the iteration, the value of β that results in minimal stress is computed. This method is commonly known as ratio MDS, as it seeks to minimize the discrepancy between actual and target distance ratios. The optimal value of β is given by

\beta = \frac{\sum_{i<j} d_{ij} t_{ij}}{\sum_{i<j} t_{ij}^2}.

Next, the derivative of the stress with respect to each of the ND vector components is computed. This derivative is given by the following formula:

\frac{\partial S}{\partial i_k} = \sum_j \left( \frac{d_{ij} - \beta t_{ij}}{\sqrt{S^* T^*}} - \frac{d_{ij} \sqrt{S^*}}{T^* \sqrt{T^*}} \right) \frac{\partial d_{ij}}{\partial i_k}.

When using city-block distance, the derivative of distance d_{ij} with respect to component i_k is simply sgn(i_k − j_k): 1 if i_k > j_k and −1 otherwise.

4.3 Component Updates. Once the ND derivatives have been accumulated over all item pairs, the vector components are updated by taking a small step down the direction of steepest descent. The size of the step is scaled by a learning rate parameter, α. Following Kruskal (1964b), it is also scaled by the root mean square (RMS) value of all vector components and
the inverse of the RMS derivative. The reason for this scaling is that it reduces the need to otherwise adapt the learning rate to the size of the overall problem. The component update formula is as follows:

i_k \leftarrow i_k - \alpha \, \frac{\sqrt{\sum_{i,b} i_b^2}}{\sqrt{\sum_{i,b} (\partial S / \partial i_b)^2}} \, \frac{\partial S}{\partial i_k}.
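Putting the pieces of sections 4.2 and 4.3 together, the stress computation and the scaled update step might look like the following sketch (function names are illustrative; `grad` is assumed to hold the accumulated derivatives ∂S/∂i_k):

```python
import numpy as np

def metric_stress(coords, targets):
    """Stress between the city-block distances of coords (N x D) and the
    target distances (N x N, symmetric), using the optimal scale beta."""
    iu = np.triu_indices(coords.shape[0], k=1)
    d = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)[iu]
    t = targets[iu]
    beta = (d @ t) / (t @ t)          # beta minimizing sum (d - beta t)^2
    return np.sqrt(np.sum((d - beta * t) ** 2) / np.sum(d ** 2)), beta

def update_step(coords, grad, alpha):
    """Gradient step scaled by the RMS of the components and the inverse
    RMS of the derivatives, then renormalized to an RMS of about 1."""
    scale = np.sqrt(np.mean(coords ** 2)) / np.sqrt(np.mean(grad ** 2))
    coords = coords - alpha * scale * grad
    return coords / np.sqrt(np.mean(coords ** 2))
```

Note that the ratio of the two root-sums in the update formula equals the ratio of the two RMS values, since both sums run over the same ND components.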
In order to prevent the vector components from shrinking or growing to a point where precision is lost, they are normalized following each update to maintain an RMS value of approximately 1. At the end of the minimization, the real-valued components are converted to bits based on their signs. Negative components become 0s, and positive components become 1s.

4.4 Learning-Rate Adaptation and Stopping Criterion. Despite the learning-rate scaling factors, it is still necessary to adapt the learning rate as the minimization proceeds. In general, a larger learning rate is used initially and then progressively reduced as a minimum is approached. The initial value of the learning rate was 0.2, which has proven to be a good choice for tasks of widely varying size. The general problem of automatically adapting the learning rate during a gradient descent to achieve the best performance is an interesting and difficult one. The following method is based on observations of what experienced humans do when adjusting a learning rate by hand. It is by no means optimal, but it does seem to work quite well for any gradient descent that has a smooth, fairly stable error function, such as the current problem or the training of neural networks under batch presentation. The learning-rate adjustment and the termination criterion are based on two measures, known as progress and instability. Progress is the percentage change in overall stress following the last weight update and is defined as:
\text{Progress} = P_t = \frac{S_{t-1} - S_t}{S_{t-1}},
where S_t is the current stress and S_{t-1} is the previous stress. A positive P value indicates that the stress is being reduced, which is good. If P ever becomes negative, the learning rate is immediately scaled by 0.75. This normally results in a return to positive progress on the next update. Instability is a time-averaged measure of the consistency of the progress and is defined as follows:

\text{Instability} = I_t = 0.5 \, I_{t-1} + \frac{P_{t-1} - P_t}{P_{t-1}}.

Steady progress results in low instability. Whenever the learning rate changes, instability is reset to 10. If progress is high and the instability
is low, things are proceeding well, and no changes are needed. Unstable progress often indicates that the learning rate is a bit too high, but it is usually not worth lowering the rate unless negative progress is made. The only case where it is generally a good idea to increase the learning rate is when progress is slow and stable. Thus, whenever the progress is less than 0.02 (2%) and the instability is less than 0.2, the learning rate is scaled by 1.2. The minimization terminates when the progress remains below 0.001 (0.1%) for 10 consecutive updates. On the example tasks used here, the algorithm generally terminates after between 50 and 250 updates, depending on the values of N and D.

4.5 Running Time. As with any other gradient-descent technique, it is difficult to predict in advance how many updates will be required. In general, the length of the settling process increases a bit with larger N and D. The cost of each update is Θ(N²D). Therefore, the algorithm tends to be somewhat worse than quadratic in N and somewhat worse than linear in D. The running time is evaluated empirically in section 9.4.

4.6 Variants. Gradient descent methods are extremely flexible and permit endless variation. In addition to the stress cost function and the city-block distance function, summed and sum-squared cost have also been tested, as well as Euclidean distance. None of these alternatives performed as well with metric gradient descent on the current tasks as did stress and city-block distance. Furthermore, various methods of scaling the target distances have also been tested. Rather than scaling the targets by an adaptive factor, β, they could simply be used in their raw form of correlation distances scaled by D. Alternately, one could transform the distances by a + b t_{ij}, where a and b are both adjusted to minimize the overall cost. This is known as an interval scale. Using an interval scale provided more flexibility but proved slightly less effective on the current problems.
One could loosen the restrictions on the target distances by using the optimal monotonic transformation, as in Kruskal and Shepard's nonmetric MDS technique (but that is the subject of the next algorithm). It may be possible to improve the rate of convergence with a better automated procedure for adjusting the learning rate. Another promising addition would be the use of a momentum term on the component update step, as is often done in training neural networks. Momentum adds a fraction of the previous step in vector space to the current step, which can often speed learning. Finally, as one might expect, a major problem with this method of BMDS is that significant information is lost in the discretization step. Although the city-block distances between pairs of vectors may match the target distances very well, those city-block distances will not accurately reflect the Hamming distances of the corresponding bit vectors unless the real-valued
components are close to −1 or 1.² Thus, it may be beneficial to introduce an additional cost term that penalizes components for being far away from −1 or 1. A simple polarizing cost function is the absolute value of the distance between the value and −1 or 1, whichever is closer. This would add a constant-size term to the value on each update, having the effect of pulling the value toward the closer of 1 and −1. Like the learning rate, the size of that step is an adjustable parameter. A somewhat more sophisticated polarizing cost function is (i_k^4 − 2 i_k^2 + 1)/4. This is shaped like a smooth W. It has concave-upward minima at 1 and −1 and a concave-downward maximum at 0. At values larger in magnitude than 1, it increases rapidly, heavily penalizing large values. It is nice because there are no discontinuities, which can disrupt the gradient descent. The derivative of this function is simply i_k^3 − i_k. Thus, at each weight step, i_k^3 − i_k, multiplied by the cost parameter, is subtracted from each value. Experimentation with these cost functions indicates that they are quite effective in speeding the convergence of the gradient descent. However, when using metric gradient descent, they do not seem to do much to improve the quality of the resulting vectors.

5 Ordinal Gradient Descent (OGD) Method
This next method is based more closely on Shepard and Kruskal's gradient descent technique for MDS (Shepard, 1962; Kruskal, 1964a). Rather than using linearly scaled target values in computing the stress, the best-fitting monotonic transformation of the target values is used. Thus, the method is nonmetric, or ordinal, in that the actual target values are not important, only their rank ordering. Except where noted, all aspects of this method are identical to those of MGD, including the initialization step, the update step, and the learning-rate adjustment.

5.1 Nonmetric Stress. Ordinal gradient descent uses as its cost function the nonmetric stress measure,

S = \sqrt{\frac{S^*}{T^*}} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}},
where the \hat{d}_{ij} are those values that minimize the stress, under the constraint that the \hat{d}_{ij} have the same rank order as the corresponding t_{ij}. The derivative of this function with respect to component i_k is the same as for metric stress, except that \beta t_{ij} is replaced by \hat{d}_{ij}.
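A least-squares monotone fit of this kind is commonly computed by pooling adjacent violators; the following is a minimal sketch of that idea (whether it matches the up-down algorithm used here in every detail is not claimed; `d_sorted` is assumed to hold the distances in increasing rank order of the targets):

```python
def monotone_fit(d_sorted):
    """Least-squares nondecreasing fit of the distances, taken in the
    rank order of the target distances (pool-adjacent-violators)."""
    blocks = []                          # each block: [sum, count]
    for v in d_sorted:
        blocks.append([v, 1])
        # pool while consecutive block means violate monotonicity
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:                  # expand each block to its mean
        fit.extend([s / c] * c)
    return fit
```

For example, the out-of-order sequence [1, 3, 2, 4] is fitted as [1, 2.5, 2.5, 4]: the violating pair (3, 2) is pooled to its mean.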
² If city-block distances between vectors with components that are expected to fall close to −1 or 1 are to be compared to Hamming distances of bit vectors formed from the signs of the components, the city-block distances must actually be scaled by 1/2.
The optimal \hat{d}_{ij} values are computed on each iteration using the up-down algorithm of Kruskal (1964b). In terms of running time, this process is linear in the number of distance values, O(N²), and is thus significantly less costly than computing the component derivatives.

5.2 Sigmoidal Components. A major problem with the gradient descent method is the distortion introduced in discretizing the real values into bits. If the real values are either exactly −1 or 1, the city-block distance between real-valued vectors will exactly correspond to the Hamming distance of the bit vectors, and there will be no loss of information. However, if the real values are much less than or greater than 1 in magnitude, the discretization will introduce noise. This led to the idea of adding a cost function that encourages the real values to be close to 1 or −1. However, such cost functions were not found to be very helpful in practice. An alternative is to transform the vector components using a sigmoid, or logistic, function, which limits values to the range [0, 1] and makes it easier for the gradient descent to achieve nearly discrete values (Rumelhart, Hinton, & Williams, 1986). The sigmoid function has the following formula:
s(i_k) = \frac{1}{1 + e^{-\gamma i_k}}.

This function is shaped like a flattened S. If i_k = 0, s(i_k) = 0.5. As i_k increases above 0, s(i_k) approaches 1. As i_k decreases below 0, s(i_k) approaches 0. The parameter γ is known as the gain and controls how rapidly the sigmoid approaches its limits. The advantage of the sigmoid is that it is quite easy for the gradient descent to drive the effective vector coordinates, s(i_k), close to 1 or 0 by driving the actual coordinates to large positive or negative magnitudes. When using the sigmoid-transformed components, the vector distance function and its partial derivative become

d_{ij} = \sum_k |s(i_k) - s(j_k)|,
\qquad
\frac{\partial d_{ij}}{\partial i_k} = \gamma \, s(i_k) (1 - s(i_k)) \, \mathrm{sgn}(s(i_k) - s(j_k)).
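In code, the transformed distance and its derivative might be sketched as follows (a minimal pure-Python sketch; function names are illustrative):

```python
import math

def sigmoid(x, gain=1.0):
    """Logistic function with adjustable gain."""
    return 1.0 / (1.0 + math.exp(-gain * x))

def sig_distance(u, v, gain=1.0):
    """City-block distance between sigmoid-transformed vectors."""
    return sum(abs(sigmoid(a, gain) - sigmoid(b, gain)) for a, b in zip(u, v))

def sig_distance_deriv(ik, jk, gain=1.0):
    """Partial derivative of the transformed distance w.r.t. component ik."""
    s = sigmoid(ik, gain)
    sign = 1.0 if s > sigmoid(jk, gain) else -1.0
    return gain * s * (1.0 - s) * sign
```

The factor γ s(i_k)(1 − s(i_k)) is largest near i_k = 0 and vanishes as a component polarizes, which is what lets the descent settle components near their discrete values.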
5.3 Polarizing Cost. The sigmoid is helpful only if a good proportion of the vector components actually grow fairly large (and thus approach 0 or 1 when put through the sigmoid). In order to encourage this, an extra polarizing term is added to the cost function. In the absence of the sigmoid, such a cost function must be somewhat complex, as it should have the effect of pulling the values toward −1 or 1 but not beyond them. Two such functions were mentioned in section 4.6. But when the sigmoid is used, the cost function need only push the raw values away from 0. Therefore, the simple
linear cost function is used. This has the effect of adding a small constant to the positive values and subtracting a small constant from the negative values on each weight update. In evaluating this method, a constant of 0.05 was used. Because the sigmoid avoids the problem of overly large components and the polarizing term prevents an overall shrinking of components, there is no need to renormalize the values periodically.

5.4 Running Time. Because they involve an exponential, computing the sigmoids exactly can substantially slow down this algorithm. Therefore, a fast approximation to the sigmoid function was used. The sigmoids of 1024 values evenly distributed in the range [−16, 16] were computed in advance and stored in a table. In computing the sigmoid of a new value, the sigmoids of the two closest values in the table are linearly interpolated. This method is quite fast and is accurate to within 2 × 10⁻⁷ of the correct value. Despite being somewhat more complex than MGD, the asymptotic running time of this method remains Θ(N²D) per update. Because it tends to converge faster than the metric method, however, the overall running time is somewhat less.

5.5 Variants. The gain term in the sigmoid function determines how sharp the sigmoid is and thus how polarized the values are. A higher gain draws the resulting values closer to 0 or 1 and thus reduces the noise introduced in the discretization. However, a high gain can also impede learning in the minimization phase. Thus, one might think of starting with a small gain and gradually increasing it as the minimization progresses. However, attempts to do this resulted in no improvement over using a fixed gain of 1.0. Using other fixed gain values also seemed to make little difference.

6 Greedy Bit-Flip (GBF) Method
This next method is quite similar to that of Clouse and Cottrell (1996), but the algorithm has been altered to achieve an asymptotically faster running time. Like the gradient descent techniques, this method performs a gradual minimization. However, rather than working in a real-valued space, it operates directly in bit space. The optimization proceeds by flipping individual bits in an attempt to minimize the linear cost function:

\text{Cost} = \sum_{i<j} |d_{ij} - t_{ij}|,
where dij is the Hamming distance between the bit vectors for items i and j and tij is the correlation distance between their original vectors, scaled by D, the number of bits per vector. The advantage of the linear cost function is that the contribution of individual bits in a vector to the cost is often
independent of one another, which allows the algorithm to be much more efficient. In the next method, GBFS, we consider the effect of using squared rather than absolute cost.

6.1 Initialization. The initial bit vectors are formed using a random projection, as in the gradient descent methods. However, rather than using the actual correlations with the basis vectors as the components of the initial vectors, these correlations are converted to bits based on their sign. Negative correlations become 0s, and positive correlations become 1s. This method has the property that 1s and 0s are expected equally often.

6.2 Minimization. The minimization phase is conceptually very simple. It operates by repeatedly flipping individual bits in the N × D matrix of bit vectors, provided that those flips lead to immediate improvements in the overall cost function defined above. The bit that is flipped is always the one that leads to the greatest immediate improvement in the cost; hence the name greedy. The Clouse and Cottrell (1996) algorithm differed in that it selected bits at random and then tested to see whether flipping the selected bit would decrease the cost. This is fairly inefficient near the end of the minimization process, when there are very few bits worth flipping. The key to performing the minimization quickly is to keep track at all times of the change in overall cost that would result from flipping each bit. This is referred to as the bit's gain. A positive gain means the overall cost would be reduced by changing the bit. All bits with positive gains are stored in an implicit heap (Williams, 1964). This standard priority queue data structure allows the bit with the highest gain to be accessed in constant time. Whenever the gain of a bit is changed (because it or another bit is flipped), the heap must be adjusted. In theory, this adjustment takes O(log(ND)) time. However, the O(log(ND)) bound is very loose.
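For intuition, a single bit's gain under the linear cost can be recomputed from scratch as follows (the actual algorithm maintains gains incrementally and keeps only the positive ones in the heap; this helper is an illustrative assumption, not the paper's bookkeeping):

```python
import numpy as np

def flip_gain(bits, d, t, i, k):
    """Reduction in the linear cost sum |d_ij - t_ij| from flipping bit k
    of item i. Flipping moves d_ij up by 1 where the two bits agreed and
    down by 1 where they differed."""
    j = np.arange(bits.shape[0]) != i
    same = bits[j, k] == bits[i, k]
    d_new = np.where(same, d[i, j] + 1, d[i, j] - 1)
    return np.sum(np.abs(d[i, j] - t[i, j]) - np.abs(d_new - t[i, j]))
```

A positive return value means the flip would reduce the overall cost, matching the sign convention of the gain heap.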
Because the gain changes tend to be small, heap updates rarely involve more than a few steps through the heap. Furthermore, because only bits with positive gains are maintained in the heap and the vast majority of bits have negative gains, especially toward the end of the process, there are usually far fewer than ND bits actually in the heap. Therefore, the gain update step is, in practice, quite close to a constant-time operation. Along with the gain heap, we also maintain the target distance, t_{ij}, and the current distance, d_{ij}, for each pair of vectors. If the actual distance is at least 1 less than the target distance, we would like to make the two vectors more different. Therefore, for any dimension k, if bits i_k and j_k are the same, the overall cost would be reduced by 1 if we flipped either of those bits. If i_k and j_k are different, the cost would increase by 1 if we flipped either of those bits. If the actual distance is at least 1 larger than the target distance, the contribution of these two vectors to the overall cost will be reduced by 1 if
we flip any bit that makes them more similar and increased by 1 if we flip any bit that makes them more different. If flipping a bit causes d_{ij} to change from larger than t_{ij} to smaller than t_{ij}, the change in cost will be 1 − 2(d_{ij} − t_{ij}). If d_{ij} grows larger than t_{ij} with a bit flip, the change in cost is 1 − 2(t_{ij} − d_{ij}). Of course, the gain for flipping a particular bit of item i does not depend on just one other item, j. It is summed over all other items. When any bit is flipped, the gains for some other bits must be adjusted. One factor that improves the efficiency of the GBF algorithm is that we do not need to update the gain for every other bit. As long as we are using the linear cost model, the gain for most other bits remains unchanged. If the bit i_k has just been flipped, we will definitely need to update the gains for the other bits in vector i. We will also need to update bit k for all of the other vectors. However, we do not necessarily need to update the other bits of the other vectors. We need to do so only if |t_{ij} − d_{ij}| < 1, either before or after the flip.

6.3 Running Time. Each of the bit updates takes constant time (assuming a constant heap update), so the cost of flipping a bit is somewhere between Θ(N + D) and Θ(ND), depending on how many other bits must be updated. In practice, all the bits of a vector are usually updated roughly one-third of the time. However, most of these cases occur toward the end of the minimization, as the actual distances grow close to the target distances. Therefore, the bulk of the minimization occurs in a relatively short time, and if time were a factor, the minimization could be stopped well before it is complete without significant degradation in the resulting vectors. Unfortunately, as with the gradient descent methods, it is not possible to predict exactly how long the minimization process will take. In theory, there could be an exponentially large number of flips before the algorithm terminates.
One could, of course, terminate early once a minimum gain, a minimum number of bits in the gain heap, or a time limit has been reached. Or one could test the bit vectors periodically and stop when significant further progress seems unlikely. However, in practice, the algorithm tends to terminate on its own after a consistent number of flips, varying by at most a few percent between trials.

6.4 Variants. One concern in performing greedy minimization (always flipping the bit that provides the greatest immediate gain) is that the algorithm may be more likely to fall into bad local minima. A better option may be to select random bits from the gain heap, as Clouse and Cottrell (1996) effectively did. In practice, performing random optimization can lead to very slightly better solutions than greedy optimization on most, but not all, tasks. However, it tends to require about twice as many flips because they are, on average, less effective. Another possibility is to start with a completely random initial configuration rather than one produced by the random projection technique. Again,
an optimization starting from a random initial configuration tends to take about twice as long. When D is large, there is little or no difference in performance. Interestingly, when D is small, starting from a random initial configuration leads to significantly better results on the Exemplar task but much worse results on the Word task. An additional thought is that rather than flipping bits one at a time, several bits could be flipped at once. This would make the algorithm more like a discrete version of the gradient descent methods, in which all vector components are updated simultaneously. Perhaps flipping several bits at once would add noise that could help propel the minimization past local minima. Several variants of this idea were tested. In the first, a random subset of the bits in the gain heap was flipped simultaneously. Each bit in the heap was chosen with a probability that ranged from 5% to 25% across trials. Once there were fewer than about 10 bits in the heap, it was necessary to revert to the single-bit method or the minimization would never terminate. A second variant flipped the n bits with highest gain, where n was a specified fraction of the total number of bits with positive gain. Unfortunately, these methods produced very similar results to the single-bit method but were somewhat slower.

7 Greedy Bit-Flip with Squared Cost (GBFS) Method
GBFS is identical to GBF, except that a squared cost function is used, as recommended by Clouse and Cottrell (1996). This provides a greater penalty for vector pairs that are very far from their target distances. The disadvantage of the squared cost function is that we cannot assume independence when updating bit gains. Thus, when a bit is flipped, we must update the gains for all ND bits, slowing the algorithm considerably. If one wishes to run the algorithm until a local minimum is reached, this method will be more efficient than the Clouse and Cottrell (1996) technique, because the latter suffers from inefficiency when few bad bits remain. However, if the algorithm is to be terminated well short of convergence, the Clouse and Cottrell (1996) method will most likely be faster, because it avoids the overhead of maintaining the gain heap.

8 Greedy Max Cut (GMC) Method
The final algorithm, known as the greedy max cut (GMC) method, also operates primarily in bit space. It starts with an empty N × D matrix of bit vectors and iterates through the columns of bits, choosing the value of the first bit for each vector, then the second bit for each vector, and so on. Once all of the bits have been chosen, it iterates through the columns of the matrix again several times, adjusting the bits where necessary.
Aside from the difference in initialization, this algorithm can be distinguished from GBF and GBFS, and from the earlier method of Clouse and Cottrell (1996), primarily in how it identifies bits that need to be flipped. Rather than selecting bits greedily or at random, the GMC algorithm systematically tests each bit in the matrix. This requires simpler record keeping than GBF, as we no longer need to maintain the gain of each bit. Like the Clouse and Cottrell method, GMC maintains only the current Hamming distances and target distances between all pairs of vectors. As in GBFS, the cost function minimized in this algorithm is the sum-squared difference between the actual Hamming distances, d_{ij}, and the correlation distances of the original vectors, scaled by D:

\text{Cost} = C = \sum_{i<j} (d_{ij} - t_{ij})^2.
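This cost is straightforward to evaluate directly; a minimal sketch (the function name is illustrative, and both distance matrices are assumed symmetric):

```python
import numpy as np

def squared_cost(d, t):
    """Sum-squared difference between Hamming distances d and scaled
    target distances t (both N x N), taken over the upper triangle."""
    iu = np.triu_indices(d.shape[0], k=1)
    return float(np.sum((d[iu] - t[iu]) ** 2))
```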
8.1 Filling the Columns. At all times, the algorithm maintains the current set of Hamming distances between vector pairs. It begins with vectors of all 0s. It then cycles through the D columns of the bit-vector matrix, filling each column so as to minimize the overall cost. This is called the "greedy max cut" method because the process of filling the columns chooses each bit so as to greedily reduce the overall cost and is related to the well-known maximum cut graph partitioning problem. Consider a complete, undirected graph with N vertices, corresponding to the N items, with the weight of edge ij equal to 1 − 2(t_{ij} − d_{ij}). The problem of finding the assignment of bits in column k that minimizes the squared cost is equivalent to finding the partitioning of the graph that maximizes the weight on edges crossing the partition. The items on the same side of the partition receive the same value for bit k. This is the maximum cut problem, which is known to be NP-complete (Karp, 1972).³ Therefore, it seems likely that no algorithm exists to produce an optimal solution to the problem in polynomial time. However, several fast approximation algorithms are known that will produce solutions to the maximum cut problem guaranteed to be within a certain percentage of the optimal value. The earliest such approximation algorithm is that of Sahni and Gonzales (1976), which guaranteed that the solution found would be at least half of the optimal value. The method employed here, in its first pass, is quite similar to the Sahni and Gonzales algorithm. When filling column k, the first item receives a random bit. Each subsequent item is given the bit value that results in a lower overall cost, computed over the preceding items.

³ Technically, our bit assignment problem does not exactly reduce to maximum cut because the latter normally permits only positive edge weights, whereas our graph has both positive and negative weights. Although this form of reduction does not prove that the bit assignment problem is NP-hard, it is suggestive of the difficulty of the problem.

The contribution to the overall cost of item i that
results from selecting the value i_k = 0 will be

C^0_{ik} = \sum_{j<i,\, j_k=0} (d_{ij} - t_{ij})^2 + \sum_{j<i,\, j_k=1} (d_{ij} + 1 - t_{ij})^2,

where j iterates over all items for which bit k has been chosen. The first summation includes only those items for which bit j_k is 0, and the second summation includes only those items for which bit j_k is 1. Thus, if i_k = 0, the distance to items for which j_k = 1 will increase by one. Similarly, the cost for choosing i_k = 1 will be

C^1_{ik} = \sum_{j<i,\, j_k=0} (d_{ij} + 1 - t_{ij})^2 + \sum_{j<i,\, j_k=1} (d_{ij} - t_{ij})^2.

If C^0_{ik} < C^1_{ik}, i_k is set to 0, otherwise to 1. In practice, it would be inefficient to compute those entire expressions. It is better just to compute their difference, which is given by

C^0_{ik} - C^1_{ik} = \sum_j (2 j_k - 1)\,\bigl(2(d_{ij} - t_{ij}) + 1\bigr).
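The single greedy pass over one column, using this difference, might be sketched as follows (a sketch, with illustrative names: `bits` is the N × D bit matrix, `d` the current Hamming distances, and `t` the scaled target distances, updated in place):

```python
import numpy as np

def fill_column(bits, d, t, k, rng):
    """Greedily fill column k: each item takes the bit value minimizing
    the squared cost over the items already assigned, via the difference
    C0 - C1; Hamming distances are updated as bits are set."""
    n = bits.shape[0]
    bits[0, k] = rng.integers(2)                 # first item: random bit
    for i in range(1, n):
        jk = bits[:i, k]                         # bits chosen so far
        diff = np.sum((2 * jk - 1) * (2 * (d[i, :i] - t[i, :i]) + 1))
        bits[i, k] = 0 if diff < 0 else 1        # C0 < C1 means choose 0
        moved = np.flatnonzero(jk != bits[i, k]) # pairs whose distance grows
        d[i, moved] += 1
        d[moved, i] += 1
```

Because the first bit is random, repeated runs can produce a column or its complement, which is why the text notes that some columns of the final matrix may have all of their bits reversed between runs.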
8.2 Adjusting the Columns. This single-pass assignment of the bits in a column can achieve only a rough approximation to the optimal partitioning. It can be further refined by iterating through the items and flipping their bits when doing so results in lowered cost. In this adjustment phase, the bits are not cleared in advance, and the cost function for item i is computed over all other items, not just the preceding items. Thus, the value of each bit in the column is reconsidered given the values of the other bits. Subsequent adjustments will refine the assignment further, but the number of bits that are changed each time gradually decreases. It is useful to perform at least two of these primary adjustments to each column before filling the next column. Once all of the columns are filled, it is helpful to cycle through them several more times, readjusting the bits in each one when the cost improves. These are known as secondary adjustments. To clarify: the primary adjustments occur to each column before the next column is filled; the secondary adjustments occur once all of the columns have been filled. In evaluating this algorithm, two primary adjustments and eight secondary adjustments were used.

8.3 Running Time. Unlike the gradient descent or bit-flipping methods, the running time of this algorithm is easily predicted. Filling or adjusting each column requires Θ(N²) operations. If the total number of primary and secondary adjustments is a, the running time of the algorithm is Θ(N²D(a + 1)).
Binary Multidimensional Scaling
8.4 Variants. The obvious parameters affecting this method are the numbers of primary and secondary adjustments. The first few adjustments result in significant improvement, but further adjustments have greatly diminishing returns. There is a trade-off in balancing the number of primary and secondary adjustments. One could rely on all-primary or all-secondary adjustments, but performance is better with some of each. Holding the total number of adjustments fixed, it is generally best to use two or three primary adjustments and a greater number of secondary ones.

As described, the GMC algorithm is completely deterministic. Multiple runs on the same set of items will result in effectively the same vectors, although some columns will have all of their bits reversed because the first bit in each column was selected randomly. It seems plausible that this method could tend to hit local minima because the bits are always updated in the same order in the adjustment phases. A reasonable variant would be to adjust the columns in randomly permuted order. However, experiments with this on the Word task found equivalent or slightly worse performance than the simpler deterministic method.

Another possibility is to alter the order of traversal in the secondary adjustment phase. Rather than traversing the columns, one might traverse the rows. This has the effect of adjusting a single point relative to all of the other points before moving to the next point. In contrast, column traversal adjusts all points along a single dimension before considering the next dimension. One might expect these two adjustment methods to produce different results, but in practice, they seem to result in virtually identical performance. An alternative is to select bits for possible adjustment at random, as in the Clouse and Cottrell (1996) algorithm. Again, one might expect this to avoid local minima better.
However, equating for the number of bits tested, random flipping has proved to be slightly, but not much, worse than either of the systematic updating methods. Thus, in its secondary phase, GMC does not differ significantly from the earlier method. The most important difference between the algorithms is the method of initializing the bit matrix. Replacing the primary bit-assignment phase of GMC with a random assignment of bits, but still equating for the total number of bits tested, results in significantly worse performance. Replacing it with a random projection, as in GBF, also degrades performance, but less so than random initialization.

This algorithm can be easily adapted to any cost function that has the form of a sum over independent contributions from each item pair. It could efficiently handle the stress measure if the numerator and denominator were stored and updated incrementally. It would be more difficult to adapt the algorithm to a nonmetric cost function. One would need a fast, incremental version of the up-down algorithm to maintain the optimal monotonic transformation of the vector distances as bits are flipped. When amortized over multiple flips, this may still be more than a constant-time operation and would thus add to the asymptotic complexity.
Douglas L. T. Rohde
Finally, as in the GBF and GBFS methods, it is easy to add a term to the cost function that severely penalizes duplicate vectors and thus ensures that each item has a unique representation, which is sometimes necessary.

9 Performance Analysis
In this section, we present comparisons of the six implemented algorithms: SVD, MGD, OGD, GBF, GBFS, and GMC. The methods were applied to the Exemplar and Word tasks with varying bit-vector dimensionalities, D, and were evaluated using three measures of the agreement between the original pairwise distances and the final bit-vector Hamming distances: goodness, metric stress, and nonmetric stress. The average running times of the algorithms were also measured and are reported in section 9.4.

Three trials of each condition were run on a 500 MHz 21264 Alpha processor and the results averaged. The only exceptions were the few trials of SVD and GBFS that lasted more than 10 hours, which were run only once. In general, the results of the methods were all very consistent across trials. The SVD and GMC algorithms are deterministic and thus achieve the same results every time. The other algorithms achieve goodness or stress ratings that vary by about 1% between trials for small values of D and by less than 0.2% for D ≥ 100. Because of the small variance, the limited number of trials, and the general clutter of the figures, error bars are not shown.

9.1 The Exemplar Task. The average goodness ratings of the six algorithms on the Exemplar task, as a function of D, are shown in Figure 1. With the exception of SVD, the goodness increases monotonically with the size of the vectors. SVD is clearly the worst of the methods. Interestingly, in that case, the goodness actually decreases for D values over 20. This is presumably because, with larger D, the SVD method begins to rely on less important singular vectors, which convey little useful information. There is no clear winner between OGD and MGD. OGD may be better for lower D but worse for higher D. The three best algorithms, according to goodness, are the bit-space optimizing methods: GBF, GBFS, and GMC. GBF does not do as well for small D.
With the exception of D = 10, GMC achieves the best goodness in every case, although the differences with GBF and GBFS are inconsequential for large D. The metric and nonmetric stress on the Exemplar task are shown in Figures 2 and 3, respectively. SVD does so poorly according to metric stress (from 18.6 for D = 10 to 0.34 for D = 200) that it does not appear on the graph. SVD also does quite poorly according to nonmetric stress, although there it is at least comparable to the other methods. Interestingly, although it did not according to goodness, the performance of SVD monotonically improves with D according to the stress measures. MGD is mediocre according to both measures. OGD does rather poorly according to metric stress. This should not be too surprising since OGD is
[Figure 1: Goodness on the Exemplar task. Goodness versus dimensionality of bit vectors (D), for SVD, MGD, OGD, GBF, GBFS, and GMC.]
only optimizing nonmetric stress, and its resulting pairwise distances are unlikely to match the target distances directly. However, OGD is still not as good as the bit-space methods even on nonmetric stress. GBFS and GMC are fairly indistinguishable according to either measure, although, with the exception of D = 10, GMC is slightly better in all cases. GBF is worse than the other two for D = 10 but is nearly as good at higher dimensionality.

9.2 The Word Task. The Word task has a much more complex similarity structure than the artificial Exemplar task and thus may provide a better measure of the ability of the BMDS methods on other scaling problems involving natural data. Because it has 5000 items and the original vectors have 4000 dimensions rather than 1000, the Word task is computationally harder as well. The goodness measure is shown in Figure 4, and the stress measures are depicted in Figures 5 and 6. Once again, SVD does quite poorly. It makes little improvement in goodness with higher dimensionality. It is off the metric stress scale, except for
[Figure 2: Metric stress on the Exemplar task. Metric stress versus dimensionality of bit vectors (D), for SVD, MGD, OGD, GBF, GBFS, and GMC.]
the case of D = 200, and its nonmetric stress is much worse than that of the other methods. Again, however, it does make steady improvement with higher D according to the stress measures, but not according to goodness.

OGD is the clear winner on the goodness scale, especially for small D. This is surprising since it performed quite poorly on the Exemplar task. According to metric stress, OGD again does not do very well, but it is the best method by nonmetric stress for D ≤ 50. Nevertheless, based on nonmetric stress, it is not clearly dominant over the other methods as it is on the goodness scale. The next section addresses this apparent disparity between goodness and the stress measures.

As measured by goodness, MGD does quite well but is not close to the performance of OGD. According to stress, MGD is just average. GMC and GBFS have quite similar performance, but GMC is consistently a little better on all three measures at all values of D. GBF is reasonably good at high values of D, but its performance on all measures drops off with small D.
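For concreteness, goodness is identified elsewhere in this article with the correlation between the target distances and the resulting bit-vector Hamming distances. A minimal sketch (with hypothetical helper names of my own) is:

```python
import numpy as np
from itertools import combinations

def hamming_pairs(bits):
    """Pairwise Hamming distances among 0/1 row vectors (sketch)."""
    b = np.asarray(bits)
    return np.array([np.sum(b[i] != b[j])
                     for i, j in combinations(range(len(b)), 2)])

def goodness(target, actual):
    """Goodness taken as the Pearson correlation of the two distance sets."""
    return np.corrcoef(target, actual)[0, 1]
```

For example, the vectors 00, 01, and 11 have pairwise Hamming distances [1, 2, 1]; if those match the targets exactly, the goodness is 1.0.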
[Figure 3: Nonmetric stress on the Exemplar task. Nonmetric (ordinal) stress versus dimensionality of bit vectors (D), for SVD, MGD, OGD, GBF, GBFS, and GMC.]
9.3 Stress versus Goodness. The stress and goodness measures do not always agree. At times, one method will perform better than another according to goodness but worse according to stress. For example, on the Word task with 50 bits, OGD achieves an average goodness of 0.843, while the GMC vectors have a goodness of only 0.741. However, the GMC vectors have a metric stress of 0.109, which is better than the 0.133 of the OGD vectors. According to nonmetric stress, GMC and OGD are very similar (0.104 versus 0.102). Because the measures do not always agree, it is important to choose an evaluation measure that is appropriate for a given task. So let us briefly compare the properties of these measures.

We begin by trying to understand what aspects of the GMC and OGD solutions might have led to the disagreement among evaluation measures. One common way to evaluate MDS results visually is through the use of a Shepard diagram, a scatter plot with one point for every pair of items. The horizontal axis represents the distance between the original pair of vectors
[Figure 4: Goodness on the Word task. Goodness versus dimensionality of bit vectors (D), for SVD, MGD, OGD, GBF, GBFS, and GMC.]
(or subjects’ similarity ratings if that is the starting point). The vertical axis represents the actual distance between the vectors in the reduced space. Ideally, the points should fall on the identity line. However, because the Word task involves 5000 items, a standard Shepard plot would contain 12.5 million points. So many points are hard to handle, and estimating their density is difficult. Therefore, the modified version of a Shepard plot used here is something like a 2D histogram. The graph is partitioned into a grid of cells, and the number of points falling into each cell is counted. Then the columns of cells are normalized so that the values in each column sum to 1. Each cell is then plotted as a circle whose area is proportional to the normalized cell value. The ideal graph will be linear and have little vertical spread. These “Shepard histograms” for one run of the GMC and OGD algorithms are shown in Figures 7 and 8.

This method of normalizing columns of cells disguises the fact that the vast majority of pairs have target distances close to 25, indicating that their
[Figure 5: Metric stress on the Word task. Metric stress versus dimensionality of bit vectors (D), for SVD, MGD, OGD, GBF, GBFS, and GMC.]
original vectors were uncorrelated. Therefore, most of the data points in the graph actually fall in this middle range. This can be seen by performing the normalization across all cells, not just along the individual columns. Figure 9 displays the same data as Figure 8, but with normalization across all cells. The cells representing very small or large correlation distances are barely visible because they contain so few points.

Figure 7 shows that the GMC algorithm finds a fairly linear solution. However, the solution appears to have a negative zero crossing and is somewhat warped for small target distances. Thus, vectors that were originally quite similar tend to be even more similar in the bit space. There is also greater variance for the columns on the left, although this partially results from the fact that they contain very few points. The plot of the OGD solution in Figure 8 is noticeably less linear, having a more pronounced sigmoidal shape. Like the GMC solution, small distances tend to be too small in the final space. However, in this case, large distances
[Figure 6: Nonmetric stress on the Word task. Nonmetric (ordinal) stress versus dimensionality of bit vectors (D), for SVD, MGD, OGD, GBF, GBFS, and GMC.]
tend to be exaggerated as well. This partially explains why OGD does better according to goodness, or correlation, worse according to metric stress, and almost equivalently according to nonmetric stress.

Nonmetric stress is essentially indicative of the variance of the columns in the Shepard diagram but is insensitive to the mean value in each column. In this case, the two methods have fairly similar variance, resulting in similar nonmetric stress. Metric stress, on the other hand, measures the disparity between the points and the identity function. Any deviation from this line, whether for large or small target distances, contributes equally to the stress. Metric stress is sensitive to both noise and monotonic distortions, the latter having a relatively strong effect. Thus, the OGD solution has a higher metric stress.

The correlation measure is a bit harder to characterize analytically. But some simple empirical tests involving a few artificial data sets show that correlation is actually fairly insensitive to monotonic distortions. First, a set
[Figure 7: Distribution of pairwise bit vector Hamming distances versus original (target correlation) distances for a run of the GMC method on the Word task with 50 bits. Cells are normalized by columns.]
of random numbers evenly distributed between 0 and 1 was generated. A second set of numbers was produced by transforming each value in the first set, and the correlation and stress of the number pairs were measured.

Monotonic distortions have a greater effect on stress. If the second set is composed of the square roots of the first set, there is little effect on the correlation, which is 0.98, but the stress rises to 0.098. If cube roots are used, the correlation is still fairly high, 0.959, but the stress is 0.229. Finally, if a sigmoid centered around 0.5 with a gain of 2 is applied to the numbers to create an S-shaped transformation as in Figure 8, the stress is moderate, 0.067, but the correlation is virtually unchanged at 0.9998.

In contrast, adding noise to the values has a larger effect on correlation than on stress. If 0.15 is either added to or subtracted from each value with equal chance, the correlation drops to 0.887, but the stress is again 0.067. Thus, in this case of noise, the stress is equal to or lower than it was with the
[Figure 8: Distribution of pairwise bit vector Hamming distances versus original (target correlation) distances for a run of the OGD method on the Word task with 50 bits. Cells are normalized by columns.]
monotonic transformations, but the correlation is much worse. The implication of this is that the correlation measure is fairly ordinal in its behavior. Indeed, measuring the correlation on the example tasks using the monotonically transformed targets, d̂_ij, rather than the actual targets, t_ij, generally results in only a slight improvement.

9.4 Running Time. Finally, we turn to the issue of the running time of the various algorithms. Regardless of how good the results may be, an algorithm is useful only if it can solve a given task in an acceptable time frame. The running times of SVD and GMC are fairly easy to analyze because they are deterministic. The method used here to compute the SVD is Θ(N²(N + M) + ND). The ND term is for assigning the bits and is inconsequential. If M < N, the matrix is inverted, resulting in a Θ(M²(M + N)) algorithm.
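The correlation-versus-stress comparison of section 9.3 is easy to reproduce. The sketch below assumes metric stress of the form sum((x − t)²) / sum(t²), which is consistent with the values reported there (0.098 for square roots, 0.229 for cube roots, 0.067 for ±0.15 noise); the names are my own.

```python
import numpy as np

def metric_stress(x, t):
    """Metric stress, assumed here to be sum((x - t)^2) / sum(t^2)."""
    return np.sum((x - t) ** 2) / np.sum(t ** 2)

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, 100_000)

# Monotonic distortion: square roots barely change the correlation
# but raise the stress substantially.
x = np.sqrt(t)
r_sqrt, s_sqrt = np.corrcoef(x, t)[0, 1], metric_stress(x, t)

# Additive noise: +/-0.15 hurts correlation more than stress.
x = t + rng.choice([-0.15, 0.15], size=t.size)
r_noise, s_noise = np.corrcoef(x, t)[0, 1], metric_stress(x, t)
```

With 100,000 samples, the square-root distortion leaves the correlation near 0.98 while the stress approaches 0.1, whereas the ±0.15 noise pulls the correlation down toward 0.89 with a stress near 0.07, in line with the values quoted above.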
[Figure 9: The same data as in Figure 8, normalized across all cells rather than by columns.]
The other algorithms require that all pairwise distances between the vectors be computed, which is a Θ(N²M) process. However, because that step is common to all of them, it was done in advance and the distances stored. It is therefore not included in the measured running times. It is worth noting that although they are both Θ(N²M), computing the vector distances is much quicker than computing the SVD due to its much smaller constant factor. Following the distance computation, the GMC algorithm is Θ(N²D), assuming the number of adjustments is treated as a constant. The gradient descent and bit-flipping algorithms, on the other hand, are difficult to analyze because it is not clear when they will terminate. They are suspected to be roughly Θ(N²D), however, and we turn to some empirical measures to verify this. Figures 10 and 11 show the scaled running times of the methods as a function of D on the Exemplar and Word tasks. Because the times were
[Figure 10: Running time as a function of D on the Exemplar task (scaled by 1/D), for SVD, MGD, OGD, GBF, GBFS, and GMC.]
expected to be roughly linear in D, they were all divided by D in producing the graphs. Therefore, a flat line indicates a truly linear algorithm. SVD, because its running time is essentially constant, has decreasing curves in both graphs. SVD was so slow on the Word task, however, that it appears on that graph only for D = 200. MGD and OGD seem to be fairly linear in D. Both are a bit slower for very small D on the Exemplar task, possibly because they had difficulty settling on a good solution with so few bits. OGD was significantly quicker on the Word task but slower on the Exemplar task. The GBF method appears to be somewhat worse than linear, its curves noticeably rising on the right side. GBFS, on the other hand, is essentially quadratic in D.

The algorithm that is consistently fastest, other than the ineffective SVD method, is GMC. On the Word task with 200 bits per vector, GMC is almost three and a half times faster than the next fastest method, GBF.

[Figure 11: Running time as a function of D on the Word task (scaled by 1/D), for SVD, MGD, OGD, GBF, GBFS, and GMC.]

Although GMC is known to be truly linear in D, its scaled running times actually decrease with larger D. This reflects the fact that lower-order terms, such as the N² cost of loading the pairwise distances, have a relatively diminishing effect on the overall time. This indicates that the other methods that appeared to be linear due to flat lines may actually be slightly worse.

Figure 12 depicts the running times of the methods for varying numbers of items, N, on the Word task. In this case, the running times have been divided by N². GMC is known to be quadratic in N. Therefore, the slight rise in its line is due to either lower-order terms or caching inefficiencies resulting from the increased memory requirements of the larger problems. GBF has a similar profile and is thus nearly quadratic in N as well. GBFS and MGD, on the other hand, are clearly worse than N². OGD may be slightly worse than quadratic, but it is not clear. SVD ranged from 4.3 times slower than MGD for N = 500 to 9.5 times slower for N = 5000.
[Figure 12: Running time as a function of N on the Word task (scaled by 1/N²), for SVD, MGD, OGD, GBF, GBFS, and GMC.]
10 Discussion
Although multidimensional scaling techniques have been studied for over half a century, binary multidimensional scaling, which was inspired by the need to develop representations usable in training neural networks, is a relatively new, and intriguing, problem. This study has introduced and evaluated several algorithms for performing binary multidimensional scaling. It is hoped that the better methods will prove useful to researchers in their current forms and that the discussion of the less effective methods in this article will help to direct future attempts to improve on these techniques.

10.1 SVD. Although it is so useful in other types of scaling problems, the SVD method is not a good choice for BMDS. It consistently achieved the worst performance. For the Word task, this came at the greatest cost in terms of running time. Although it is possible that improved discretization
methods could achieve better BMDS performance using the SVD, there is still the issue of the running time. Unless either the number or the dimensionality of the original vectors is quite small, simply computing the SVD is prohibitively expensive.

10.2 MGD and OGD. The gradient descent methods, which are the most closely related to techniques currently in use in standard MDS, show some promise for BMDS. They have the advantage that they can be used with any differentiable cost function and are thus extremely flexible. Although they were slower than GBF and GMC on these tasks, most of the improvement in their results occurs early in the gradient descent, and the process can be cut short with relatively little effect on performance.

On the Exemplar task, OGD and MGD were not as good as the bit-space methods by any measure. However, they performed very well according to the goodness measure on the Word task, especially OGD. OGD was also the best according to nonmetric stress for small D on the Word task. Unless one is concerned with using a nonmetric method, OGD seems to be a better choice than MGD. It generally achieves superior performance and also converges more quickly. This is partially due to the fact that it is nonmetric and partially due to the use of sigmoidally transformed values in computing the vector distances. It is true, however, that the time to convergence of MGD could be reduced with the use of a polarizing cost function. Innumerable variants of these methods are possible, and it is likely that both could be improved with further work.

10.3 GBF and GBFS. The GBFS method is essentially the one used by Clouse and Cottrell (1996), although the current implementation begins with a random projection and is operated in a greedy fashion rather than by flipping random bits with positive gain. GBFS consistently produces very good solutions. Unfortunately, it suffers from being quadratic in D and more than quadratic in N.
Because it uses a linear cost function, the GBF method is able to cut corners and run much more quickly, with only a modest loss in performance on the Exemplar task. On the Word task, GBF does not do as well and has particular trouble with small D. Nevertheless, both GBF and GBFS appear to be strictly worse than GMC, in terms of running time as well as performance. Although the speed of both methods could be improved by terminating the optimization early, this would hurt performance.

10.4 GMC. The GMC algorithm seems to be currently the best overall choice for BMDS. It is the fastest of the algorithms and produces the best or nearly the best results according to the stress measures, and it also achieved the best goodness scores on the Exemplar task. But it should be noted that OGD did achieve better goodness ratings on the Word task, and thus may
be preferable in cases where nonmetric solutions are acceptable and the similarity structure is relatively complex.

The GMC algorithm has a number of other advantages. Its running time is well understood. Unlike the gradient descent and bit-flipping methods, GMC runs for a consistent and predictable amount of time. As with GBF and GBFS, but unlike the gradient descent methods, GMC can be modified to produce only unique vectors by adding a term to the cost function. This may be a requirement for some applications of BMDS. For example, it could be problematic if two different words are assigned exactly the same meaning. GMC can also be used with a variety of cost functions, although it is not quite as flexible as the gradient descent methods in this regard because the cost must be incrementally calculable. Finally, GMC is very easy to implement. Unlike the gradient descent methods, there are no learning rates or other parameters to adjust or complex derivatives to compute. Unlike the bit-flipping methods, the algorithm is simple and straightforward, with minimal record keeping.

10.5 Conclusion. With the exception of SVD and possibly GBFS, the binary multidimensional scaling methods presented here are capable of handling problems of reasonably high complexity. However, even GMC, with a running time of Θ(N²(M + D)), will not scale up well to problems with hundreds of thousands of items or dimensions. To solve such large problems, more efficient, though perhaps less effective, techniques will be needed. One possibility is to use a limited set of R reference items. All items are positioned relative to those in the reference set but not necessarily relative to one another. If the dimensionality of the final space is not too large, the reference vectors may sufficiently constrain the positions of the other items relative to one another to produce a good overall solution.
Variants of this idea are possible with all of the algorithms presented here, although not always conjointly with the uniqueness constraint offered by the bit-space methods. Code for any of these methods can be obtained by contacting the author.

Acknowledgments
This research was supported by NIMH NRSA Grant 5-T32-MH19983 for the study of Computational and Behavioral Approaches to Cognition. Special thanks to David Plaut for his comments and guidance, Santosh Vempala for his brilliance, and Daniel Clouse and Stephen J. Hanson for their helpful comments.

References

Beatty, M., & Manjunath, B. S. (1997). Dimensionality reduction using multidimensional scaling for image search. In Proceedings of the IEEE International Conference on Image Processing (Vol. 2, pp. 835–838). Santa Barbara, CA.
Berry, M. W., Dumais, S. T., & O’Brien, G. W. (1994). Using linear algebra for intelligent retrieval (Tech. Rep. No. CS-94-270). Knoxville, TN: University of Tennessee.
Borg, I., & Groenen, P. (1997). Modern multidimensional scaling. New York: Springer-Verlag.
Burgess, C. (1998). From simple associations to the building blocks of language: Modeling meaning in memory with the HAL model. Behavior Research Methods, Instruments, and Computers, 30, 188–198.
Clouse, D. S. (1998). Representing lexical semantics with context vectors and modeling lexical access with attractor networks. Unpublished doctoral dissertation, University of California, San Diego.
Clouse, D. S., & Cottrell, G. W. (1996). Discrete multi-dimensional scaling. In Proceedings of the 18th Annual Conference of the Cognitive Science Society (pp. 290–294). Mahwah, NJ: Erlbaum.
Deerwester, S. C., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–407.
Deza, M. M., & Laurent, M. (1997). Geometry of cuts and metrics. Berlin: Springer-Verlag.
Frieze, A., Kannan, R., & Vempala, S. (1998). Fast Monte-Carlo algorithms for finding low-rank approximations. In Proceedings of the 39th IEEE Symposium on Foundations of Computer Science. Los Alamitos, CA: IEEE Computer Society Press.
Karp, R. M. (1972). Reducibility among combinatorial problems. In R. Miller & J. Thatcher (Eds.), Complexity of computer computations (pp. 85–103). New York: Plenum Press.
Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27.
Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115–129.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28, 203–208.
Pearlmutter, B. A. (1989).
Learning state space trajectories in recurrent neural networks. Neural Computation, 1, 263–269.
Plaut, D. C., & Shallice, T. (1993). Deep dyslexia: A case study of connectionist neuropsychology. Cognitive Neuropsychology, 10, 377–500.
Richardson, M. W. (1938). Multidimensional psychophysics. Psychological Bulletin, 35, 659.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.
Sahni, S., & Gonzales, T. (1976). P-complete approximation problems. Journal of the ACM, 23, 555–565.
Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27, 125–139, 219–246.
Shepard, R. N. (1966). Metric structures in ordinal data. Journal of Mathematical Psychology, 3, 287–315.
Shepard, R. N., Romney, A. K., & Nerlove, S. B. (1972). Multidimensional scaling. Vol. 1: Theory. New York: Seminar Press.
Torgerson, W. S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401–419.
Torgerson, W. S. (1965). Multidimensional scaling of similarity. Psychometrika, 30, 379–393.
Williams, J. W. J. (1964). Algorithm 232: Heapsort. Communications of the ACM, 7, 347–348.
Young, G., & Householder, A. S. (1938). Discussion of a set of points in terms of their mutual distances. Psychometrika, 3, 19–22.

Received November 30, 2000; accepted September 5, 2001.
ARTICLE
Communicated by Philip Sabes
Cosine Tuning Minimizes Motor Errors

Emanuel Todorov
[email protected] Gatsby Computational Neuroscience Unit, University College London, London, U.K.
Cosine tuning is ubiquitous in the motor system, yet a satisfying explanation of its origin is lacking. Here we argue that cosine tuning minimizes expected errors in force production, which makes it a natural choice for activating muscles and neurons in the final stages of motor processing. Our results are based on the empirically observed scaling of neuromotor noise, whose standard deviation is a linear function of the mean. Such scaling predicts a reduction of net force errors when redundant actuators pull in the same direction. We confirm this prediction by comparing forces produced with one versus two hands and generalize it across directions. Under the resulting neuromotor noise model, we prove that the optimal activation profile is a (possibly truncated) cosine—for arbitrary dimensionality of the workspace, distribution of force directions, correlated or uncorrelated noise, with or without a separate cocontraction command. The model predicts a negative force bias, truncated cosine tuning at low muscle cocontraction levels, and misalignment of preferred directions and lines of action for nonuniform muscle distributions. All predictions are supported by experimental data.

1 Introduction

Neurons are commonly characterized by their tuning curves, which describe the average firing rate f(x) as a function of some externally defined variable x. The question of what constitutes an optimal tuning curve for a population code (Hinton, McClelland, & Rumelhart, 1986) has attracted considerable attention. In the motor system, cosine tuning has been well established for motor cortical cells (Georgopoulos, Kalaska, Caminiti, & Massey, 1982; Kettner, Schwartz, & Georgopoulos, 1988; Kalaska, Cohen, Hyde, & Prud’homme, 1989; Caminiti, Johnson, Galli, Ferraina, & Burnod, 1991) as well as for individual muscles (Turner, Owens, & Anderson, 1995; Herrmann & Flanders, 1998; Hoffman & Strick, 1999).
¹ When a subject exerts isometric force or produces movements, each cell and muscle is maximally active for a particular direction of force or movement (called the preferred direction), and its activity falls off with the cosine of the angle between the preferred and actual direction.

Emanuel Todorov, Neural Computation 14, 1233–1260 (2002), © 2002 Massachusetts Institute of Technology

The robustness of cosine
tuning suggests that it must be optimal in some meaningful sense, yet a satisfactory explanation is lacking. In this article, we argue that cosine tuning in the motor system is indeed optimal in the most meaningful sense one can imagine: it minimizes the net effect of neuromotor noise, resulting in minimal motor errors. The argument developed here is specific to the motor system. Since it deviates from previous analyses of optimal tuning, we begin by clarifying the main differences.

1.1 Alternative Approaches to Optimal Tuning

The usual approach (Hinton et al., 1986; Snippe, 1996; Zhang & Sejnowski, 1999b; Pouget, Deneve, Ducom, & Latham, 1999) is to equate the goodness of a tuning function f with how accurately the variable x can be reconstructed from a population of noisy responses μ1 + ε1, …, μn + εn, where μi = f(x − ci) is the mean response of neuron i with receptive field center ci. This approach to the analysis of empirically observed tuning is mathematically appealing but involves hard-to-justify assumptions: In the absence of data on higher-order correlations and in the interest of analytical tractability, oversimplified noise models have to be assumed.² In contrast, when the population responses are themselves the outputs of a recurrent network, the joint noise distribution is likely to be rather complex. Ignoring that complexity can lead to absurd conclusions, such as an apparent increase of information (Pouget et al., 1999). Since the actual reconstruction mechanisms used by the nervous system as well as their outputs are rarely observable, one has to rely on theoretical limits (i.e., the Cramér-Rao bound), ignoring possible biological constraints and noise originating at the reconstruction stage. Optimality criteria that may arise from the need to perform computation (and not just represent or transmit information) are also ignored.
Even if these assumptions are accepted, it was recently shown (Zhang & Sejnowski, 1999b) that the optimal tuning width is biologically implausible: as narrow as possible³ when x is one-dimensional, irrelevant when x is two-dimensional, and as broad as possible when x is more than two-dimensional. Thus, empirical observations such as cosine tuning are difficult to interpret as being optimal in the usual sense.
² The noise terms ε1, …, εn are usually modeled as independent or homogeneously correlated Poisson variables.
³ The finite number of neurons in the population prevents infinitely sharp tuning (i.e., the entire range of x has to be covered), but that is a weak constraint since a given area typically contains large numbers of neurons.
Cosine Tuning Minimizes Motor Errors
In this article, we pursue an alternative approach. The optimal tuning function f* is still defined as the one that maximizes the accuracy of the reconstruction μ1 + ε1, …, μn + εn → x̂. However, we do not speculate that the input noise distribution has any particular form or that the reconstruction is optimal. Instead, we use knowledge of the actual reconstruction mechanisms and measurements of the actual output x̂, which in the motor system is simply the net muscle force.⁴ That allows us to infer a direct mapping μ1, …, μn → Mean(x̂), Var(x̂) from the mean of the inputs to the mean and variance of the output. Once such a mapping is available, the form of the input noise and the amount of information about x that in principle could have been extracted become irrelevant to the investigation of optimal tuning.

1.2 Optimal Tuning in the Motor System

We construct the mapping μ1, …, μn → Mean(x̂), Var(x̂) based on two sets of observations, relating (1) the mean activations to the mean of the net force and (2) the mean to the variance of the net force. Under isometric conditions, individual muscles produce forces in proportion to the rectified and filtered electromyogram (EMG) signals (Zajac, 1989; Winter, 1990), and these forces add mechanically to the net force.⁵ Thus, the mean of the net force is simply the vector sum of the mean muscle activations μ1, …, μn multiplied by the corresponding force vectors u1, …, un (defining the constant lines of action). If the output cells of primary motor cortex (M1) contribute additively to the activation of muscle groups (Todorov, 2000), a similar additive model may apply for μ1, …, μn corresponding to mean firing rates in M1. In the rest of the article, μ1, …, μn will denote the mean activation levels of abstract force generators, which correspond to individual muscles or muscle groups. The relevance to M1 cell tuning is addressed in section 6.
⁴ We focus predominantly on isometric force tasks and extend our results to movement velocity and displacement tuning in the last section. Thus, the output (reconstruction) is defined as net force (i.e., vector sum of all individual muscle forces) in the relevant work space.
⁵ The contributions of different muscles to end-point force are determined by the Jacobian transformation and the tendon insertion points. Each muscle has a line of action (force vector) in end-point space, as well as in joint space. Varying the activation level under isometric conditions affects the force magnitude, but the force direction remains fixed.

Numerous studies of motor tremor have established that the standard deviation of the net force increases linearly with its mean. This has been demonstrated when tonic isometric force is generated by muscle groups (Sutton & Sykes, 1967) or individual muscles (McAuley, Rothwell, & Marsden, 1997). The same scaling holds for the magnitude of brief force pulses (Schmidt, Zelaznik, Hawkins, Frank, & Quinn, 1979). This general finding
is also confirmed indirectly by the EMG histograms of various muscles, which lie between a gaussian and a Laplace distribution, both centered at 0 (Clancy & Hogan, 1999). Under either distribution, the rectified signal |x| has standard deviation proportional to the mean.⁶ The above scaling law leads to a neuromotor noise model where each generator contributes force with standard deviation σ linear in the mean μ: σ = aμ. This has an interesting consequence. Suppose we had two redundant generators pulling in the same direction and wanted them to produce net force μ. If we activated only one of them at level μ, the net variance would be σ² = a²μ². If we activated both generators at level μ/2, the net variance (assuming uncorrelated noise) would be σ² = a²μ²/2, which is two times smaller. Thus, it is advantageous to activate all generators pulling in the direction of desired net force. What about generators pulling in slightly different directions? If all of them are recruited simultaneously, the noise in the net force direction will still decrease, but at the same time, extra noise will be generated in orthogonal directions. So the advantage of activating redundant actuators decreases with the angle away from the net force direction. The main technical contribution of this article is to show that it decreases as a cosine, that is, cosine tuning minimizes expected motor errors. Note that the above setting of the optimal tuning problem is in open loop; the effects of activation level on feedback gains are not explicitly considered. Such effects should be taken into account because coactivation of opposing muscles may involve interesting trade-offs: it increases both neuromotor noise and system impedance and possibly modifies sensory inputs (due to α–γ coactivation). We incorporate these possibilities by assuming that an independent cocontraction command C may be specified, in which case the net activity of all generators is constrained to be equal to C.
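The variance argument above is easy to check by simulation. The following sketch is not from the paper; the noise scale a = 0.05 and force level μ = 10 are arbitrary illustrative values, and gaussian noise with standard deviation aμ is assumed:

```python
# Monte Carlo check of the redundancy argument. The noise scale a and the
# force level mu are hypothetical values chosen for illustration.
import random
import statistics

random.seed(0)
a, mu, n = 0.05, 10.0, 200_000

# One generator at level mu: noise standard deviation a * mu.
one = [mu + random.gauss(0.0, a * mu) for _ in range(n)]

# Two redundant generators at mu / 2 each, with independent noise.
two = [(mu / 2 + random.gauss(0.0, a * mu / 2))
       + (mu / 2 + random.gauss(0.0, a * mu / 2)) for _ in range(n)]

ratio = statistics.pvariance(one) / statistics.pvariance(two)
print(ratio)  # near 2, i.e. the standard deviation drops by about sqrt(2)
```

The variance ratio of roughly 2 corresponds to the √2 reduction in standard deviation tested experimentally in section 2.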
⁶ For the Laplace distribution p_s(x) = (1/2s) exp(−|x|/s), the mean of |x| is s and the variance is s². For the 0-mean gaussian with standard deviation s, the mean of |x| is s√(2/π), and the variance is s²(1 − 2/π).

As shown below, the optimal tuning curve is a cosine regardless of whether C is specified. The optimal setting of C itself will be addressed elsewhere. In the next section, we present new experimental results, confirming the reduction of noise due to redundancy. The rest of the article develops the mathematical argument for cosine tuning rigorously, under quite general assumptions.

2 Actuator Redundancy Decreases Neuromotor Noise

The empirically observed scaling law σ = aμ implies that activating redundant actuators should reduce the overall noise level. This effect forms the basis of the entire model, so we decided to test it experimentally. Ideally, we would ask subjects to produce specified forces by activating one versus two
synergistic muscles and compare the corresponding noise levels. But human subjects have little voluntary control over which muscles they activate, so instead we used the two hands as redundant force generators: we compared the force errors for the same level of net instructed force produced with one hand versus both hands. The two comparisons are not identical, since the neural mechanisms coordinating the two hands may be different from those coordinating synergistic muscles of one limb. In particular, one might expect coordinating the musculature of both hands to be more difficult, which would increase the errors in the two-hands condition (opposite to our prediction). Thus, we view the results presented here as strong supporting evidence for the predicted effect of redundancy on neuromotor noise.
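The scaling law underlying this prediction can itself be illustrated numerically. The sketch below (all scales arbitrary) checks the claim of footnote 6 that the rectified signal |x| of a zero-mean gaussian has standard deviation proportional to its mean:

```python
# For a 0-mean gaussian with standard deviation s, the rectified signal |x|
# has mean s*sqrt(2/pi) and variance s^2*(1 - 2/pi), so std/mean is a
# constant independent of s. The scales s below are arbitrary.
import math
import random
import statistics

random.seed(1)
theory = math.sqrt(1 - 2 / math.pi) / math.sqrt(2 / math.pi)

ratios = []
for s in (0.5, 1.0, 2.0):
    xs = [abs(random.gauss(0.0, s)) for _ in range(100_000)]
    ratios.append(statistics.pstdev(xs) / statistics.fmean(xs))

print(theory, ratios)  # all sample ratios cluster around the same constant
```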
2.1 Methods

Eight subjects produced isometric forces of specified magnitude (3–33 N) by grasping a force transducer disk (Assurance Technologies F/T Gamma 65/5, 500 Hz sampling rate, 0.05 N resolution) between the thumb and the other four fingers. The instantaneous force magnitude produced by the subject was displayed with minimum delay as a vertical bar on a linear 0–40 N scale. Each of 11 target magnitudes was presented in a block of three trials (5 sec per trial, 2 sec between trials), and the subjects were asked to maintain the specified force as accurately as possible. The experiment was repeated twice: with the dominant hand and with both hands grasping the force transducer. Since forces were measured along the forward axis, the two hands can be considered as mechanically identical (i.e., redundant) actuators. To balance possible learning and fatigue effects, the order of the 11 force magnitudes was randomized separately for each subject (subsequent analysis revealed no learning effects). Half of the subjects started with both hands, the other half with the dominant hand. The first 2 seconds of each trial were discarded; visual inspection confirmed that the 2-second initial period contained the force transient associated with reaching the desired force level. The remaining 3 seconds (1500 sample points) of each trial were used in the data analysis.
2.2 Results

The average standard deviations are shown in Figure 1B for each force level and hand condition. In agreement with previous results, the standard deviation in both conditions was a convincingly linear function of the instructed force level. As predicted, the force errors in the two-hands condition were smaller, and the ratio of the two slopes was 1.42 ± 0.25 (95% confidence interval), which is indistinguishable from the predicted value of √2 ≈ 1.41. Two-way (2 conditions × 11 force levels) ANOVA with replications (eight subjects) indicated that both effects were highly significant (p < 0.0001), and there was no interaction effect (p = 0.57). Plotting standard deviation versus mean (rather than instructed) force produced very similar results.
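The slope comparison can be reproduced on synthetic data. In the sketch below every number (the slopes, intercept, and noise level) is made up; only the structure of the analysis, an ordinary least-squares fit per condition and the predicted √2 slope ratio, comes from the text:

```python
# Least-squares fit of standard deviation versus instructed force for two
# conditions, with the two-hands slope set smaller by sqrt(2). All numeric
# values are hypothetical.
import math
import random

random.seed(2)
forces = list(range(3, 34, 3))  # 11 levels, 3-33 N

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

b_one = 0.012  # hypothetical one-hand slope
sd_one = [b_one * f + 0.05 + random.gauss(0, 0.005) for f in forces]
sd_two = [b_one / math.sqrt(2) * f + 0.05 + random.gauss(0, 0.005) for f in forces]

ratio = slope(forces, sd_one) / slope(forces, sd_two)
print(ratio)  # near sqrt(2) ~ 1.41
```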
[Figure 1 graphic: (A) constant error: force bias (N) versus instructed force (3–33 N), R² = 0.62 and 0.93 for the two hand conditions; (B) variable error: standard deviation (N) versus instructed force (3–33 N), R² = 0.98 and 0.97; legend: dominant hand, both hands.]
Figure 1: The last 3 seconds of each trial were used to estimate the bias (A) and standard deviation (B) for each instructed force level and hand condition. Averages over subjects and trials, with standard error bars, are shown in the figure. The standard deviation estimates were corrected for sensor noise, measured by placing a 2.5 kg object on the sensor and recording for 10 seconds.
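The sensor-noise correction mentioned in the caption amounts to subtracting variances, assuming the sensor noise is independent of the physiological fluctuations. A minimal sketch with hypothetical numbers:

```python
# Corrected std = sqrt(measured variance - sensor variance). The two input
# values are hypothetical, chosen only to illustrate the arithmetic.
import math

sd_measured = 0.21  # measured std of the force signal (N), hypothetical
sd_sensor = 0.05    # std of sensor noise alone (N), hypothetical

sd_corrected = math.sqrt(max(sd_measured ** 2 - sd_sensor ** 2, 0.0))
print(round(sd_corrected, 3))  # -> 0.204
```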
The nonzero intercept in our data was smaller than previous observations but still significant. It is not due to sensor noise (as previously suggested), because we measured that noise and subtracted its variance. One possible explanation is that because of cocontraction, some force fluctuations are present even when the mean force is 0. Figure 2A shows the power spectral density of the fluctuations in the two conditions, separated into low (3–15 N) and high (21–33 N) force levels. The scaling is present at all frequencies, as observed previously (Sutton & Sykes, 1967). Both increasing actuator redundancy and decreasing the force level have similar effects on the spectral density. To identify possible differences between frequency bands, we low-pass-filtered the data at 5 Hz and bandpass-filtered at 5–25 Hz. As shown in Figure 2B, the noise in both frequency bands obeys the same scaling law: standard deviation linear in the mean, with higher slope in the one-hand condition. The slopes in the two frequency bands are different, and interestingly the intercept we saw before is restricted to low frequencies. If the nonzero intercept is indeed due to cocontraction, Figure 2B implies that the cocontraction signal (i.e., common central input to opposing muscles) fluctuates at low frequencies. We also found small but highly significant negative biases (defined as the difference between measured and instructed force) that increased with the instructed force level and were higher in the one-hand condition (see Figure 1A). This effect cannot be explained with perceptual or memory
[Figure 2 graphic: (A) spectral densities: average power (dB) versus frequency (0–25 Hz) for low and high forces, dominant hand and both hands, with average sensor noise; (B) frequency bands: standard deviation (N) versus instructed force (3–33 N) in the 0–5 Hz and 5–25 Hz bands, R² values 0.94–0.99.]
Figure 2: (A) The power spectral density of the force fluctuations was estimated using blocks of 500 sample points with 250-point overlap. Blocks were mean-detrended, windowed using a Hanning window, and the squared magnitude of the Fourier coefficients averaged separately for each frequency. This was done separately for each instructed force level, and then the low (3–15 N) and high (21–33 N) force levels were averaged. The data from some subjects showed a sharper peak around 8–10 Hz, but that is smoothed out in the average plot. There appears to be a qualitative change in the way average power decreases at about 5 Hz. (B) The data were low-pass-filtered at 5 Hz (fifth-order Butterworth filter) and also bandpass-filtered at 5–25 Hz. The standard deviation for each force level and hand condition was estimated separately in each frequency band.
limitations, since subjects received real-time visual feedback on a linear force scale. A similar effect is predicted by optimal force production: if the desired force level is μ* and we specify mean activation μ for a single generator, the expected square error is (μ − μ*)² + a²μ², which is minimal for μ = μ*/(1 + a²) < μ*. Thus, the optimal bias is negative, larger in the one-hand condition, and its magnitude increases with μ*. The slopes in Figure 1A are substantially larger than predicted, which is most likely due to a trade-off between error and effort (see section 5.2). Other possible explanations include an inaccurate internal estimate of the noise magnitude and a cost function that penalizes large fluctuations more than the square error cost does (see section 5.4). Summarizing the results of the experiment, the neuromotor noise scaling law observed previously (Sutton & Sykes, 1967; Schmidt et al., 1979) was replicated. Our prediction that redundant generators reduce noise was confirmed. Thus, we feel justified in assuming that each generator contributes force whose standard deviation is a linear function of the mean.
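The single-generator bias argument can be verified with a coarse scan of the cost function; the noise scale a and the target force are arbitrary test values:

```python
# Grid check that mu = target / (1 + a^2) minimizes the expected squared
# error (mu - target)^2 + a^2 * mu^2. The constants are hypothetical.
a, target = 0.1, 10.0

def cost(mu):
    return (mu - target) ** 2 + (a * mu) ** 2

grid = [i / 1000 for i in range(12000)]  # 0.001 resolution, 0 to 12
mu_best = min(grid, key=cost)
mu_pred = target / (1 + a ** 2)
print(mu_best, round(mu_pred, 3))  # both near 9.901: a negative bias
```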
3 Force Production Model

Consider a set V of force generators producing force (torque) in D-dimensional Euclidean space R^D. Generator a ∈ V produces force proportional to its instantaneous activation, always in the direction of the unit vector u(a) ∈ R^D.⁷ The dimensionality D can be only 2 or 3 for end-point force but much higher for joint torque—for example, 7 for the human arm. The central nervous system (CNS) specifies the mean activations μ(a), which are always nonnegative. The actual force contributed by each generator is (μ(a) + z(a)) u(a). The neuromotor noise z is a set of zero-mean random variables whose probability distribution p(z|μ) depends on μ (see below), and p(z(a) < −μ(a)) = 0 since muscles cannot push. Thus, the net force r(μ) ∈ R^D is the random variable:⁸

r(μ) = |V|⁻¹ Σ_{a∈V} (μ(a) + z(a)) u(a).
Given a desired net force vector f ∈ R^D and optionally a cocontraction command (net activation) C = |V|⁻¹ Σ_a μ(a), the task of the CNS is to find the activation profile μ(a) ≥ 0 that minimizes the expected force error under p(z|μ).⁹ We will define error as the squared Euclidean distance between the desired force f and actual force r (alternative cost functions are considered in section 5.4). Note that both direction and magnitude errors are penalized, since both are important for achieving the desired motor objectives. The expected error is the sum of a variance term V and a squared bias term B,

E_{z|μ}[(r − f)ᵀ(r − f)] = trace(Cov_{z|μ}[r, r]) + (r̄ − f)ᵀ(r̄ − f),   (3.1)

where the first term is V, the second is B, and the mean force is r̄ = |V|⁻¹ Σ_a μ(a) u(a) since E[z(a)] = 0 for each a. We first focus on minimizing the variance term V for specified mean force r̄ and then perform another minimization with respect to (w.r.t.) r̄.

⁷ a will be used interchangeably as an index over force generators in the discrete case and as a continuous index specifying direction in the continuous case.
⁸ The scaling constant |V|⁻¹ simplifies the transition to a continuous V later: |V|⁻¹ Σ_a ⋯ will be replaced with |S_D|⁻¹ ∫ ⋯ da, where |S_D| is the surface area of the unit sphere in R^D. It does not affect the results.
⁹ μ(a) is the activation profile of all generators at one point in time, corresponding to a given net force vector. In contrast, a tuning curve is the activation of a single generator when the net force direction varies. When μ(a) is symmetric around the net force direction, it is identical to the tuning curve of all generators. This symmetry holds in most of our results, except for nonuniform distributions of force directions. In that case, we will compute the tuning curve explicitly (see Figure 4).

Exchanging the order of the trace, E, and Σ operators, the variance term becomes:

V = |V|⁻² trace( E_{z|μ}[ Σ_{a∈V} Σ_{b∈V} z(a) z(b) u(a) u(b)ᵀ ] )
  = |V|⁻² Σ_{a∈V} Σ_{b∈V} Cov_{z|μ}[z(a), z(b)] u(b)ᵀ u(a).
To evaluate V, we need a definition of the noise covariance Cov_{z|μ}[z(a), z(b)] for any pair of generators a, b. Available experimental results only suggest the form of the expression for a = b; since the standard deviation is a linear function of the mean force, Cov_{z|μ}[z(a), z(a)] is a quadratic polynomial of μ(a). This will be generalized to a quadratic polynomial across directions as

Cov_{z|μ}[z(a), z(b)] = ( λ1 μ(a)μ(b) + λ2 (μ(a) + μ(b))/2 ) (δ_ab + λ3).

The δ_ab term is a delta function corresponding to independent noise for each force generator. The correlation term λ3 corresponds to fluctuations in some shared input to all force generators. A correlation term dependent on the angle between u(a) and u(b) is considered in section 5.3. Substituting in the above expression for V and defining U = |V|⁻¹ Σ_a u(a), which is 0 when the force directions are uniformly distributed, we obtain

V = |V|⁻² Σ_{a∈V} (λ1 μ(a)² + λ2 μ(a)) + λ1 λ3 r̄ᵀr̄ + λ2 λ3 r̄ᵀU.   (3.2)
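Equation 3.2 can be checked by simulation in the uncorrelated case (λ3 = 0), where it reduces to V = |V|⁻² Σ_a (λ1 μ(a)² + λ2 μ(a)) for unit force vectors. In the sketch below the generator set, the λ values, and the activation profile are arbitrary choices, and gaussian noise is used for simplicity:

```python
# Compare the variance formula of equation 3.2 (lambda3 = 0) against the
# trace of a sampled covariance, for 8 planar generators with independent
# noise. All numeric values are illustrative.
import math
import random

random.seed(4)
l1, l2 = 0.01, 0.02
angles = [i * 2 * math.pi / 8 for i in range(8)]
u = [(math.cos(t), math.sin(t)) for t in angles]
mu = [max(0.0, 5 * math.cos(t)) for t in angles]  # a truncated-cosine profile

n = len(u)
V_formula = sum(l1 * m ** 2 + l2 * m for m in mu) / n ** 2

trials = 100_000
samples = []
for _ in range(trials):
    rx = ry = 0.0
    for m, (ux, uy) in zip(mu, u):
        sd = math.sqrt(l1 * m ** 2 + l2 * m)
        z = random.gauss(0.0, sd) if sd > 0 else 0.0
        rx += (m + z) * ux / n
        ry += (m + z) * uy / n
    samples.append((rx, ry))

mx = sum(s[0] for s in samples) / trials
my = sum(s[1] for s in samples) / trials
V_mc = sum((sx - mx) ** 2 + (sy - my) ** 2 for sx, sy in samples) / trials
print(V_formula, V_mc)  # the two estimates agree
```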
The optimal activation profile can be computed in two steps: (1) for given r̄ in equation 3.2, find the constrained minimum V*(r̄) w.r.t. μ; (2) substitute in equation 3.1 and find the minimum of V*(r̄) + B(r̄) w.r.t. r̄. Thus, the shape of the optimal activation profile emerges in step 1 (see section 4), while the optimal bias is found in step 2 (see section 5.1).

4 Cosine Tuning Minimizes Expected Motor Error

The minimization problem given by equation 3.2 is an instance of a more general minimization problem described next. We first solve that general problem and then specialize the solution to equation 3.2. The set V we consider can be continuous or discrete. The activation function (vector) μ ∈ R(V): V → R is nonnegative for all a ∈ V. Given an arbitrary positive weighting function w ∈ R(V), projection functions g1,…,N ∈ R(V), resultant lengths r1,…,N ∈ R, and offset l ∈ R, we will solve the following
minimization problem w.r.t. μ:

Minimize     ⟨μ + l, μ + l⟩
Subject to   μ(a) ≥ 0,  ⟨μ, g1,…,N⟩ = r1,…,N      (4.1)
Where        ⟨u, v⟩ ≜ |V|⁻¹ Σ_{a∈V} u(a) v(a) w(a).
The generalized dot product is symmetric, linear in both arguments, and positive definite (since w(a) > 0 by definition). Note that the dot product is defined between activation profiles rather than force vectors. The solution to this general problem is given by the following result (see the appendix):¹⁰

Theorem 1. If μ*(a) = ⌊Σ_n αn gn(a) − l⌋ satisfies ⟨μ*, g1,…,N⟩ = r1,…,N for some α1,…,N ∈ R, then μ* is the unique constrained minimum of ⟨μ + l, μ + l⟩.

Thus, the unique optimal solution to equation 4.1 is a truncated linear combination of the projection functions g1,…,N, assuming that a set of N constants α1,…,N ∈ R satisfying the N constraints ⟨μ*, g1,…,N⟩ = r1,…,N exists. Although we have not been able to prove their existence in general, for the concrete problems of interest, these constants can be found by construction (see below). Note that the analytic form of μ* does not depend on the arbitrary weighting function w used to define the dot product (the numerical values of the constants α1,…,N can of course depend on w). In the case Σ_n αn gn(a) ≥ l for all a ∈ V, the constants α1,…,N satisfy the system of linear equations Σ_n αn ⟨gn, gk⟩ = rk + ⟨l, gk⟩ for k = 1, …, N. It can be solved by inverting the matrix G_{nk} ≜ ⟨gn, gk⟩; for the functions g1,…,N considered in section 4.3, the matrix G is always invertible.

4.1 Application to Force Generation

We now clarify what this general result has to do with our problem. Recall that the goal is to find the nonnegative activation profile μ(a) ≥ 0 that minimizes equation 3.2 for given net force r̄ = |V|⁻¹ Σ_a μ(a)u(a) and optionally cocontraction C = |V|⁻¹ Σ_a μ(a). Omitting the last two terms in equation 3.2, which are constant, we have to minimize Σ_a (λ1 μ(a)² + λ2 μ(a)) = λ1 Σ_a (μ(a) + l)² + const, where l ≜ λ2/(2λ1).
Choosing the weighting function w(a) = 1 and assuming λ1 > 0 as the experimental data indicate, this is equivalent to minimizing the dot product ⟨μ + l, μ + l⟩. Let e1,…,D be an orthonormal basis of R^D, with respect to which r̄ has coordinates r̄ᵀe1, …, r̄ᵀeD and u(a) has coordinates u(a)ᵀe1, …, u(a)ᵀeD. Then we can define r1,…,D ≜ r̄ᵀe1, …, r̄ᵀeD, r_{D+1} ≜ C, g1,…,D(a) ≜ u(a)ᵀe1, …, u(a)ᵀeD, g_{D+1}(a) ≜ 1, and N ≜ D + 1 or D depending on whether the cocontraction command is specified.

¹⁰ ⌊x⌋ = x for x ≥ 0 and 0 otherwise. Similarly, ⌈x⌉ = x for x < 0 and 0 otherwise.

With these definitions, the problem is in the form of equation 4.1, theorem 1 applies, and we are guaranteed that the unique optimal activation profile is μ*(a) = ⌊Σ_n αn gn(a) − l⌋ as long as we can find α1,…,N ∈ R for which μ* satisfies all constraints. Why is that function a cosine? The function gn(a) = u(a)ᵀen is the cosine of the angle between the unit vectors u(a) and en. A linear combination of D-dimensional cosines is also a cosine: Σ_n αn gn(a) = Σ_n αn u(a)ᵀen = u(a)ᵀ(Σ_n αn en), and thus μ*(a) = ⌊u(a)ᵀE − l⌋ for E = Σ_n αn en. When C is specified, we have μ*(a) = ⌊u(a)ᵀE + α_{D+1} − l⌋ since g_{D+1}(a) = 1. Note that if we are given E, the constants α are simply αn = Eᵀen since the basis e1,…,D is orthonormal. To summarize the results so far, we showed that the minimum generalized length ⟨μ + l, μ + l⟩ of the nonnegative activation profile μ(a) subject to linear equality constraints ⟨μ, g1,…,N⟩ = r1,…,N is achieved for a truncated linear combination ⌊Σ_n αn gn(a) − l⌋ of the projection functions g1,…,N. Given a mean force r̄ and optionally a cocontraction command C, this generalized length is proportional to the variance of the net muscle force, with the projection functions being cosines. Since a linear combination of cosines is a cosine, the optimal activation profile is a truncated cosine. In the rest of this section, we compute the optimal activation profile in two special cases. In each case, all we have to do is construct—by whatever means—a function of the specified form that satisfies all constraints. Theorem 1 then guarantees that we have found the unique global minimum.

4.2 Uniform Distribution of Force Directions in R^D

For convenience, we will work with a continuous set of force generators V = S_D, the unit sphere embedded in R^D.
The summation signs will be replaced by integrals, and the force generator index a will be assumed to cover S_D uniformly, that is, the distribution of force directions is uniform. The normalization constant becomes the surface area¹¹ of the unit sphere, |S_D| = 2π^(D/2)/Γ(D/2). The unit vector u(a) ∈ R^D corresponds to point a on S_D. The goal is to find a truncated cosine function that satisfies the constraints |S_D|⁻¹ ∫_{S_D} μ*(a)u(a) da = r̄ and optionally |S_D|⁻¹ ∫_{S_D} μ*(a) da = C. We will look for a solution with axial symmetry around r̄, that is, a μ*(a) that depends only on the angle between the vectors r̄ and u(a) rather than the actual direction u(a). This problem can be transformed into a problem on the circle in R² by correcting for the area of S_D being mapped into each point on the circle.

¹¹ The first few values of |S_D| are: |S_{1,…,7}| = (2, 2π, 4π, 2π², (8/3)π², π³, (16/15)π³). Numerically |S_D| decreases for D > 7.
The set of unit vectors u ∈ R^D at angle a away from a given vector r̄ ∈ R^D is a sphere in R^(D−1) with radius |sin(a)|. Therefore, for any function f: S_D → R with axial symmetry around r̄, we have ∫_{S_D} f = (1/2)|S_{D−1}| ∫_{−π}^{π} f(a) |sin(a)|^(D−2) da, where the correction factor (1/2)|S_{D−1}| |sin(a)|^(D−2) is the surface area of an R^(D−1) sphere with radius |sin(a)|. Without loss of generality, r̄ can be assumed to point along the positive x-axis: r̄ = [R 0]. For given dimensionality D, define the weighting function w_D(a) ≜ (1/2)|S_{D−1}| |sin(a)|^(D−2) for a ∈ [−π; π], which as before defines the dot product ⟨u, v⟩_D ≜ |S_D|⁻¹ ∫_{−π}^{π} u(a) v(a) w_D(a) da. The projection functions on the circle in R² are g1(a) = cos(a), g2(a) = sin(a), and optionally g3(a) = 1. Thus, μ* has to satisfy ⟨μ*, cos⟩_D = R, ⟨μ*, sin⟩_D = 0, and optionally ⟨μ*, 1⟩_D = C. Below, we set α2 = 0 and find constants α1, α3 ∈ R for which the function μ*(a) = ⌊α1 cos(a) + α3⌋ satisfies those constraints. Since ⟨⌊α1 cos(a) + α3⌋, sin⟩_D = 0 for any α1, α3, we are concerned only with the remaining two constraints. Note that ⟨μ*, cos⟩_D ≤ ⟨μ*, 1⟩_D and therefore R ≤ C whenever C is specified. Also, from the definition of w_D(a) and the identity D ∫ cos² sin^(D−2) = sin^(D−1) cos + ∫ sin^(D−2), it follows that ⟨cos, 1⟩_D = 0, ⟨1, 1⟩_D = 1, and ⟨cos, cos⟩_D = 1/D.
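These identities can be checked by numerical integration; the sketch below uses the fact that |S_D| = |S_{D−1}| ∫₀^π sin^(D−2), so the |S_{D−1}| factors cancel from the dot product. The D values tested are arbitrary:

```python
# Verify <cos,1>_D = 0, <1,1>_D = 1, <cos,cos>_D = 1/D by midpoint
# integration with weight w_D(a) proportional to |sin a|^(D-2).
import math

def dot(u, v, D, N=50_000):
    # Generalized dot product; normalization constants cancel in num/den.
    num = den = 0.0
    for i in range(N):
        a = -math.pi + (i + 0.5) * 2 * math.pi / N
        w = abs(math.sin(a)) ** (D - 2)
        num += u(a) * v(a) * w
        den += w
    return num / den

one = lambda a: 1.0
for D in (2, 3, 5):
    assert abs(dot(math.cos, one, D)) < 1e-4
    assert abs(dot(one, one, D) - 1.0) < 1e-9
    assert abs(dot(math.cos, math.cos, D) - 1.0 / D) < 1e-3
print("identities hold for D = 2, 3, 5")
```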
4.2.1 Specified Cocontraction C

We are looking for a function of the form μ*(a) = ⌊α1 cos(a) + α3⌋ that satisfies the equality constraints. For α3 ≥ α1, this function is a full cosine. Using the above identities, we find that the constraints R = ⟨μ*, cos⟩_D = α1⟨cos, cos⟩_D + α3⟨1, cos⟩_D and C = ⟨μ*, 1⟩_D = α1⟨cos, 1⟩_D + α3⟨1, 1⟩_D are satisfied when α1 = DR and α3 = C. Thus, the optimal solution is a full cosine when C/R ≥ D (corresponding to α3 ≥ α1). When C/R < D, a full cosine solution cannot be found; thus, we look for a truncated cosine solution. Let the truncation point be a = ±t, that is, α3 = −α1 cos(t). To satisfy all constraints, t has to be the root of the trigonometric equation

C/R = D (sin(t)^(D−1)/(D − 1) − cos(t) I_D(t)) / (I_D(t) − cos(t) sin(t)^(D−1)/(D − 1)),

where I_D(t) ≜ ∫₀^t sin(a)^(D−2) da. That integral can be evaluated analytically for any fixed D. Once t is computed numerically, the constant α1 is given by α1 = DR (|S_D|/|S_{D−1}|) (I_D(t) − cos(t) sin(t)^(D−1)/(D − 1))⁻¹. It can be verified that the above trigonometric equation has a unique solution for any value of C/R in the interval (1, D). Values smaller than 1 are inadmissible because R ≤ C.
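For D = 2 (where I_2(t) = t), the trigonometric equation reduces to C/R = 2(sin t − t cos t)/(t − sin t cos t), which increases monotonically from 1 to 2 as t goes from 0 to π and is easy to solve by bisection. A sketch with an arbitrarily chosen ratio C/R = 1.5:

```python
# Bisection for the truncation point t at D = 2. The target ratio C/R = 1.5
# is an arbitrary test value in the admissible interval (1, 2).
import math

def c_over_r(t):
    return 2 * (math.sin(t) - t * math.cos(t)) / (t - math.sin(t) * math.cos(t))

def solve_t(ratio):
    """Find t in (0, pi) with c_over_r(t) = ratio, for 1 < ratio < 2."""
    lo, hi = 1e-6, math.pi - 1e-6
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if c_over_r(mid) < ratio else (lo, mid)
    return (lo + hi) / 2

t = solve_t(1.5)
print(math.degrees(t), c_over_r(t))  # recovered width; the ratio check gives 1.5
```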
[Figure 3 graphic: optimal tuning width t (deg, 0–180) for D = 2, …, 7; (A) versus (C/R − 1)/(D − 1) from 0 to 1; (B) versus −l/RD from −3 to 1.]
Figure 3: (A) The truncation point–optimal tuning width t computed from equation 4.2 for D = 2, …, 7. Since the solution is a truncated cosine when 1 < C/R < D, it is natural to scale the x-axis of the plot as (C/R − 1)/(D − 1), which varies from 0 to 1 regardless of the dimensionality D. For C/R ≥ D, we have the full cosine solution, which technically corresponds to t = 180. (B) The truncation point–optimal tuning width t computed from equation 4.3 for D = 2, …, 7. Since the solution is a truncated cosine when −l/R < D, it is natural to scale the x-axis of the plot as −l/RD, which varies from −∞ to 1 regardless of the dimensionality D. For −l/R ≥ D, we have the full cosine solution: t = 180.
Summarizing the solution,

μ*(a) = { DR cos(a) + C,          C/R ≥ D
        { α1 ⌊cos(a) − cos(t)⌋,   C/R < D.      (4.2)
In Figure 3A we have plotted the optimal tuning width t in different dimensions, for the truncated cosine case C/R < D.

4.2.2 Unspecified Cocontraction C

In this case, α3 = −l, that is, μ*(a) = ⌊α1 cos(a) − l⌋. For −l ≥ α1, the solution is a full cosine, and α1 = DR as before. When −l < DR, the solution is a truncated cosine. Let the truncation point be a = ±t. Then α1 = l/cos(t), and t has to be the root of the trigonometric equation

l/(DR) = (|S_D|/|S_{D−1}|) cos(t) (I_D(t) − cos(t) sin(t)^(D−1)/(D − 1))⁻¹.

Note that α1 as a function of t is identical to the previous case when C was fixed, while the equation for t is different.
Summarizing the solution:

μ*(a) = { DR cos(a) − l,          −l/R ≥ D
        { α1 ⌊cos(a) − cos(t)⌋,   −l/R < D.     (4.3)
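For D = 2 the section 4.2.2 equation can be cross-checked numerically: within the family of truncated cosines that satisfy the mean-force constraint, the width minimizing ⟨μ + l, μ + l⟩ should be the root of the equation, which for D = 2 reads 2πR cos(t) = l(t − sin(t)cos(t)). In the sketch, R = l = 1 are arbitrary test values:

```python
# Grid-search the best truncation width for D = 2 and compare it with the
# analytic root of 2*pi*R*cos(t) = l*(t - sin(t)*cos(t)). R and l are
# arbitrary test values.
import math

R, l = 1.0, 1.0
N = 4001
xs = [-math.pi + (i + 0.5) * 2 * math.pi / N for i in range(N)]
da = 2 * math.pi / N

def objective(t):
    # amplitude chosen so the constraint <mu, cos> = R holds for width t
    a1 = 2 * math.pi * R / (t - math.sin(t) * math.cos(t))
    mu = (a1 * max(math.cos(x) - math.cos(t), 0.0) for x in xs)
    return sum((m + l) ** 2 for m in mu) * da / (2 * math.pi)

t_best = min((0.2 + 0.01 * i for i in range(295)), key=objective)

def g(t):  # positive below the root, negative above it
    return 2 * math.pi * R * math.cos(t) - l * (t - math.sin(t) * math.cos(t))

lo, hi = 0.1, math.pi / 2
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)

print(t_best, lo)  # grid optimum and analytic root agree
```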
In Figure 3B we have plotted the optimal tuning width t in different dimensions, for the truncated cosine case ¡l < D. R Comparing the curves in Figure 3A and Figure 3B, we notice that in both cases, the optimal tuning width is rather large (it is advantageous to activate multiple force generators), except in Figure 3A for C / R ¼ 1. From the triangle inequality, a solution m (a) exists only when C ¸ R, and C D R 6 implies m (a D 0) D 0. Thus, a small C “forces” the activation prole to become a delta function. But as soon as that constraint is relaxed, the width of the optimal solution increases sharply. Note also that in both gures, the dimensionality D makes little difference after appropriate scaling of the abscissa. 4.3 Arbitrary Distribution of Force Directions in R2 . For a Puniform dis-2 tribution of force directions, it was possible to replace the term a (m (a) C l) R with SD (m (a) C l) 2 da in equations 3.2. If the distribution is not uniform but instead is given by some density function w (a) , we have to take that function m ¤ that minimizes R into¤ account2 and nd the activation¡1 prole R ¡1 ¤ (m (a) C l) w (a) da subject to | SD | | SD | m (a) u(a)w(a) da D r SD SD R ¡1 ¤ (a) (a) | | and optionally SD w da D C. Theorem 1 still guarantees that SD m the optimal m ¤ is a truncated cosine, assuming we can nd a truncated cosine satisfying the constraints. It is not clear how to do that for arbitrary dimensionality D and arbitrary density w, so we address only the case D D 2. For arbitrary w (a) and D D 2, the solution is in the form m ¤ (a) D ba1 cos(a) C a2 sin(a) C a3 c. When C is not specied, we have a3 D ¡l. Here we evaluate these parameters only when C is specied and large enough to ensure a full cosine solution. The remaining cases can be handled using (a) in a Fourier techniques similar to Pthe previous sections. 
Expanding $w(\alpha)$ in a Fourier series, $w(\alpha) = \frac{u_0}{2} + \sum_{n=1}^{\infty}(u_n\cos n\alpha + v_n\sin n\alpha)$, and solving the system of linear equations given by the constraints, we obtain

$$\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} \frac{u_0+u_2}{2} & \frac{v_2}{2} & u_1 \\ \frac{v_2}{2} & \frac{u_0-u_2}{2} & v_1 \\ u_1 & v_1 & u_0 \end{bmatrix}^{-1} \begin{bmatrix} 2R \\ 0 \\ 2C \end{bmatrix}.$$
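This constraint system can be checked numerically. The sketch below is our own helper (not from the article); it assumes NumPy and the Fourier convention $w(\alpha) = u_0/2 + \sum_n (u_n\cos n\alpha + v_n\sin n\alpha)$, and reproduces the uniform-density solution $a_1 = DR$, $a_2 = 0$, $a_3 = C$ for $D = 2$:

```python
# Sketch (our own helper, not from the article): solve the 3x3 linear
# system for the full-cosine parameters (a1, a2, a3), given Fourier
# coefficients of the density w(a) = u0/2 + sum_n (u_n cos na + v_n sin na).
import numpy as np

def cosine_parameters(u0, u1, u2, v1, v2, R, C):
    M = np.array([[(u0 + u2) / 2, v2 / 2, u1],
                  [v2 / 2, (u0 - u2) / 2, v1],
                  [u1, v1, u0]])
    return np.linalg.solve(M, [2 * R, 0.0, 2 * C])

# Uniform density w = 1 (u0 = 2, all other coefficients 0): recovers
# a1 = D*R = 2R, a2 = 0, a3 = C for D = 2.
print(cosine_parameters(2, 0, 0, 0, 0, R=1.0, C=5.0))
```

The returned parameters can also be verified directly against the integral constraints for a nonuniform density such as $w(\alpha) = 1 + 0.25\cos\alpha$ (the density used in Figure 4B).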
Cosine Tuning Minimizes Motor Errors

The optimal $\mu^*$ depends only on the Fourier coefficients of $w(\alpha)$ up to order 2; the higher-order terms do not affect the minimization problem. In the previous sections, $w(\alpha)$ was equal to the dimensionality correction factor $|\sin(\alpha)|^{D-2}$, in which case only $u_0$ and $u_2$ were nonzero, the above matrix became diagonal, and thus we had $a_1 \propto R$, $a_2 = 0$, $a_3 \propto C$. Note that $\mu^*(\alpha)$ is the optimal activation profile over the set of force generators for fixed mean force. In the case of uniformly distributed force directions, this also described the tuning function of an individual force generator for varying mean force direction, since $\mu^*(\alpha)$ was centered at 0 and had the same shape regardless of force direction. That is no longer true here. Since $w(\alpha)$ can be asymmetric, the directions of the force generator and the mean force matter (as illustrated in Figures 4a and 4b). The tuning functions of several force generators at different angles from the peak of $w(\alpha)$ are plotted in Figures 4a and 4b. The direction of maximal activation rotates away from the generator force direction and toward the short axis of $w(\alpha)$, for generators whose force direction lies in between the short and long axes of $w(\alpha)$. This effect has been observed experimentally for planar arm movements, where the distribution of muscle lines of action is elongated along the hand-shoulder axis (Cisek & Scott, 1998). In that case, muscles are maximally active when the net force is rotated away from their mechanical line of action, toward the short axis of the distribution. The same effect is seen in wrist muscles, where the distribution of lines of action is again asymmetric (Hoffman & Strick, 1999). The tuning modulation (difference between the maximum and minimum of the tuning curve) also varies systematically, as shown in Figures 4a and 4b. Such effects are more difficult to detect experimentally, since that would require comparisons of the absolute values of signals recorded from different muscles or neurons.

5 Some Extensions

5.1 Optimal Force Bias. The optimal force bias can be found by solving equation 3.1: minimize $V^*(\mathbf{r}) + B(\mathbf{r})$ w.r.t. $\mathbf{r}$.
We will solve it analytically only for a uniform distribution of force directions in $\mathbb{R}^D$ and when the minimum in equation 3.2 is a full cosine. It can be shown using equations 4.1 and 4.2 that for both specified and unspecified $C$, the variance term dependent on $\mu^*$ in equation 3.2 is $\frac{\lambda_1 D}{|S_D|}R^2$. It is clear that the optimal mean force $\mathbf{r}$ is parallel to the desired force $\mathbf{f}$, and all we have to find is its magnitude $R = \|\mathbf{r}\|$. Then to solve equation 3.1, we have to minimize w.r.t. $R$ the following expression: $\frac{\lambda_1 D}{|S_D|}R^2 + \lambda_1\lambda_3 R^2 + (R - \|\mathbf{f}\|)^2$. Setting the derivative to 0, the minimum is achieved for

$$\|\mathbf{r}\| = \frac{\|\mathbf{f}\|}{1 + \lambda_1\lambda_3 + \lambda_1 D/|S_D|}.$$

Thus, for positive $\lambda_1, \lambda_3$ the optimal mean force magnitude $\|\mathbf{r}\|$ is smaller than the desired force magnitude $\|\mathbf{f}\|$, and the optimal bias $\|\mathbf{r}\| - \|\mathbf{f}\|$ increases linearly with $\|\mathbf{f}\|$.
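The stated minimizer can be sanity-checked numerically. A short sketch, assuming NumPy, illustrative parameter values of our own choosing, and, for concreteness, $|S_2| = 2\pi$:

```python
# Sketch: confirm that ||r|| = ||f|| / (1 + l1*l3 + l1*D/|S_D|) minimizes
# (l1*D/|S_D|)*R^2 + l1*l3*R^2 + (R - ||f||)^2.
# Parameter values below are illustrative, not taken from the article.
import numpy as np

def expected_cost(R, f, l1, l3, D, SD):
    return (l1 * D / SD) * R**2 + l1 * l3 * R**2 + (R - f)**2

f, l1, l3, D = 1.0, 0.2, 0.1, 2
SD = 2 * np.pi                     # |S_2|: circumference of the unit circle
R_opt = f / (1 + l1 * l3 + l1 * D / SD)

R_grid = np.linspace(0.0, 2 * f, 100001)
R_num = R_grid[np.argmin(expected_cost(R_grid, f, l1, l3, D, SD))]
# R_opt and R_num agree, and R_opt < f: optimal force is negatively biased.
print(R_opt, R_num)
```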
[Figure 4 plots. Panels: (A) $w(\alpha) = 1 + 0.8\cos(2\alpha)$; (B) $w(\alpha) = 1 + 0.25\cos(\alpha)$. Axes: Optimal Generator Tuning (0 to 8) versus Desired − Generator Direction (−180 to 180); polar insets show $w(\alpha)$.]
Figure 4: Optimal tuning of generators whose force directions point at different angles relative to the peak of the distribution $w(\alpha)$. 0 corresponds to the peak of $w(\alpha)$, and 90 (180, respectively) corresponds to the minimum of $w(\alpha)$. The polar plots show $w(\alpha)$, and the lines inside indicate the generator directions plotted in each figure. We used $R = 1$, $C = 5$. (A) $w(\alpha)$ has a second-order harmonic. In this case, the direction of maximal activation for generators near 45 rotates toward the short axis of $w(\alpha)$. The optimal tuning modulation increases for generators near 90. (B) $w(\alpha)$ has a first-order harmonic. In this case, the rotation is smaller, and the tuning curves near the short axis of $w(\alpha)$ shift upward rather than increasing their modulation.
5.2 Error-Effort Trade-off. Our formulation of the optimal control problem facing the motor system assumed that the only quantity being minimized is error (see equation 3.1). It may be more sensible, however, to minimize a weighted sum of error and effort, because avoiding fatigue in the current task can lead to smaller errors in tasks performed in the future. Indeed, we have found evidence for error+effort minimization in movement tasks (Todorov, 2001). To allow this possibility here, we consider a modified cost function of the form

$$E_{z|\mu}\left[(\mathbf{r}-\mathbf{f})^\top(\mathbf{r}-\mathbf{f})\right] + b\,|V|^{-2}\sum_{\alpha\in V}\mu(\alpha)^2.$$
The only change resulting from the inclusion of the activation penalty term is that the variance $V$ previously given by equation 3.2 now becomes

$$V = |V|^{-2}\sum_{\alpha\in V}\left((\lambda_1+b)\,\mu(\alpha)^2 + \lambda_2\,\mu(\alpha)\right) + \lambda_1\lambda_3\,\mathbf{r}^\top\mathbf{r} + \lambda_2\lambda_3\,\mathbf{r}^\top\mathbf{U}.$$
Thus, the results in section 4 remain unaffected (apart from the substitution $\lambda_1 \leftarrow \lambda_1 + b$), and the optimal tuning curve is the same as before. The only effect of the activation penalty is to increase the force bias. The optimal $\|\mathbf{r}\|$ computed in section 5.1 now becomes

$$\|\mathbf{r}\| = \frac{\|\mathbf{f}\|}{1 + \lambda_1\lambda_3 + (\lambda_1+b)\,D/|S_D|}.$$
Thus, the optimal bias $\|\mathbf{r}\| - \|\mathbf{f}\|$ increases with the weight $b$ of the activation penalty. This can explain why the experimentally observed bias in Figure 1A was larger than predicted by minimizing error alone.

5.3 Nonhomogeneous Noise Correlations. Thus far, we allowed only homogeneous correlations ($\lambda_3$) among noise terms affecting different generators. Here, we consider an additional correlation term ($\lambda_4$) that varies with the angle between two generators. The noise covariance model $\mathrm{Cov}_{z|\mu}[z(\alpha), z(\beta)]$ now becomes

$$\left(\lambda_1\,\mu(\alpha)\mu(\beta) + \lambda_2\,\frac{\mu(\alpha)+\mu(\beta)}{2}\right)\left(\delta_{\alpha\beta} + \lambda_3 + 2\lambda_4\,u(\beta)^\top u(\alpha)\right).$$

We focus on the case when the force generators are uniformly distributed in a two-dimensional work space ($D = 2$), the mean force is $\mathbf{r} = [R\ 0]$ as before, the cocontraction level $C$ is specified, and $C/R \ge D$. Using the identities $u(\beta)^\top u(\alpha) = \cos(\alpha-\beta)$ and $2\cos^2(\alpha-\beta) = 1 + \cos(2\alpha-2\beta)$, the force variance $V$ previously given by equation 3.2 now becomes

$$V = \frac{1}{4\pi^2}\int\left(\lambda_1\,\mu(\alpha)^2 + \lambda_2\,\mu(\alpha)\right)d\alpha + \lambda_1\lambda_3\,\mathbf{r}^\top\mathbf{r} + \frac{\lambda_1\lambda_4}{4}\left(p_2^2 + q_2^2\right) + \lambda_1\lambda_4\,C^2 + \lambda_2\lambda_4\,C,$$

where $p_2 = \frac{1}{\pi}\int\mu(\alpha)\cos(2\alpha)\,d\alpha$ and $q_2 = \frac{1}{\pi}\int\mu(\alpha)\sin(2\alpha)\,d\alpha$ are the second-order coefficients in the Fourier series $\mu(\alpha) = \frac{p_0}{2} + \sum_{n=1}^{\infty}(p_n\cos n\alpha + q_n\sin n\alpha)$. The integral term in $V$ can be expressed as a function of the Fourier coefficients using Parseval's theorem. The constraints on $\mu(\alpha)$ are $\frac{1}{2\pi}\int\mu(\alpha)\,d\alpha = C$, $\frac{1}{2\pi}\int\mu(\alpha)\cos(\alpha)\,d\alpha = R$, and $\frac{1}{2\pi}\int\mu(\alpha)\sin(\alpha)\,d\alpha = 0$. These constraints specify the $p_0$, $p_1$, and $q_1$ Fourier coefficients. Collecting all unconstrained terms in $V$ yields

$$V = \frac{\lambda_1}{4}\left(\lambda_4 + \frac{1}{\pi}\right)\left(p_2^2 + q_2^2\right) + \frac{\lambda_1}{4\pi}\sum_{n=3}^{\infty}\left(p_n^2 + q_n^2\right) + \mathrm{const}(C, \mathbf{r}).$$

Since the parameter $\lambda_1$ corresponding to the slope of the regression line in Figure 1B is positive, the above expression is a sum of squares with positive weights when $\lambda_4 > -\frac{1}{\pi}$. The unique minimum is then achieved when $p_{2,\ldots,\infty} = q_{2,\ldots,\infty} = 0$, and therefore the optimal tuning curve is $\mu(\alpha) = DR\cos(\alpha) + C$ as before.
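Because the unconstrained part of $V$ is a positively weighted sum of squared Fourier coefficients whenever $\lambda_4 > -1/\pi$, zeroing all coefficients of order $n \ge 2$ can only decrease it. A quick numeric illustration (NumPy; the parameter values are ours, not from the article):

```python
# Sketch: the unconstrained part of V is a positively weighted sum of
# squares of the Fourier coefficients p_n, q_n (n >= 2) when l4 > -1/pi,
# so it is minimized when all of them vanish.
import numpy as np

def V_unconstrained(p, q, l1=0.2, l4=0.5):
    # p[0], q[0] are the n = 2 coefficients; p[1:], q[1:] are n >= 3
    w2 = (l1 / 4) * (l4 + 1 / np.pi)   # weight on the n = 2 terms
    wn = l1 / (4 * np.pi)              # weight on the n >= 3 terms
    return w2 * (p[0]**2 + q[0]**2) + wn * np.sum(p[1:]**2 + q[1:]**2)

rng = np.random.default_rng(0)
p, q = rng.normal(size=5), rng.normal(size=5)
# Any nonzero higher-order coefficients strictly increase V.
assert V_unconstrained(p, q) > V_unconstrained(np.zeros(5), np.zeros(5))
```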
If nonhomogeneous correlations are present, one would expect muscles pulling in similar directions to be positively correlated ($\lambda_4 > 0$), as simultaneous EMG recordings indicate (Stephens, Harrison, Mayston, Carr, & Gibbs, 1999). This justifies the assumption $\lambda_4 > -\frac{1}{\pi}$.

5.4 Alternative Cost Functions. We assumed that the cost being minimized by the motor system is the square error of the force output. While a square error cost is common to most optimization models in motor control (see section 6), it is used for analytical convenience without any empirical support. This is not a problem for phenomenological models that simply look for a quantity whose minimum happens to match the observed behavior. But if we are to construct more principled models and claim some correspondence to a real optimization process in the motor system, it is necessary to confirm the behavioral relevance of the chosen cost function. How can we proceed in the absence of such empirical confirmation? Our approach is to study alternative cost functions, obtain model predictions through numerical simulation, and show that the particular cost function being chosen makes little difference. Throughout this section, we assume that $C$ is specified, the work space is two-dimensional, and the target force (without loss of generality) is $\mathbf{f} = [R\ 0]$. The cost function is now $\mathrm{Cost}_p(\mu) = E(\|\mathbf{r}-\mathbf{f}\|^p)$. We find numerically the optimal activations $\mu_{1,\ldots,15}$ for 15 uniformly distributed force generators. The noise terms $z_{1,\ldots,15}$ are assumed independent, with probability distribution matching the experimental data. In order to generate such noise terms, we combined the data for each instructed force level (all subjects, one-hand condition), subtracted the mean, divided by the standard deviation, and pooled the data from all force levels. Samples from the distribution of $z_i$ were then obtained as $z_i = \lambda_1\mu_i s$, where $s$ was sampled with replacement from the pooled data set. The scaling constant was set to $\lambda_1 = 0.2$.
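This Monte Carlo construction can be sketched in a few lines. The code below is a hypothetical reimplementation, not the author's Matlab code: it uses SciPy's BFGS instead of fminunc, standard normal samples in place of the pooled empirical noise distribution, a smaller noise sample than the one used in the paper, and it enforces $\mu_i \ge 0$ via absolute values and the mean-activation constraint by rescaling.

```python
# Hypothetical reimplementation (not the author's code): minimize a
# Monte Carlo estimate of Cost_p(mu) = E||r - f||^p for 15 generators
# with multiplicative noise z_i = l1 * mu_i * s, using standard normal
# s in place of the pooled empirical noise samples.
import numpy as np
from scipy.optimize import minimize

n, p_exp, l1, R, C = 15, 2.0, 0.2, 1.0, 4.0
angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
U = np.stack([np.cos(angles), np.sin(angles)])   # 2 x n force directions
f = np.array([R, 0.0])
rng = np.random.default_rng(1)
S = rng.standard_normal((n, 10000))              # fixed noise sample

def cost(x):
    mu = np.abs(x)                               # enforce mu_i >= 0
    mu = mu * (C / mu.mean())                    # enforce mean(mu) = C
    r = U @ (mu[:, None] * (1 + l1 * S)) / n     # net force per noise draw
    return np.mean(np.sum((r - f[:, None])**2, axis=0) ** (p_exp / 2))

res = minimize(cost, rng.random(n) + 1.0, method='BFGS')
mu_opt = np.abs(res.x) * (C / np.abs(res.x).mean())
# mu_opt approximates the full cosine 2*R*cos(alpha) + C predicted
# by the theory for p = 2 and C >= D*R.
```

Changing `p_exp` to 0.5 or 4 repeats the comparison of cost exponents discussed in the text.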
It could not be easily estimated from the data (because subjects used multiple muscles), but varying it from 0.2 to 0.1 did not affect the results presented here, as expected from section 4. To find the optimal activations, we initialized $\mu_{1,\ldots,15}$ randomly and minimized the Monte Carlo estimate of $\mathrm{Cost}_p(\mu)$ using BFGS gradient descent with numerically computed gradients (fminunc in the Matlab Optimization Toolbox). The constraints $\mu_i \ge 0$ and $\frac{1}{15}\sum\mu_i = C$ were enforced by scaling and using the absolute value of $\mu_i$ inside the estimation function. A small cost proportional to $|\frac{1}{15}\sum\mu_i - C|$ was added to resolve the scaling ambiguity. To speed up convergence, a fixed $15 \times 100{,}000$ random sample from the experimental data was used in each minimization run. The average of the optimal tuning curves found in 40 runs of the algorithm (using different starting points and random samples) is plotted in Figure 5, for $p = 0.5$ and $p = 4$. The optimal tuning curve with respect to the quadratic cost ($p = 2$) is also shown. For both full and truncated cosine solutions, the choice of cost function made little difference. We have repeated this analysis with gaussian noise and obtained very similar results. It is in principle possible to compare the different curves in Figure 5 to experimental data and try to identify the true cost function used by the motor system. However, the differences are rather small compared to the noise in empirically observed tuning curves, so this analysis is unlikely to produce unambiguous results.

[Figure 5 plot. Axes: Optimal Activation (0 to 8) versus angle (−180 to 180); curves for $p = 0.5$, $p = 2$, $p = 4$, at $R = 1$, $C = 4$ and at $R = 1$, $C = 1.275$.]

Figure 5: The average of 40 optimal tuning curves, for $p = 0.5$ and $p = 4$. The different tuning curves found in multiple runs were similar. The solution for $p = 2$ was computed using the results in section 4.

5.5 Movement Velocity and Displacement Tuning. The above analysis explains cosine tuning with respect to isometric force. To extend our results to dynamic conditions and address movement velocity and displacement tuning, we have to take into account the fact that muscle force production is state dependent. For a constant level of activation, the force produced by a muscle varies with its length and rate of change of length (Zajac, 1989), decreasing in the direction of shortening. The only way the CNS can generate a desired net muscle force during movement is to compensate for this dependence: since muscles pulling in the direction of movement are shortening, their force output for fixed neural input drops, and so their neural input has to increase. Thus, muscle activation has to correlate with movement velocity and displacement (Todorov, 2000). Now consider a short time interval in which neural activity can change, but all lengths, velocities, and forces remain roughly constant. In this setting, the analysis from the preceding sections applies, and the optimal tuning curve with respect to movement velocity and displacement is again a (truncated) cosine. While the relationship between muscle force and activation can be different in each time interval, the minimization problem itself remains the same; thus, each solution belongs to the family of truncated cosines described above. The net muscle force that the CNS attempts to generate in each time interval can be a complex function of the estimated state of the limb and the task goals. This complexity, however, does not affect our argument: we are not asking how the desired net muscle force is computed but how it can be generated accurately once it has been computed. The quasi-static setting considered here is an approximation, which is justified because the neural input is low-pass-filtered before generating force (the relationship between EMG and muscle force is well modeled by a second-order linear filter with time constants around 40 msec; Winter, 1990), and lengths and velocities are integrals of the forces acting on the limb, so they vary even more slowly compared to the neural input. Replacing this approximation with a more detailed model of optimal movement control is a topic for future work.

6 Discussion

In summary, we developed a model of noisy force production where optimal tuning is defined in terms of expected net force error. We proved that the optimal tuning curve is a (possibly truncated) cosine, for a uniform distribution $w(\alpha)$ of force directions in $\mathbb{R}^D$ and for an arbitrary distribution $w(\alpha)$ of force directions in $\mathbb{R}^2$. When both $w(\alpha)$ and $D$ are arbitrary, the optimal tuning curve is still a truncated cosine, provided that a truncated cosine satisfying all constraints exists. Although the analytical results were obtained under the assumptions of quadratic cost and homogeneously correlated noise, it was possible to relax these assumptions in special cases. Redefining optimal tuning in terms of error+effort minimization did not affect our conclusions.
The model makes three novel and somewhat surprising predictions. First, the model predicts a relationship between the shape of the tuning curve $\mu(\alpha)$ and the cocontraction level $C$. According to equation 4.2, when $C$ is large enough, the optimal tuning curve $\mu(\alpha) = DR\cos(\alpha) + C$ is a full cosine, which scales with the magnitude of the net force $R$ and shifts with $C$. But when $C$ is below the threshold value $DR$, the optimal tuning curve is a truncated cosine, which becomes sharper as $C$ decreases. Thus, we would expect to see sharper-than-cosine tuning curves in the literature. Such examples can indeed be found in Turner et al. (1995) and Hoffman and Strick (1999). A more systematic investigation in M1 (Amirikian & Georgopoulos, 2000) revealed that the tuning curves of most cells were better fit by sharper-than-cosine functions, presumably because of the low cocontraction level. We recently tested the above prediction using both M1 and EMG data and found that cells and muscles that appear to have higher contributions to the cocontraction level also have broader tuning curves, whose average is indistinguishable from a cosine (Todorov et al., 2000). This prediction can be tested more directly by asking subjects to generate specified net forces and simultaneously achieve different cocontraction levels. Second, under nonuniform distributions of force directions, the model predicts a misalignment between preferred and force directions, while the tuning curves remain cosine. This effect has been observed by Cisek and Scott (1998) and Hoffman and Strick (1999). Note that a nonuniform distribution of force directions does not necessitate misalignment; instead, the asymmetry can be compensated by using skewed tuning curves. Third, our analysis shows that optimal force production is negatively biased; the bias is larger when fewer force generators are active and increases with mean force. The measured bias was larger than predicted from error minimization alone, which suggests that the motor system minimizes a combination of error and effort, in agreement with results we have recently obtained in movement tasks (Todorov, 2001). The model for the first time demonstrates how cosine tuning could result from optimizing a meaningful objective function: accurate force production. Another model proposed recently (Zhang & Sejnowski, 1999a) takes a very different approach. It assumes a universal rule for encoding motion information in both sensory and motor areas,$^{12}$ which gives rise to cosine tuning. Its main advantage is that tuning for movement direction can be treated in the same framework in all parts of the nervous system, regardless of whether the motion signal is related to a body part or an external object perceived visually. But that model has two disadvantages: (1) it cannot explain cosine tuning with direction of force and displacement in the motor system, and (2) cosine tuning is explained with a new encoding rule that remains to be verified experimentally.
If the new encoding rule is confirmed, it would provide a mechanistic explanation of cosine tuning that does not address the question of optimality. In that sense, the model of Zhang and Sejnowski (1999a) can be seen as being complementary to ours.

6.1 Origins of Neuromotor Noise. The origin and scaling properties of neuromotor noise are of central importance in stochastic optimization models of the motor system. The scaling law relating the mean and standard deviation of the net force was derived experimentally. What can we say about the neural mechanisms responsible for this type of noise? Very little, unfortunately. Although a number of studies on motor tremor have analyzed the peaks in the power spectrum and how they are affected by different experimental manipulations, no widely accepted view of their origin has emerged (McAuley et al., 1997). Possible explanations include noise in the central drive, oscillations arising in spinal circuits, effects of afferent input, and mechanical resonance. One might expect the noise in the force output to reflect directly the noise in the descending M1 signals, in agreement with the finding that magnetoencephalogram fluctuations recorded over M1 are synchronous with EMG activity in contralateral muscles (Conway et al., 1995). On the level of single cells and muscles, however, this relationship is quite complicated. Cells in M1 (and most other areas of cortex) are well modeled as Poisson processes with coefficients of variation (CV) around 1 (Lee, Port, Kruse, & Georgopoulos, 1998). For a Poisson process, the spike count in a fixed interval has variance (rather than standard deviation) linear in the mean. The firing patterns of motoneurons are nothing like Poisson processes. Instead, motoneurons fire much more regularly, with CVs around 0.1 to 0.2 (DeLuca, 1995). Furthermore, muscle force is controlled to a large extent by recruiting new motor units, so noise in the force output may arise from the motor unit recruitment mechanisms, which are not very well understood. Other physiological mechanisms likely to affect the output noise distribution are recurrent feedback through Renshaw cells (which may serve as a decorrelating mechanism; Maltenfort, Heckman, & Rymer, 1998), as well as plateau potentials (caused by voltage-activated calcium channels) that may cause sustained firing of motoneurons in the absence of synaptic input (Kiehn & Eken, 1997). Also, muscle force is not just a function of motoneuronal firing rate, but depends significantly on the sequence of interspike intervals (Burke, Rudomin, & Zajac, 1976).$^{13}$ Thus, although the mean firing rates of M1 cells seem to contribute additively to the mean activations of muscle groups (Todorov, 2000), the small-timescale fluctuations in M1 and muscles have a more complex relationship. The motor tremor illustrated in Figure 1B should not be thought of as being the only source of noise. Under dynamic conditions, various calibration errors (such as inaccurate internal estimates of muscle fatigue, potentiation, length, and velocity dependence) can have a compound effect resembling multiplicative noise. This may be why the errors observed in dynamic force tasks (Schmidt et al., 1979) as well as reaching without vision (Gordon, Ghilardi, Cooper, & Ghez, 1994) are substantially larger than what the slopes in Figure 1B would predict.

$^{12}$ Assume each cell has a "hidden" function $W(x)$ and encodes movement in $x \in \mathbb{R}^D$ by firing in proportion to $dW(x(t))/dt$. From the chain rule, $dW/dt = \partial W/\partial x \cdot dx/dt = \nabla W \cdot \dot{x}$. This is the dot product of a cell-specific "preferred direction" $\nabla W$ and the movement velocity vector $\dot{x}$; thus, cosine tuning for movement velocity.

$^{13}$ Because of this nonlinear dependence, muscle force would be much noisier if motoneurons had Poisson firing rates, which may be why they fire so regularly.

6.2 From Muscle Tuning to M1 Cell Tuning. Since M1 cells are synaptically close to motoneurons (in some cases, the projection can even be monosynaptic; Fetz & Cheney, 1980), their activity would be expected to reflect properties of the motor periphery. The defining feature of a muscle is its line of action (determined by the tendon insertion points), in the same way that the defining feature of a photoreceptor is its location on the retina. A fixed line of action implies a preferred direction, just like a fixed retinal location implies a spatially localized receptive field. Thus, given the properties of the periphery, the existence of preferred directions in M1 is no more surprising than the existence of spatially localized receptive fields in V1.$^{14}$ Of course, directional tuning of muscles does not necessitate similar tuning in M1, in the same way that cells in V1 do not have to display spatial tuning; one can imagine, for example, a spatial Fourier transform in the retina or lateral geniculate nucleus that completely abolishes the spatial tuning arising from photoreceptors. But perhaps the nervous system avoids such drastic changes in representation, and tuning properties that arise (for whatever reason) in one area "propagate" to other densely connected areas, regardless of the direction of connectivity. Using this line of reasoning and the fact that muscle activity has to correlate with movement velocity and displacement in order to compensate for muscle visco-elasticity (see section 5.5), we have previously explained a number of seemingly contradictory phenomena in M1 without the need to evoke abstract encoding principles (Todorov, 2000). This article adds cosine tuning to that list of phenomena. We showed here that because of the multiplicative nature of motor noise, the optimal muscle tuning curve is a cosine. This makes cosine tuning a natural choice for motor areas that are close to the motor periphery. Motor areas that are further removed from motoneurons have less of a reason to display cosine tuning. Cerebellar Purkinje cells, for example, are often tuned for a limited range of movement speeds, and their tuning curves can be bimodal (Coltz, Johnson, & Ebner, 1999).
6.3 Optimization Models in Motor Control. A number of earlier optimization models explain aspects of motor behavior as emerging from the minimization of some cost functional. The speed-accuracy trade-off known as Fitts' law has been modeled in this way (Meyer, Abrams, Kornblum, Wright, & Smith, 1988; Hoff, 1992; Harris & Wolpert, 1998). The reaching movement trajectory that minimizes expected end-point error is computed under a variety of assumptions about the control system (intermittent versus continuous, open loop versus closed loop) and the noise scaling properties (velocity- versus neural-input-dependent). While each model has advantages and disadvantages in fitting existing data, they all capture the basic logarithmic relationship between target width and movement duration.
14 From this point of view, orientation tuning in V1 is surprising because it does not arise directly from peripheral properties. An equally surprising and robust phenomenon in M1 has not yet been found.
This robustness with respect to model assumptions suggests that Fitts' law indeed emerges from error minimization. Another set of experimental results that optimization models have addressed are the kinematic regularities observed in hand movements (Morasso, 1981; Lacquaniti, Terzuolo, & Viviani, 1983). While a number of physically relevant cost functions (e.g., minimum time, energy, force, impulse) were investigated (Nelson, 1983), better reconstruction of the bell-shaped speed profiles of reaching movements was obtained (Hogan, 1984) by minimizing squared jerk (the derivative of acceleration). Recently, the most accurate reconstructions of complex movement trajectories were also obtained by minimizing, under different assumptions, the derivative of acceleration (Todorov & Jordan, 1998) or torque (Nakano et al., 1999). While these fits to experimental data are rather satisfying, the seemingly arbitrary quantity being minimized is less so. The stochastic optimization model of Harris and Wolpert (1998) takes a more principled approach: it minimizes expected end-point error assuming that the standard deviation of neuromotor noise is proportional to the mean neural activation. Shouldn't that result in minimizing force and acceleration, which, as Nelson (1983) showed, yields unrealistic trajectories? It should, if muscle activation and force were identical, but they are not; instead, muscle force is a low-pass-filtered version of activation (Winter, 1990). As a result, the neural signal under dynamic conditions contains terms related to the derivative of force, and so the model of Harris and Wolpert (1998) effectively minimizes a cost that includes jerk or torque change along with other terms. It will be interesting to find tasks where maximizing accuracy and maximizing smoothness make different predictions and test which prediction is closer to observed trajectories. The noise model used by Harris and Wolpert (1998) is identical to ours under isometric conditions.
During movement, it is not known whether noise magnitude is better fit by mean force (as in the present model) or muscle activation (as in Harris & Wolpert, 1998). Our conclusions should not be sensitive to such differences, since we do not rely on muscle low-pass filtering to explain cosine tuning. Nevertheless, it is important to establish experimentally the properties of neuromotor noise during movement.

Appendix

The crucial fact underlying the proof of theorem 1 is that the linear span $L$ of the functions $g_{1,\ldots,N}$ is orthogonal to the hyperplane $P$ defined by the equality constraints in equation 4.1.

Lemma 1. For any $a_{1,\ldots,N} \in \mathbb{R}$ and $u, v \in R(V)$ satisfying $\langle u, g_{1,\ldots,N}\rangle = \langle v, g_{1,\ldots,N}\rangle = r_{1,\ldots,N}$, the $R(V)$ function $l(\alpha) = \sum_n a_n g_n(\alpha)$ is orthogonal to $u - v$, that is, $\langle u - v, l\rangle = 0$.
Figure 6: Illustration of the functions $\mu(\alpha)$, $\mu^*(\alpha)$, $\tilde{\mu}(\alpha)$, $\Delta(\alpha)$ in theorem 1, case 2, with $\sum_n a_n g_n(\alpha) = \cos(\alpha)$. The shaded region is the set $\zeta$ where $\cos(\alpha) < 0$. The key point is that $\Delta(\alpha)\,\tilde{\mu}(\alpha) \le 0$ for all $\alpha$.
Proof. $\langle u - v, l\rangle = \langle u, \sum_n a_n g_n\rangle - \langle v, \sum_n a_n g_n\rangle = \sum_n a_n\left(\langle u, g_n\rangle - \langle v, g_n\rangle\right) = \sum_n a_n (r_n - r_n) = 0$.
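Lemma 1 can also be verified numerically for $D = 2$ by constructing $v$ as $u$ plus a component orthogonal to the span of the $g_n$. A sketch (NumPy; the discretization and names are ours):

```python
# Sketch: numerical check of lemma 1. If u and v have identical inner
# products with the constraint functions g_n, then u - v is orthogonal
# to any linear combination l = sum_n a_n g_n.
import numpy as np

m = 360                                     # discretize the circle
alpha = np.linspace(0, 2 * np.pi, m, endpoint=False)
G = np.stack([np.cos(alpha), np.sin(alpha), np.ones(m)])  # g_1, g_2, g_3

def inner(x, y):                            # <x, y> = |V|^-1 sum_a x(a) y(a)
    return x @ y / m

rng = np.random.default_rng(0)
u = rng.random(m)
w = rng.random(m)
# v = u + (w with the span of the g_n projected out), so that
# <v, g_n> = <u, g_n> for every constraint function g_n
coeffs = np.linalg.lstsq(G.T, w, rcond=None)[0]
v = u + (w - G.T @ coeffs)

a = rng.random(3)
l = G.T @ a                                 # l = sum_n a_n g_n
print(abs(inner(u - v, l)))                 # ~0 up to round-off
```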
The quantity $\langle \mu+\lambda, \mu+\lambda\rangle$ we want to minimize is a generalized length, the solution $\mu$ is constrained to the hyperplane $P$ orthogonal to $L$, and $L$ contains the origin 0. Thus, we would intuitively expect the optimal solution $\mu^*$ to be close to the intersection of $P$ and $L$, that is, to resemble a linear combination of $g_{1,\ldots,N}$. The nonnegativity constraint on $\mu$ introduces complications that are handled in case 2 (see Figure 6). The proof of theorem 1 is the following:

Proof of Theorem 1. Let $\mu = \mu^* + \Delta$ for some $\Delta \in R(V)$ be another function satisfying all constraints in equation 4.1. Using the linearity and symmetry of the dot product, $\langle \mu+\lambda, \mu+\lambda\rangle = \langle \mu^*+\lambda, \mu^*+\lambda\rangle + 2\langle \mu^*+\lambda, \Delta\rangle + \langle \Delta, \Delta\rangle$. The term $\langle \Delta, \Delta\rangle$ is always nonnegative and becomes 0 only when $\Delta(\alpha) = 0$ for all $\alpha$. Thus, to prove that $\mu^*$ is the unique optimal solution, it is sufficient to show that $\langle \mu^*+\lambda, \Delta\rangle \ge 0$. We have to distinguish two cases, depending on whether the term in the truncation brackets is positive for all $\alpha$:

Case 1. Suppose $\sum_n a_n g_n(\alpha) \ge \lambda$ for all $\alpha \in V$, that is, $\mu^*(\alpha) = \sum_n a_n g_n(\alpha) - \lambda$. Then $\langle \mu^*+\lambda, \Delta\rangle = \langle \sum_n a_n g_n, \mu - \mu^*\rangle = 0$ from lemma 1.

Case 2. Consider the function $\tilde{\mu}(\alpha) = \lceil \sum_n a_n g_n(\alpha) - \lambda\rceil$, which has the property that $\tilde{\mu} + \mu^* = \sum_n a_n g_n - \lambda$. With this definition and using lemma 1, $0 = \langle \sum_n a_n g_n, \mu - \mu^*\rangle = \langle \tilde{\mu} + \mu^* + \lambda, \Delta\rangle = \langle \tilde{\mu}, \Delta\rangle + \langle \mu^* + \lambda, \Delta\rangle$. Then $\langle \mu^*+\lambda, \Delta\rangle = -\langle \tilde{\mu}, \Delta\rangle$, and it is sufficient to show that $\langle \tilde{\mu}, \Delta\rangle \le 0$. Let $\zeta \subset V$ be the subset of $V$ on which $\sum_n a_n g_n(\alpha) < \lambda$. Then $\tilde{\mu}(\alpha \in \zeta) < 0$ and $\tilde{\mu}(\alpha \notin \zeta) = 0$. Since $\mu = \mu^* + \Delta$ satisfies $\mu \ge 0$ and by definition $\mu^*(\alpha \in \zeta) = 0$, we have $\Delta(\alpha \in \zeta) \ge 0$. The dot product $\langle \tilde{\mu}, \Delta\rangle$ can be evaluated by parts on the two sets $\alpha \in \zeta$ and $\alpha \notin \zeta$. Since $\tilde{\mu}(\alpha)\Delta(\alpha) \le 0$ for $\alpha \in \zeta$, and $\tilde{\mu}(\alpha)\Delta(\alpha) = 0$ for $\alpha \notin \zeta$, it follows that $\langle \tilde{\mu}, \Delta\rangle \le 0$.

Acknowledgments

I thank Zoubin Ghahramani and Peter Dayan for their in-depth reading of the manuscript and numerous suggestions.

References

Amirikian, B., & Georgopoulos, A. (2000). Directional tuning profiles of motor cortical cells. Neuroscience Research, 36, 73–79.

Burke, R. E., Rudomin, P., & Zajac, F. E. (1976). The effect of activation history on tension production by individual muscle units. Brain Research, 109, 515–529.

Caminiti, R., Johnson, P., Galli, C., Ferraina, S., & Burnod, Y. (1991). Making arm movements within different parts of space: The premotor and motor cortical representation of a coordinate system for reaching to visual targets. Journal of Neuroscience, 11(5), 1182–1197.

Cisek, P., & Scott, S. H. (1998). Cooperative action of mono- and bi-articular arm muscles during multi-joint posture and movement tasks in monkeys. Society for Neuroscience Abstracts, 164.4.

Clancy, E., & Hogan, N. (1999). Probability density of the surface electromyogram and its relation to amplitude detectors. IEEE Transactions on Biomedical Engineering, 46(6), 730–739.

Coltz, J., Johnson, M., & Ebner, T. (1999). Cerebellar Purkinje cell simple spike discharge encodes movement velocity in primates during visuomotor tracking. Journal of Neuroscience, 19(5), 1782–1803.

Conway, B. A., Halliday, D. M., Farmer, S. F., Shahani, U., Maas, P., Weir, A. I., & Rosenberg, J. R. (1995). Synchronization between motor cortex and spinal motoneuronal pool during the performance of a maintained motor task in man. J. Physiol. (Lond.), 489, 917–924.

DeLuca, C. J. (1995). Decomposition of the EMG signal into constituent motor unit action potentials. Muscle and Nerve, 18, 1492–1493.

Fetz, E. E., & Cheney, P. D. (1980).
Postspike facilitation of forelimb muscle activity by primate corticomotoneuronal cells. Journal of Neurophysiology, 44, 751–772.

Georgopoulos, A., Kalaska, J., Caminiti, R., & Massey, J. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 2(11), 1527–1537.

Gordon, J., Ghilardi, M. F., Cooper, S., & Ghez, C. (1994). Accuracy of planar reaching movements. Exp. Brain Res., 99, 97–130.

Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.

Herrmann, U., & Flanders, M. (1998). Directional tuning of single motor units. Journal of Neuroscience, 18(20), 8402–8416.
Cosine Tuning Minimizes Motor Errors
Emanuel Todorov
Received March 23, 2000; accepted October 1, 2001.
NOTE
Communicated by Michael Jordan
SMEM Algorithm Is Not Fully Compatible with Maximum-Likelihood Framework Akihiro Minagawa
[email protected] Norio Tagawa
[email protected] Toshiyuki Tanaka
[email protected] Graduate School of Engineering, Tokyo Metropolitan University, Hachioji, Tokyo, 192-0397 Japan
The expectation-maximization (EM) algorithm with split-and-merge operations (SMEM algorithm) proposed by Ueda, Nakano, Ghahramani, and Hinton (2000) is a nonlocal searching method, applicable to mixture models, for relaxing the local optimum property of the EM algorithm. In this article, we point out that the SMEM algorithm uses the acceptance-rejection evaluation method, which may pick up a distribution with smaller likelihood, and demonstrate that an increase in likelihood can then be guaranteed only by comparing log likelihoods.
1 Introduction
The expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is widely applied in many fields, including pattern and image recognition, as a general solution for maximum-likelihood estimation with hidden variables (Jordan & Jacobs, 1994; McLachlan & Basford, 1988). Although the EM algorithm is an iterative method, its advantage over other methods is that it guarantees local optimality. However, the EM algorithm does not guarantee global optimality, and the solution depends on the initial parameter values. To circumvent this problem in practice, some heuristics are required. Ueda, Nakano, Ghahramani, and Hinton (2000) proposed the split-and-merge EM (SMEM) algorithm for mixture models to overcome the local-optimality problem of the EM algorithm. This method is a heuristic approach in which a nonlocal search for the solution is incorporated into the EM algorithm by a split-and-merge operation applied to three components of a mixture model.
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1261–1266 (2002)
In this note, we discuss the SMEM algorithm from a theoretical viewpoint within the framework of maximum-likelihood estimation. We demonstrate that the SMEM algorithm, in its current form, is not fully compatible with the maximum-likelihood framework assumed by the EM algorithm; that is, it may throw away the globally optimal solution even when that solution is obtained. We then introduce an alternative approach that avoids this problem. It should be mentioned that Bae, Lee, and Lee (2000) apparently used a similar approach, but they did not discuss the problem explicitly or how to cope with it. Therefore, in this article, we make clear precisely what the problem is and how the alternative approach overcomes it.

2 EM Algorithm and SMEM Algorithm for Mixture Model
Given incomplete data, the EM algorithm maximizes the likelihood indirectly by increasing the value of the Q-function, which is the conditional expectation of the log-likelihood function of the complete data. Let $X$ be the set of observed variables, $Y$ the set of hidden variables, and $\Theta$ the set of model parameters. The log-likelihood function to be maximized can then be written in marginal form with respect to the hidden variables as

$$\mathcal{L}(\Theta; X = \mathcal{X}) = \ln \int p(X = \mathcal{X} \mid Y; \Theta)\, p(Y; \Theta)\, dY. \tag{2.1}$$

In the EM algorithm, under appropriate regularity conditions (Wu, 1983), the maximum-likelihood estimator can be obtained by making use of the correspondence between the extremum points of equation 2.1 and the stationary points of the EM iteration. The latter is written as

$$\hat{\Theta}^{(t)} = \arg\max_{\Theta} Q(\Theta; \hat{\Theta}^{(t-1)}), \tag{2.2}$$

where $Q(\Theta; \hat{\Theta}^{(t-1)})$ denotes the Q-function, defined as

$$Q(\Theta; \hat{\Theta}^{(t-1)}) \equiv E_{p(Y \mid X = \mathcal{X}; \hat{\Theta}^{(t-1)})}\bigl[\ln p(X = \mathcal{X} \mid Y; \Theta)\, p(Y; \Theta)\bigr]. \tag{2.3}$$

The log likelihood and the Q-function are related to each other by

$$\begin{aligned}
\mathcal{L}(\Theta; X = \mathcal{X}) &= E_{p(Y \mid X = \mathcal{X}; \hat{\Theta})}\,\mathcal{L}(\Theta; X = \mathcal{X}) \\
&= E_{p(Y \mid X = \mathcal{X}; \hat{\Theta})}\bigl\{\ln p(X = \mathcal{X}, Y; \Theta) - \ln p(Y \mid X = \mathcal{X}; \Theta)\bigr\} \\
&= Q(\Theta; \hat{\Theta}) + H(\Theta; \hat{\Theta}),
\end{aligned} \tag{2.4}$$

where $H(\Theta; \hat{\Theta}) = -E_{p(Y \mid X = \mathcal{X}; \hat{\Theta})}[\ln p(Y \mid X = \mathcal{X}; \Theta)]$ is just the entropy of the conditional distribution of the hidden variables conditioned on the observed data. Subtracting $\mathcal{L}(\hat{\Theta}^{(t-1)}; X = \mathcal{X})$ from $\mathcal{L}(\hat{\Theta}^{(t)}; X = \mathcal{X})$ gives a difference in Q-function values plus a Kullback-Leibler (KL) divergence, which describes the difference between the self- and cross-entropy terms, as follows:

$$\begin{aligned}
\mathcal{L}(\hat{\Theta}^{(t)}; X = \mathcal{X}) - \mathcal{L}(\hat{\Theta}^{(t-1)}; X = \mathcal{X})
&= \bigl\{Q(\hat{\Theta}^{(t)}; \hat{\Theta}^{(t-1)}) - Q(\hat{\Theta}^{(t-1)}; \hat{\Theta}^{(t-1)})\bigr\} \\
&\quad + \bigl\{H(\hat{\Theta}^{(t)}; \hat{\Theta}^{(t-1)}) - H(\hat{\Theta}^{(t-1)}; \hat{\Theta}^{(t-1)})\bigr\} \\
&= Q(\hat{\Theta}^{(t)}; \hat{\Theta}^{(t-1)}) - Q(\hat{\Theta}^{(t-1)}; \hat{\Theta}^{(t-1)}) \\
&\quad + D\bigl(p(Y \mid X = \mathcal{X}; \hat{\Theta}^{(t-1)}) \,\big\|\, p(Y \mid X = \mathcal{X}; \hat{\Theta}^{(t)})\bigr).
\end{aligned} \tag{2.5}$$
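The nondecrease of the likelihood implied by equation 2.5 can be illustrated numerically. The following is a minimal sketch, not code from the article: it runs EM on an invented one-dimensional two-component gaussian mixture and checks that the log likelihood never drops between iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two overlapping 1-D gaussian clusters (invented for illustration).
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])

# Mixture parameters: weights w, means mu, variances var (K = 2 components).
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_lik(w, mu, var):
    # L(Theta) = sum_n ln sum_k w_k N(x_n | mu_k, var_k)
    comp = -0.5 * ((x[:, None] - mu) ** 2) / var - 0.5 * np.log(2 * np.pi * var)
    return np.logaddexp.reduce(np.log(w) + comp, axis=1).sum()

hist = [log_lik(w, mu, var)]
for _ in range(50):
    # E step: posterior responsibilities p(y = k | x_n; Theta^(t-1)).
    comp = -0.5 * ((x[:, None] - mu) ** 2) / var - 0.5 * np.log(2 * np.pi * var)
    logr = np.log(w) + comp
    logr -= np.logaddexp.reduce(logr, axis=1, keepdims=True)
    r = np.exp(logr)
    # M step: Theta^(t) = argmax_Theta Q(Theta; Theta^(t-1)).
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    hist.append(log_lik(w, mu, var))

# Equation 2.5: each EM iteration cannot decrease the likelihood.
assert all(b >= a - 1e-6 for a, b in zip(hist, hist[1:]))
```

The monotone increase holds regardless of the (deliberately poor) initial values, which is exactly the property the SMEM acceptance test must preserve.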
Since a KL divergence is always nonnegative, the EM iteration guarantees that the likelihood cannot decrease.

The SMEM algorithm was proposed to overcome the problem of local optimality in the EM algorithm for mixture models by using a nonlocal search. Here, the SMEM algorithm is described in brief (for a detailed description, refer to Ueda et al., 2000). First, the ordinary EM algorithm is run, and as its convergence result a stationary point $\Theta^*$, which satisfies $\Theta^* = \arg\max_{\Theta} Q(\Theta; \Theta^*)$, is obtained. Next, the split-and-merge operation is performed on $\Theta^*$. In this operation, three components are selected out of the mixture components as a split-and-merge candidate; one of them is split into two, and the other two are merged. Then two steps, called the partial and the full EM steps, are executed in turn: the partial EM step is applied to the three new components, and the full EM step is applied to all components of the mixture. The full EM step is therefore equivalent to ordinary EM iteration, and consequently, using the result of the preceding partial EM step as the initial value, a stationary point $\Theta^{**}$, which satisfies $\Theta^{**} = \arg\max_{\Theta} Q(\Theta; \Theta^{**})$ and is different from $\Theta^*$, is obtained as the convergence result. The split-and-merge operation and the subsequent EM steps are attempted for $C_{\max}$ split-and-merge candidates, which are appropriately ranked. When the Q-function value $Q(\Theta^{**}; \Theta^{**})$ corresponding to the new answer $\Theta^{**}$ is larger than the value $Q(\Theta^*; \Theta^*)$ corresponding to $\Theta^*$, the answer $\Theta^{**}$ is accepted, and the above procedure is iterated using $\Theta^{**}$. If $Q(\Theta^{**}; \Theta^{**}) < Q(\Theta^*; \Theta^*)$ holds for all $C_{\max}$ answers, the current $\Theta^*$ is returned as the final result.

3 Problem of SMEM Algorithm and Its Improvement
In the EM algorithm, since the entropy terms are calculated for the same posterior probability, an increase in likelihood is guaranteed because their difference yields a KL divergence, which is always nonnegative. In the SMEM algorithm, by contrast, the Q-functions evaluated in the acceptance-rejection test are calculated by taking expectations with respect to different posterior probabilities for $\Theta^*$ and $\Theta^{**}$. This means that if

$$Q(\Theta^{**}; \Theta^{**}) - Q(\Theta^*; \Theta^*) = \mathcal{L}(\Theta^{**}; X = \mathcal{X}) - \mathcal{L}(\Theta^*; X = \mathcal{X}) - \bigl\{H(\Theta^{**}; \Theta^{**}) - H(\Theta^*; \Theta^*)\bigr\} > 0 \tag{3.1}$$

holds, then $\Theta^{**}$ is accepted as the new answer. Unlike equation 2.5, the subtraction between the two self-entropies on the right-hand side of equation 3.1 does not yield a KL divergence. Therefore, this part is not necessarily positive, and an increase in the Q-function value does not correspond to an increase in likelihood in the SMEM algorithm. This shows that the acceptance-rejection evaluation of the original SMEM algorithm is not fully compatible with the maximum-likelihood estimation framework. In particular, the global maximizer of the likelihood may be rejected when the entropy of the corresponding conditional distribution is relatively large. From this observation, an evaluation method that guarantees an increase in likelihood, and thus avoids incorrect rejection of the global maximum, is obtained by using the log likelihood itself in place of the Q-function when judging whether each new answer should be accepted.

Since the computational cost of the entropy is of the same order as that of the Q-function used in the original evaluation method, that is, (number of components) × (number of data), and the log likelihood is then easily obtained, the computational complexity of the corrected evaluation method remains at the same level as that of the original one. Since the difference in the acceptance-rejection judgment between these evaluation methods arises from the entropy term, the difference becomes apparent when the number of mixture components is small or when a major component with a large number of data is split or merged. Otherwise, we expect that there is so little difference in the whole entropy, even after the three components have changed, that the original SMEM algorithm using the Q-function converges near the global optimum.
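The decomposition behind equation 3.1 can be made concrete. The sketch below, with invented toy data, evaluates $\mathcal{L}$, the posterior entropy $H$, and $Q = \mathcal{L} - H$ at a fixed parameter point $\Theta = \hat{\Theta}$ of a two-component gaussian mixture, and checks that $H \geq 0$: the Q-function sits below the log likelihood by exactly the entropy term, which is why comparing Q values is not the same as comparing log likelihoods.

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented toy data: two 1-D gaussian clusters.
x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])

def l_h_q(w, mu, var):
    # Log joint per component: ln w_k + ln N(x_n | mu_k, var_k).
    lj = np.log(w) - 0.5 * ((x[:, None] - mu) ** 2) / var \
         - 0.5 * np.log(2 * np.pi * var)
    ln_px = np.logaddexp.reduce(lj, axis=1)   # ln p(x_n; Theta)
    L = ln_px.sum()                           # log likelihood
    logpost = lj - ln_px[:, None]             # ln p(y = k | x_n; Theta)
    post = np.exp(logpost)
    H = -(post * logpost).sum()               # entropy of the posterior
    Q = (post * lj).sum()                     # Q(Theta; Theta)
    return L, H, Q

L, H, Q = l_h_q(np.array([0.5, 0.5]), np.array([0.0, 3.0]), np.array([1.0, 1.0]))
# Equation 2.4 with Theta = Theta-hat: L = Q + H, and H >= 0 for discrete
# hidden variables, so Q <= L always.
assert abs(L - (Q + H)) < 1e-8 and H >= 0.0 and Q <= L
```

Two fits with nearly equal $\mathcal{L}$ but different posterior entropies can therefore have their Q values ranked in the opposite order, which is precisely the failure mode described above.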
Therefore, the SMEM algorithm with the original evaluation method can be considered an approximation of the optimal estimation of mixture models that ignores this difference, whereas the algorithm with the corrected evaluation method finally returns the value giving the largest likelihood among those encountered during its execution. It should be noted that since the discussion in this article is based on the maximum-likelihood estimation framework assumed by the EM algorithm and is theoretically oriented, the possibility that the original SMEM algorithm with the incompatible evaluation method works well for practical problems is beyond the scope of this article.

4 Numerical Example
Here, a toy example is presented to demonstrate that the ordering of the log likelihood and of the Q-function value may be reversed.

Table 1: Input Data Used in the Simulation.

    Component    $\bar{\mu}$    $\bar{\sigma}^2$    Number
    1            0.0            30.0                300
    2            10.0           30.0                300
    3            60.0           30.0                100

Table 2: Log Likelihood and Q-Function Values of Two Estimated Distributions.

    Estimated Distribution    Log Likelihood    Q-Function
    a                         -2655.1           -2898.7
    b                         -2656.8           -2677.8

[Figure 1: Distribution of input data and estimated distributions obtained from different initial values. The plot shows the target distribution, the input data histogram, and estimated distributions (a) and (b); the x-axis is $X$ (about -20 to 80), and the y-axis is frequency/distribution.]

The EM algorithm is executed using two different initial values for a mixture model with three components, as shown in Table 1; two stationary points are obtained. Figure 1 shows the target distribution, the input data histogram, and the two estimated mixture distributions (estimated distributions a and b). In estimated distribution a, two components control the left-side peak, whereas in estimated distribution b, one component controls the left-side peak. Table 2 shows the values of both the log likelihood and the
Q-function of each estimated distribution. Note that the log likelihood is larger for estimated distribution a, whereas the Q-function is larger for estimated distribution b. This implies that if estimated distribution a is generated by performing the split-and-merge operation on estimated distribution b, distribution a is not accepted by the original SMEM algorithm. On the other hand, using the corrected evaluation method, distribution a is correctly accepted according to its log likelihood.

Acknowledgments
We thank N. Ueda for his helpful comments.

References

Bae, U., Lee, T., & Lee, S. (2000). Blind signal separation in teleconferencing using the ICA mixture model. Electronics Letters, 37(7), 680–682.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering. New York: Dekker.
Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. E. (2000). SMEM algorithm for mixture models. Neural Computation, 12, 2109–2128.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11(1), 95–103.

Received April 13, 2001; accepted October 2, 2001.
NOTE
Communicated by Samy Bengio
A Note on the Decomposition Methods for Support Vector Regression Shuo-Peng Liao
[email protected] Hsuan-Tien Lin
[email protected] Chih-Jen Lin
[email protected] Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
The dual formulation of support vector regression involves two closely related sets of variables. When the decomposition method is used, many existing approaches use pairs of indices from these two sets as the working set. Basically, they select a base set first and then expand it so that all indices are pairs. This makes the implementation different from that for support vector classification. In addition, a larger optimization subproblem has to be solved in each iteration. We provide theoretical proofs and conduct experiments to show that using the base set as the working set leads to similar convergence (number of iterations). Therefore, by using a smaller working set while keeping a similar number of iterations, the program can be simpler and more efficient.

1 Introduction
Given a set of data points $\{(x_1, z_1), \ldots, (x_l, z_l)\}$, where $x_i \in R^n$ is an input and $z_i \in R$ is a target output, a major form for solving support vector regression (SVR) is the following optimization problem (Vapnik, 1998):

$$\begin{aligned}
\min \quad & \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \epsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} z_i (\alpha_i - \alpha_i^*) \\
\text{subject to} \quad & \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C, \; i = 1, \ldots, l,
\end{aligned} \tag{1.1}$$

where $C$ is the upper bound, $Q_{ij} \equiv \phi(x_i)^T \phi(x_j)$, $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers associated with the $i$th datum $x_i$, and $\epsilon$ is the error that users can tolerate. Note that the training vectors $x_i$ are mapped into a higher-dimensional
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1267–1281 (2002)
space by the function $\phi$. An important property is that for any optimal solution, $\alpha_i \alpha_i^* = 0$, $i = 1, \ldots, l$. Due to the density of $Q$, the decomposition method is currently the major method for solving equation 1.1 (Smola & Schölkopf, 1998; Keerthi, Shevade, Bhattacharyya, & Murthy, 2000; Laskov, 2002). It is an iterative process where, in each iteration, the index set of variables is separated into two sets, $B$ and $N$, where $B$ is the working set. In that iteration, variables corresponding to $N$ are fixed, while a subproblem on the variables corresponding to $B$ is minimized. Following approaches for support vector classification, there are some methods for selecting the working set. Many existing approaches for regression first use these methods to find a set of variables, called the base set here; then they expand the base set so that all elements are pairs. Here we define the expanded set as the pair set. For example, if $\{\alpha_i, \alpha_j^*\}$ are chosen first, they include $\{\alpha_i^*, \alpha_j\}$ in the working set. Then the following subproblem of four variables $(\alpha_i, \alpha_i^*, \alpha_j, \alpha_j^*)$ is solved:
$$\begin{aligned}
\min \quad & \frac{1}{2}
\begin{bmatrix} \alpha_i - \alpha_i^* \\ \alpha_j - \alpha_j^* \end{bmatrix}^T
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ji} & Q_{jj} \end{bmatrix}
\begin{bmatrix} \alpha_i - \alpha_i^* \\ \alpha_j - \alpha_j^* \end{bmatrix}
+ \bigl(Q_{i,N}(\alpha_N - \alpha_N^*) + z_i\bigr)(\alpha_i - \alpha_i^*) \\
& + \bigl(Q_{j,N}(\alpha_N - \alpha_N^*) + z_j\bigr)(\alpha_j - \alpha_j^*)
+ \epsilon (\alpha_i + \alpha_i^* + \alpha_j + \alpha_j^*) \\
\text{subject to} \quad & (\alpha_i - \alpha_i^*) + (\alpha_j - \alpha_j^*) = -\sum_{t \in N} (\alpha_t - \alpha_t^*), \\
& 0 \le \alpha_i, \alpha_j, \alpha_i^*, \alpha_j^* \le C.
\end{aligned} \tag{1.2}$$

Note that $\alpha_N$ and $\alpha_N^*$ are the fixed elements corresponding to $N = \{t \mid 1 \le t \le l, t \ne i, t \ne j\}$. A reason for selecting pairs is to maintain $\alpha_i \alpha_i^* = 0$, $i = 1, \ldots, l$, throughout all iterations. Hence, the number of nonzero variables during the iterations can be kept small. However, Lin (2001a, theorem 4.1) has shown that for some existing work (e.g., Keerthi et al., 2000; Laskov, 2002), if the base set is used as the working set, the property $\alpha_i \alpha_i^* = 0$, $i = 1, \ldots, l$, still holds. In section 2, we discuss this in more detail.

Recently there have been implementations that do not use pairs of indices—for example, LIBSVM (Chang & Lin, 2001), SVMTorch (Collobert & Bengio, 2001), and mySVM (Rüping, 2000). A question immediately raised concerns the performance of these two approaches—the base approach and the pair approach. The pair approach solves a larger subproblem in each iteration, so there may be fewer iterations; however, a larger subproblem takes more time, so the cost of each iteration is higher. Collobert and Bengio (2001) have stated that working with pairs of variables would force the algorithm to do many computations with null variables until the end of the optimization process. In section 3, we elaborate
on this in a detailed proof. First, we consider approaches with the smallest size of working set (two and four elements for the base and pair approaches, respectively), where the analytic solution of the subproblem is readily available from sequential minimal optimization (SMO) (Platt, 1998). From mathematical explanations, we show that while solving the subproblems of the pair set containing four variables, in most cases only the two variables in the base set are updated. Therefore, the numbers of iterations of the base and the pair approaches are nearly the same. In addition, for larger working sets, we prove that after some finite number of iterations, the solution of the subproblem using only the base set is already optimal for the subproblem using the pair set. These results give us theoretical justification that it is not necessary to use pairs of variables. In section 4, we conduct experiments to demonstrate the validity of our analysis. Then in section 5, we provide some conclusions and a discussion. There are other decomposition approaches for support vector regression (for example, Flake & Lawrence, 2002). They deal with different situations, which will not be discussed here.

2 Working Set Selection
Here we consider the working set selections from Joachims (1998) and Keerthi, Shevade, Bhattacharyya, and Murthy (2001), which were originally designed for classification, and apply them to SVR. To make SVR similar to the form of classification, we define the following $2l$ by 1 vectors:

$$\alpha^{(*)} \equiv \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix} \quad \text{and} \quad y_i \equiv \begin{cases} +1, & i = 1, \ldots, l, \\ -1, & i = l + 1, \ldots, 2l. \end{cases} \tag{2.1}$$
Then the regression problem, equation 1.1, can be reformulated as

$$\begin{aligned}
\min \quad & f(\alpha^{(*)}) = \frac{1}{2} (\alpha^{(*)})^T
\begin{bmatrix} Q & -Q \\ -Q & Q \end{bmatrix} \alpha^{(*)}
+ \begin{bmatrix} \epsilon e^T + z^T, & \epsilon e^T - z^T \end{bmatrix} \alpha^{(*)} \\
\text{subject to} \quad & 0 \le \alpha_i^{(*)} \le C, \; i = 1, \ldots, 2l, \qquad y^T \alpha^{(*)} = 0.
\end{aligned} \tag{2.2}$$
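As a quick sanity check on this reformulation, one can verify that the objective of equation 2.2, written with the block matrix and the stacked vector $\alpha^{(*)} = [\alpha; \alpha^*]$, agrees with the objective of equation 1.1 at any point. The sketch below uses randomly generated toy values (not data from the note):

```python
import numpy as np

rng = np.random.default_rng(2)
l, eps, C = 6, 0.1, 1.0
phi = rng.standard_normal((l, 3))        # toy feature vectors phi(x_i)
Q = phi @ phi.T                          # Q_ij = phi(x_i)^T phi(x_j)
z = rng.standard_normal(l)
a = rng.uniform(0, C, l)                 # alpha
a_star = rng.uniform(0, C, l)            # alpha*

# Objective of equation 1.1.
d = a - a_star
f1 = 0.5 * d @ Q @ d + eps * (a + a_star).sum() + z @ d

# Objective of equation 2.2 on the stacked vector alpha^(*) = [alpha; alpha*].
astack = np.concatenate([a, a_star])
Qbig = np.block([[Q, -Q], [-Q, Q]])
lin = np.concatenate([eps + z, eps - z])  # [eps*e + z; eps*e - z]
f2 = 0.5 * astack @ Qbig @ astack + lin @ astack

assert abs(f1 - f2) < 1e-9
```

The agreement holds for any $\alpha, \alpha^*$, since the block quadratic form collapses to $\frac{1}{2}(\alpha - \alpha^*)^T Q (\alpha - \alpha^*)$.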
Now $f$ is the objective function of equation 2.2. Practically, we can use the Karush-Kuhn-Tucker (KKT) condition to test whether a given $\alpha^{(*)}$ is an optimal solution of equation 2.2. If there exists a number $b$ such that for all $i = 1, 2, \ldots, 2l$,

$$\nabla f(\alpha^{(*)})_i + b y_i \ge 0 \quad \text{if } \alpha_i^{(*)} = 0, \tag{2.3a}$$
$$\nabla f(\alpha^{(*)})_i + b y_i \le 0 \quad \text{if } \alpha_i^{(*)} = C, \tag{2.3b}$$
$$\nabla f(\alpha^{(*)})_i + b y_i = 0 \quad \text{if } 0 < \alpha_i^{(*)} < C, \tag{2.3c}$$
then a feasible $\alpha^{(*)}$ is optimal for equation 2.2. Note that the range of $b$ can be determined by

$$m(\alpha^{(*)}) \equiv \max\Bigl( \max_{\substack{1 \le t \le 2l \\ \alpha_t^{(*)} < C,\; y_t = 1}} -y_t \nabla f(\alpha^{(*)})_t,\; \max_{\substack{1 \le t \le 2l \\ \alpha_t^{(*)} > 0,\; y_t = -1}} -y_t \nabla f(\alpha^{(*)})_t \Bigr), \tag{2.4}$$

$$M(\alpha^{(*)}) \equiv \min\Bigl( \min_{\substack{1 \le t \le 2l \\ \alpha_t^{(*)} > 0,\; y_t = 1}} -y_t \nabla f(\alpha^{(*)})_t,\; \min_{\substack{1 \le t \le 2l \\ \alpha_t^{(*)} < C,\; y_t = -1}} -y_t \nabla f(\alpha^{(*)})_t \Bigr). \tag{2.5}$$

That is, a feasible $\alpha^{(*)}$ is an optimal solution if and only if $m(\alpha^{(*)}) \le b \le M(\alpha^{(*)})$, or equivalently,

$$M(\alpha^{(*)}) - m(\alpha^{(*)}) \ge 0. \tag{2.6}$$
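The optimality check of equations 2.4 through 2.6 is straightforward to implement. The following sketch (a hand-made feasible point, not an example from the note) computes $m(\alpha^{(*)})$ and $M(\alpha^{(*)})$ from a gradient vector and tests equation 2.6:

```python
import numpy as np

def m_and_M(grad, astack, y, C):
    # Candidate sets of equations 2.4 and 2.5, then max / min of -y_t grad_t.
    g = -y * grad
    m_cand = ((astack < C) & (y == 1)) | ((astack > 0) & (y == -1))
    M_cand = ((astack > 0) & (y == 1)) | ((astack < C) & (y == -1))
    return g[m_cand].max(), g[M_cand].min()

# Hand-made feasible point with 2l = 4 variables (l = 2).
y = np.array([1.0, 1.0, -1.0, -1.0])
astack = np.array([0.0, 0.5, 0.5, 0.0])
grad = np.array([0.2, -0.1, 0.1, 0.3])
m, M = m_and_M(grad, astack, y, C=1.0)
# Equation 2.6: the point is optimal if and only if M - m >= 0.
assert M - m >= 0  # here m = M = 0.1, so this point is optimal
```

Flipping the sign of one gradient entry makes $m > M$, which is exactly the violated-KKT situation that drives the working-set selection below.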
For convenience, we define the candidates of $m(\alpha^{(*)})$ as the set of all indices $t$, $1 \le t \le 2l$, that satisfy $\alpha_t^{(*)} < C, y_t = 1$ or $\alpha_t^{(*)} > 0, y_t = -1$. Similarly, we can define the candidates of $M(\alpha^{(*)})$. At the beginning of iteration $k$, let $\alpha^{(*),k} = [\alpha^k, (\alpha^*)^k]^T$ be the vector that we are working on. Then we denote $m^k \equiv m(\alpha^{(*),k})$ and $M^k \equiv M(\alpha^{(*),k})$. Also, let $\arg m^k$ be the subset of indices $t$ in the candidates of $m^k$ such that $-y_t \nabla f(\alpha^{(*),k})_t = m(\alpha^{(*),k})$. Similarly, we define $\arg M^k$. Thus, during the iterations of the decomposition method, $\alpha^{(*),k}$ is not yet optimal, so

$$m^k > M^k, \text{ for all } k. \tag{2.7}$$

If we would like to select two elements as the working set, intuitively we tend to choose indices $i$ and $j$ that satisfy

$$i \in \arg m^k \text{ and } j \in \arg M^k, \tag{2.8}$$

since they cause the maximal violation of the KKT condition. A systematic way to select a larger working set in each iteration is as follows. If $q$, an even number, is the size of the working set, $q/2$ indices are sequentially selected from the largest $-y_i \nabla f(\alpha^{(*)})_i$ values to the smaller ones in the candidate set of $m^k$, that is,

$$-y_{i_1} \nabla f(\alpha^{(*),k})_{i_1} \ge \cdots \ge -y_{i_{q/2}} \nabla f(\alpha^{(*),k})_{i_{q/2}},$$

where $i_1 \in \arg m^k$. The other $q/2$ indices are sequentially selected from the smallest $-y_i \nabla f(\alpha^{(*)})_i$ values to the larger ones in the candidate set of $M^k$, that is,

$$-y_{j_{q/2}} \nabla f(\alpha^{(*),k})_{j_{q/2}} \ge \cdots \ge -y_{j_1} \nabla f(\alpha^{(*),k})_{j_1},$$
where $j_1 \in \arg M^k$. Also, we have

$$-y_{j_{q/2}} \nabla f(\alpha^{(*),k})_{j_{q/2}} < -y_{i_{q/2}} \nabla f(\alpha^{(*),k})_{i_{q/2}}, \tag{2.9}$$

to ensure that the intersection of these two groups is empty. Thus, if $q$ is large, sometimes the actual number of selected indices may be less than $q$. Note that this is the same as the working set selection in Joachims (1998). However, the original derivation in Joachims was from the concept of feasible directions for constrained optimization problems, not from the violation of the KKT condition.

After the base set of $q$ indices is selected, earlier approaches (Keerthi et al., 2000; Laskov, 2002) expand the set so that all elements in it are pairs. The reason is to keep the property that $\alpha_i^k (\alpha^*)_i^k = 0$, $i = 1, \ldots, l$, for all $k$. However, when the elements in the base set are used directly, the following theorem has been proved in Lin (2001a, theorem 4.1):

Theorem 1. If the initial solution is zero, then $\alpha_i^k (\alpha^*)_i^k = 0$, $i = 1, \ldots, l$, for all $k$.
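A simplified sketch of the base-set selection described above might look as follows. The helper is invented for illustration (it is not the authors' implementation, and it checks the disjointness condition of equation 2.9 only pairwise):

```python
import numpy as np

def select_base_set(grad, astack, y, C, q):
    # Rank -y_t grad_t over the candidate sets of equations 2.4 and 2.5,
    # taking q/2 indices from the top of the m side and q/2 from the
    # bottom of the M side.
    g = -y * grad
    m_cand = np.where(((astack < C) & (y == 1)) | ((astack > 0) & (y == -1)))[0]
    M_cand = np.where(((astack > 0) & (y == 1)) | ((astack < C) & (y == -1)))[0]
    top = m_cand[np.argsort(-g[m_cand])][: q // 2]   # largest -y grad first
    bot = M_cand[np.argsort(g[M_cand])][: q // 2]    # smallest -y grad first
    # Keep only pairs that still violate the KKT condition (cf. equation 2.9).
    return [(int(i), int(j)) for i, j in zip(top, bot) if g[j] < g[i]]

# Hand-made point: all variables at zero, so every index is a candidate
# on its own side.
y = np.array([1.0, 1.0, -1.0, -1.0])
astack = np.zeros(4)
grad = np.array([0.5, -0.5, 0.2, -0.2])
assert select_base_set(grad, astack, y, C=1.0, q=2) == [(1, 3)]
```

For $q = 2$ this reduces to the maximal-violating pair of equation 2.8; larger $q$ simply takes more indices from each ranked list.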
Hence we know that $\alpha_i^k (\alpha^*)_i^k = 0$ is not a particular advantage of using pairs of indices.

Another important issue for the decomposition method is the stopping criterion. From equation 2.6, a natural choice of stopping criterion is

$$M^k - m^k \ge -\delta, \tag{2.10}$$

where $\delta$, the stopping tolerance, is a small positive number. For $q = 2$, equation 2.10 is the same as

$$-y_j \nabla f(\alpha^{(*),k})_j - (-y_i \nabla f(\alpha^{(*),k})_i) \ge -\delta, \tag{2.11}$$

where $i, j$ are selected by equation 2.8. Note that the convergence of the decomposition method under some conditions on the kernel matrix $Q$ is shown in Lin (2001a) for the base approach. Some theoretical justification for the use of the stopping criterion, equation 2.10, in the decomposition method is given in Lin (2001b). No convergence proof has been made for the pair approach, but we will assume convergence for our analyses.

3 Number of Iterations
In this section, we discuss the relationship between the solutions of the subproblems using the base and the pair approaches. The discussion is divided into two parts. First, we consider approaches with the smallest size of working set (two and four elements for both approaches). We show that for most iterations, the optimal solution of the subproblem using the base set is already optimal for the subproblem using the pair set, so the difference between the numbers of iterations of the base and pair approaches should not be large. Then we consider larger working sets. Although the result is not as elegant as in the first part, we can still show that after enough iterations, the optimal solution of the subproblem using the base set is the same as the optimal solution of the subproblem using the pair set.

To start our proof, we state an important property of the difference between the $i$th and $(i+l)$th gradient elements. Consider $\alpha_i$ and $\alpha_i^*$, $1 \le i \le l$. We have
$$\begin{aligned}
\nabla f(\alpha^{(*)})_{i+l} &= -(Q(\alpha - \alpha^*))_i + \epsilon - z_i \\
&= -\nabla f(\alpha^{(*)})_i + 2\epsilon.
\end{aligned}$$

Note that $y_i = 1$ and $y_{i+l} = -1$ as defined in equation 2.1, so

$$-y_{i+l} \nabla f(\alpha^{(*)})_{i+l} = -y_i \nabla f(\alpha^{(*)})_i + 2\epsilon. \tag{3.1}$$
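Equation 3.1 is easy to confirm numerically. The sketch below (a randomly generated toy problem, invented for illustration) builds the gradient of $f$ from equation 2.2 and checks that the $(i+l)$th entries mirror the $i$th entries up to $2\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(3)
l, eps = 5, 0.1
phi = rng.standard_normal((l, 2))
Q = phi @ phi.T
z = rng.standard_normal(l)
a, a_star = rng.uniform(0, 1, l), rng.uniform(0, 1, l)

# Gradient of f in equation 2.2: Qbig @ astack + linear term.
astack = np.concatenate([a, a_star])
Qbig = np.block([[Q, -Q], [-Q, Q]])
lin = np.concatenate([eps + z, eps - z])
grad = Qbig @ astack + lin

# Equation 3.1: grad_{i+l} = -grad_i + 2*eps for every i.
assert np.allclose(grad[l:], -grad[:l] + 2 * eps)
```

This mirror relation is what forces the index pairs $(i, i+l)$ to behave rigidly in the selection arguments that follow.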
We will use this in the later analyses. When $q = 2$, the base set is selected from equation 2.8. It is easy to see that indices $i$ and $i + l$, where $1 \le i \le l$, cannot both be chosen at the same time. For example, if indices $i$ and $i + l$ were both selected from equation 2.8, then by equations 3.1 and 2.7, we would have $i + l \in \arg m^k$ and $i \in \arg M^k$. By equations 2.4 and 2.5, this means $\alpha_{i+l}^{(*),k} > 0$ and $\alpha_i^{(*),k} > 0$, which violates theorem 1. Therefore, $(i, i + l)$ cannot be chosen together.

Also, if one of $\alpha_i^{(*),k}$ and $\alpha_{i+l}^{(*),k}$ is selected, the other must be zero. We can prove this by contradiction. Without loss of generality, suppose $\alpha_i^{(*),k}$ is selected and $\alpha_{i+l}^{(*),k}$ is nonzero; then by theorem 1, $\alpha_i^{(*),k}$ must be zero. So by equations 2.4, 2.5, and 2.8, $i \in \arg m^k$. Moreover, by equation 2.4, both $i$ and $i + l$ are in the candidate set of $m^k$, and we must have $-y_i \nabla f(\alpha^{(*),k})_i \ge -y_{i+l} \nabla f(\alpha^{(*),k})_{i+l}$ since $i \in \arg m^k$. But this contradicts equation 3.1. Therefore, if one of $\alpha_i^{(*),k}$ and $\alpha_{i+l}^{(*),k}$ is selected, the other must be zero.

If any one of $(i, j)$, $(i, j + l)$, $(i + l, j)$, $(i + l, j + l)$, where $1 \le i, j \le l$, is chosen from equation 2.8, our goal is to see the difference between solving the two-variable subproblem and the four-variable subproblem on $(i, j, i + l, j + l)$. Without loss of generality, we consider the case where $(i, j)$ is chosen by equation 2.8. Then $(\alpha_i^{(*),k}, \alpha_j^{(*),k}, \alpha_{i+l}^{(*),k}, \alpha_{j+l}^{(*),k})$ are the corresponding variables at iteration $k$; from the earlier discussion, we know $\alpha_{i+l}^{(*),k} = \alpha_{j+l}^{(*),k} = 0$. After a two-variable subproblem on $\alpha_i^{(*)}$ and $\alpha_j^{(*)}$ is solved, we assume the new values are $(\bar{\alpha}_i^{(*),k}, \bar{\alpha}_j^{(*),k}, 0, 0)$. From equation 3.1 and the KKT condition, it is easy to see that if $\bar{\alpha}_i^{(*),k} > 0$ and $\bar{\alpha}_j^{(*),k} > 0$, then $(\bar{\alpha}_i^{(*),k}, \bar{\alpha}_j^{(*),k}, 0, 0)$ is already an optimal solution of the four-variable problem, equation 1.2.
Figure 1: Possible situation of plane changes.
Therefore, the only difference happens when there is a “jump” from the ( i, j) plane to another plane of two variables and the objective value of equation 1.2 can be further decreased. We illustrate this in Figure 1. In this gure, each square represents a plane of two nonzero variables. From the linear constraint (¤)
ai(¤) C aj
D ¡
X
(¤)
yt at ,
(3.2)
6 i, j tD
the two dashed parallel lines in Figure 1 show how the solution plane pos(¤) ,k (¤) , k (¤) ,k sibly changes. For example, after a becomes zero, if ( a Nj Ni , a N j , 0, 0) is not an optimal solution of equation 1.2, we may further reduce its objective value by entering the ( i, j¤ ) plane. We will check under what conditions (¤) ,k (¤) , k (a N i , aN j , 0, 0) is not an optimal solution of equation 1.2. (¤)
Since $\alpha_i^{(*)}$ and $\alpha_j^{(*)}$ are adjusted on the line, equation 3.2, we consider the objective value of the subproblem on the $(i,j)$ plane as the following function of a single variable $v$, where $N = \{t \mid 1 \le t \le l,\ t \ne i,\ t \ne j\}$ are the indices of the fixed variables:

$$\begin{aligned}
g(v) \equiv\ & \frac{1}{2}
\begin{bmatrix} \alpha_i^{(*),k} + v & \alpha_j^{(*),k} - v \end{bmatrix}
\begin{bmatrix} Q_{ii} & Q_{ij} \\ Q_{ji} & Q_{jj} \end{bmatrix}
\begin{bmatrix} \alpha_i^{(*),k} + v \\ \alpha_j^{(*),k} - v \end{bmatrix} \\
&+ \left(Q_{i,N}\left(\alpha^{k} - \alpha^{*,k}\right)_N + \varepsilon - z_i\right)\left(\alpha_i^{(*),k} + v\right)
+ \left(Q_{j,N}\left(\alpha^{k} - \alpha^{*,k}\right)_N + \varepsilon - z_j\right)\left(\alpha_j^{(*),k} - v\right) \\
=\ & \frac{1}{2}\left(Q_{ii} - 2Q_{ij} + Q_{jj}\right)v^2 + \left(\nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j\right)v + \text{constant}.
\end{aligned}$$

Since $\alpha_i^{(*)}$ is increased from $\alpha_i^{(*),k}$ to $\bar\alpha_i^{(*),k}$, we know

$$g'(0) = \nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j < 0.$$
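The algebra behind the single-variable form of $g$ can be checked numerically. The sketch below is illustrative only: a random positive semidefinite $2 \times 2$ block stands in for the $(i,j)$ submatrix of $Q$, and a generic linear term `p` stands in for the fixed-variable contributions; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
Q = A @ A.T                          # PSD 2x2 block [[Qii, Qij], [Qji, Qjj]]
p = rng.normal(size=2)               # stands in for the fixed-variable linear terms
a = rng.uniform(0.0, 1.0, size=2)    # current (alpha_i, alpha_j)

def g(v):
    # objective restricted to the (i, j) plane: move along (alpha_i + v, alpha_j - v)
    x = np.array([a[0] + v, a[1] - v])
    return 0.5 * x @ Q @ x + p @ x

grad = Q @ a + p                     # gradient of the restricted objective at v = 0
quad = 0.5 * (Q[0, 0] - 2 * Q[0, 1] + Q[1, 1])
lin = grad[0] - grad[1]              # this is g'(0)

# g(v) - g(0) should equal quad*v^2 + lin*v exactly (up to rounding)
residuals = [abs((g(v) - g(0)) - (quad * v * v + lin * v)) for v in (0.3, -0.7, 1.5)]
```

The positive-semidefiniteness of `Q` also makes the quadratic coefficient nonnegative, matching the remark that $Q_{ii} - 2Q_{ij} + Q_{jj} \ge 0$.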
Now if $(\bar\alpha_i^{(*),k}, \bar\alpha_j^{(*),k}, 0, 0)$ is not an optimal solution of equation 1.2, we can define a new function $\bar g(v)$, similar to $g(v)$, at $(\bar\alpha_i^{(*),k}, 0)$ of the $(i, j+l)$ plane. If $\bar\alpha_i^{(*),k}$ can be further increased,

$$\bar g'(0) = \nabla f(\bar\alpha^{(*),k})_i + \nabla f(\bar\alpha^{(*),k})_{j+l} < 0. \tag{3.3}$$

However, from equation 3.1 and $\bar\alpha_i^{(*),k} - \alpha_i^{(*),k} = -(\bar\alpha_j^{(*),k} - \alpha_j^{(*),k})$,

$$\begin{aligned}
\nabla f(\bar\alpha^{(*),k})_i + \nabla f(\bar\alpha^{(*),k})_{j+l}
&= \nabla f(\bar\alpha^{(*),k})_i - \nabla f(\bar\alpha^{(*),k})_j + 2\varepsilon \\
&= \nabla f(\alpha^{(*),k})_i + Q_{ii}\left(\bar\alpha_i^{(*),k} - \alpha_i^{(*),k}\right) + Q_{ij}\left(\bar\alpha_j^{(*),k} - \alpha_j^{(*),k}\right) \\
&\quad - \left(\nabla f(\alpha^{(*),k})_j + Q_{ji}\left(\bar\alpha_i^{(*),k} - \alpha_i^{(*),k}\right) + Q_{jj}\left(\bar\alpha_j^{(*),k} - \alpha_j^{(*),k}\right)\right) + 2\varepsilon \\
&= \left(\nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j\right) + \left(\bar\alpha_i^{(*),k} - \alpha_i^{(*),k}\right)\left(Q_{ii} - 2Q_{ij} + Q_{jj}\right) + 2\varepsilon.
\end{aligned} \tag{3.4}$$

Since $Q$ is positive semidefinite, $Q_{ii}Q_{jj} - Q_{ij}^2 \ge 0$ implies $Q_{ii} - 2Q_{ij} + Q_{jj} \ge 0$. With $\bar\alpha_i^{(*),k} - \alpha_i^{(*),k} \ge 0$ and equation 3.3, we know that if

$$\left(\nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j\right) + \left(\bar\alpha_i^{(*),k} - \alpha_i^{(*),k}\right)\left(Q_{ii} - 2Q_{ij} + Q_{jj}\right) + 2\varepsilon \ge 0, \tag{3.5}$$

it is impossible to move $(\bar\alpha_i^{(*),k}, 0)$ further on the $(i, j+l)$ plane. That is, $(\bar\alpha_i^{(*),k}, \bar\alpha_j^{(*),k}, 0, 0)$ is already an optimal solution of equation 1.2. For the other cases, that is, $(i, j+l)$, $(i+l, j)$, and $(i+l, j+l)$, the results are the same. Note that now $1 \le i, j \le l$, so $\nabla f(\alpha^{(*),k})_i - \nabla f(\alpha^{(*),k})_j$ is actually the value obtained in equation 2.11; that is, it is the number used for checking the stopping criterion. Therefore, we have the following theorem:

Theorem 2. For all iterations in which the violation of the stopping criterion, equation 2.11, is no more than $2\varepsilon$, an optimal solution of the two-variable subproblem is already an optimal solution of the corresponding four-variable subproblem.
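The gradient identity at the heart of equation 3.4 (before the $2\varepsilon$ shift contributed by equation 3.1) can be verified with a small numeric sketch; the quadratic $f(\alpha) = \frac{1}{2}\alpha^\top Q \alpha + p^\top \alpha$ and all names below are illustrative stand-ins, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, n))
Q = A @ A.T                          # positive semidefinite, as in the text
p = rng.normal(size=n)
alpha = rng.uniform(0.0, 1.0, size=n)
i, j, t = 1, 3, 0.4                  # two-variable update: alpha_i += t, alpha_j -= t

alpha_bar = alpha.copy()
alpha_bar[i] += t
alpha_bar[j] -= t

g = Q @ alpha + p                    # gradient of f at alpha
gb = Q @ alpha_bar + p               # gradient of f at alpha-bar
lhs = gb[i] - gb[j]
rhs = (g[i] - g[j]) + t * (Q[i, i] - 2 * Q[i, j] + Q[j, j])
gap = abs(lhs - rhs)                 # should vanish: the update moves only i and j
```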
If $\varepsilon$ is not small, the stopping tolerance in most iterations is smaller than $2\varepsilon$. In addition, as most decomposition iterations are spent in the final stage due to slow convergence, this theorem shows conclusively that no matter whether the two-variable or the four-variable approach is used, the difference in the number of iterations should not be large.

For a larger working set ($q > 2$), we may not be able to obtain results as elegant as Theorem 2. When $q = 2$, for example, we know exactly the relation between the changes of $\alpha_i^{(*),k}$ and $\alpha_j^{(*),k}$ in one iteration, as $\bar\alpha_i^{(*),k} - \alpha_i^{(*),k} = -(\bar\alpha_j^{(*),k} - \alpha_j^{(*),k})$. However, when $q > 2$, the change in each variable can be different. In the following, we show that if $\{\alpha^{(*),k}\}$ is an infinite sequence, then after $k$ is large enough, solving the subproblem of the base set is the same as solving the larger subproblem of the pair set.

Next, we describe some properties that will be used in the proof. Assume that the sequence $\{\alpha^{(*),k}\}$ of the base approach converges to an optimal solution $\hat\alpha^{(*)}$. Then we can define

$$\hat M \equiv M(\hat\alpha^{(*)}) \quad \text{and} \quad \hat m \equiv m(\hat\alpha^{(*)}). \tag{3.6}$$
We also note that equation 2.9 implies that, for any index $1 \le i \le 2l$ in the working set of the $k$th iteration,

$$M^k \le -y_i \nabla f(\alpha^{(*),k})_i \le m^k. \tag{3.7}$$
Now we describe two theorems from Lin (2001b) that are needed for the main proof. These theorems deal with a general framework of decomposition methods for different SVM formulations. We can easily check that the base approach satisfies the required conditions of these two theorems, so they can be applied:

Theorem 3.

$$\lim_{k \to \infty} m^k - M^k = 0. \tag{3.8}$$

Theorem 4. For any $\hat\alpha_i^{(*)}$, $1 \le i \le 2l$, whose corresponding $-y_i \nabla f(\hat\alpha^{(*)})_i$ is neither $\hat m$ nor $\hat M$: after $k$ is large enough, $\alpha_i^{(*),k}$ is at a bound and is equal to $\hat\alpha_i^{(*)}$.
Immediately, we have a corollary of Theorem 3 that is specific to SVR:

Corollary 1. After $k$ is large enough, for all $i = 1, 2, \ldots, l$, $\alpha_i^{(*),k}$ and $\alpha_{i+l}^{(*),k}$ would not both be selected in the base working set.
Proof. By the convergence of $m^k - M^k$ to 0, after $k$ is large enough, $m^k - M^k < \varepsilon$. If there exists $1 \le i \le l$ such that $\alpha_i^{(*),k}$ and $\alpha_{i+l}^{(*),k}$ are both selected in the base working set, then from equation 3.7, $M^k \le -\nabla f(\alpha^{(*),k})_i \le m^k$ and $M^k \le \nabla f(\alpha^{(*),k})_{i+l} \le m^k$. However, equation 3.1 shows that $\nabla f(\alpha^{(*),k})_{i+l} = -\nabla f(\alpha^{(*),k})_i + 2\varepsilon$, so $m^k - M^k \ge 2\varepsilon$, and there is a contradiction.

Next, we describe the main proof of this section, which is an analysis of the infinite sequence $\{\alpha^{(*),k}\}$.

Theorem 5. Assume that $\hat M \ne \hat m + 2\varepsilon$. After $k$ is large enough, the solution of any optimization subproblem of the base set is already optimal for the corresponding larger subproblem of the pair set.
Proof. If the result is wrong, there is an index $1 \le i \le l$ and an infinite set $\mathcal{K}$ such that for all $k \in \mathcal{K}$, $\alpha_i^{(*),k}$ (or $\alpha_{i+l}^{(*),k}$) is selected in the working set, but $\alpha_{i+l}^{(*),k}$ (or $\alpha_i^{(*),k}$) is also modified. Without loss of generality, we assume that $\alpha_i^{(*),k}$ is selected in the working set but $\alpha_{i+l}^{(*),k}$ is modified an infinite number of times. So by Theorem 4,

$$\nabla f(\hat\alpha^{(*)})_{i+l} = \hat m \quad \text{or} \quad \nabla f(\hat\alpha^{(*)})_{i+l} = \hat M.$$

By equation 3.1,

$$-\nabla f(\hat\alpha^{(*)})_i = \hat m - 2\varepsilon < \hat m, \quad \text{or} \quad -\nabla f(\hat\alpha^{(*)})_i = \hat M - 2\varepsilon < \hat M.$$

For the second case, by the assumption that $\hat m \ne \hat M - 2\varepsilon$, we have $\hat m > -\nabla f(\hat\alpha^{(*)})_i$ or $\hat m < -\nabla f(\hat\alpha^{(*)})_i$. But if $\hat m < -\nabla f(\hat\alpha^{(*)})_i$, we have $\hat m < -\nabla f(\hat\alpha^{(*)})_i < \hat M$, which is impossible for an optimal solution. Hence,

$$-y_i \nabla f(\hat\alpha^{(*)})_i < \hat m \tag{3.9}$$

holds for both cases. Therefore, we can define

$$\Delta \equiv \min\left(\varepsilon/2,\ \left(\hat m - \left(-y_i \nabla f(\hat\alpha^{(*)})_i\right)\right)/3\right) > 0.$$

By the convergence of the sequence $\{-y_j \nabla f(\alpha^{(*),k})_j\}$ to $-y_j \nabla f(\hat\alpha^{(*)})_j$ for all $j = 1, \ldots, 2l$, after $k$ is large enough,

$$\left|y_j \nabla f(\alpha^{(*),k})_j - y_j \nabla f(\alpha^{(*),k+1})_j\right| \le \Delta \tag{3.10}$$

and

$$\left|y_j \nabla f(\alpha^{(*),k})_j - y_j \nabla f(\hat\alpha^{(*)})_j\right| \le \Delta. \tag{3.11}$$

Suppose that at the $k$th iteration, $j \in \arg M^k$ is selected in the working set and $-y_j \nabla f(\alpha^{(*),k})_j = M^k$. By equations 3.7, 3.10, 3.11, and 3.9,

$$\begin{aligned}
-y_j \nabla f(\hat\alpha^{(*)})_j &\le -y_j \nabla f(\alpha^{(*),k})_j + \Delta = M^k + \Delta \\
&\le -y_i \nabla f(\alpha^{(*),k})_i + \Delta \le -y_i \nabla f(\hat\alpha^{(*)})_i + 2\Delta \\
&\le -y_i \nabla f(\hat\alpha^{(*)})_i + 2\left(\hat m - \left(-y_i \nabla f(\hat\alpha^{(*)})_i\right)\right)/3 < \hat m \le \hat M.
\end{aligned} \tag{3.12}$$

From Theorem 4 and equation 3.12, after $k$ is large enough, $\alpha_j^{(*),k}$ is at a bound and is equal to $\hat\alpha_j^{(*)}$. That is, $\alpha_j^{(*),k} = \alpha_j^{(*),k+1} = \hat\alpha_j^{(*)}$. Since $\alpha_j^{(*),k+1} = \alpha_j^{(*),k}$ is in the candidate set of $M^{k+1}$, by equations 3.7, 3.10, and 3.11,

$$\begin{aligned}
M^{k+1} &\le -y_j \nabla f(\alpha^{(*),k+1})_j \le -y_j \nabla f(\alpha^{(*),k})_j + \Delta = M^k + \Delta \\
&\le -y_i \nabla f(\alpha^{(*),k})_i + \Delta \le -y_i \nabla f(\alpha^{(*),k+1})_i + 2\Delta.
\end{aligned}$$

Hence, since $\Delta \le \varepsilon/2$, we get

$$M^{k+1} \le -y_i \nabla f(\alpha^{(*),k+1})_i + \varepsilon. \tag{3.13}$$

On the other hand, $\alpha_{i+l}^{(*),k}$ is modified to $\alpha_{i+l}^{(*),k+1}$, so at least one of them is strictly positive. By the definition of $m^k$,

$$\nabla f(\alpha^{(*),k})_{i+l} \le m^k \quad \text{or} \quad \nabla f(\alpha^{(*),k+1})_{i+l} \le m^{k+1}. \tag{3.14}$$

From equations 3.1, 3.7, 3.13, and 3.14, for all large enough $k \in \mathcal{K}$,

$$M^k \le m^k - 2\varepsilon \quad \text{or} \quad M^{k+1} \le m^{k+1} - \varepsilon.$$

Therefore,

$$\lim_{k \to \infty} m^k - M^k \ne 0,$$

which contradicts Theorem 3.

4 Experiments
We consider two regression problems, abalone (4177 data points) and add10 (9792 data points), from Blake and Merz (1998) and Friedman (1988), respectively. The RBF kernel is used:

$$Q_{ij} \equiv e^{-\gamma \|x_i - x_j\|^2}.$$

Since our purpose is not to examine the quality of the solutions, we do not perform model selection on the value of $\gamma$. Instead, we fix it to $1/n$, where $n$ is the number of attributes in each data set; for these two problems, $n$ is eight and ten, respectively. Based on our experience, this is an appropriate value when data are scaled to $[-1, 1]$.

Tables 1 and 2 present results using different $\varepsilon$ and $C$ on the two problems. We consider $\varepsilon$ no smaller than 0.1 because for smaller $\varepsilon$, the number of support vectors approaches the number of training data. On the other hand, we consider $\varepsilon$ up to 10, where the number of support vectors is close to zero. For each parameter set, we present the number of iterations for both the two-variable and four-variable approaches, the number of support vectors, the number of iterations that violate the conditions of Theorem 2 (i.e., possible candidates of jumps), and the number of real jumps, as illustrated in Figure 1, when using the four-variable approach. The solution of the four-variable approach is obtained as follows. First, a two-variable problem obtained from equation 2.8 is solved. If at least one variable goes to zero, another two-variable problem has to be solved. As indicated in Figure 1, at most three two-variable problems are needed. For our experiments, both versions of the code are directly modified from LIBSVM (version 2.03).

Table 1: Problem abalone.

Parameters           Iteration      Iteration      SV     Candidates^a   Jumps^b
                     (2-variable)   (4-variable)
C = 10,  ε = 0.1     18,930         18,790         3967   8438           14
C = 10,  ε = 1       10,173         10,173         2183   1705           0
C = 10,  ε = 10      17             17             7      0              0
C = 100, ε = 0.1     142,190        140,686        3938   63,057         207
C = 100, ε = 1       122,038        119,981        2147   10,187         48
C = 100, ε = 10      261            261            9      0              0

Notes: ^a The number of iterations that violate the conditions of Theorem 2. ^b The number of plane changes, as illustrated in Figure 1.

Table 2: Problem add10.

Parameters           Iteration      Iteration      SV     Candidates     Jumps
                     (2-variable)   (4-variable)
C = 10,  ε = 0.1     28,625         28,955         9254   13,918         1
C = 10,  ε = 1       22,067         21,913         4997   2795           1
C = 10,  ε = 10      116            116            14     0              0
C = 100, ε = 0.1     350,344        350,325        9158   109,644        12
C = 100, ε = 1       227,604        227,604        4265   14,628         0
C = 100, ε = 10      105            105            11     0              0

It can be clearly seen that both approaches take nearly the same number of iterations. In addition, the number of jumps while using the four-variable
approach is very small, especially when $\varepsilon$ is larger. Furthermore, the number of "candidates" is much larger than the number of real jumps. This means that $(\bar\alpha_i^{(*),k} - \alpha_i^{(*),k})(Q_{ii} - 2Q_{ij} + Q_{jj})$ in equation 3.4 is large enough that equation 3.5 is usually satisfied.

Next, we experiment with larger working sets. We use a simple implementation written in MATLAB, so only small problems are tested. We consider the first 200 data points of abalone and add10. Results are in Tables 3 and 4, where we show the number of iterations and the number of added variables that are modified (NAVM) for the pair approach. Note that we use "number of added variables that are modified" instead of "number of jumps" since for larger subproblems, we cannot model the change of variables as jumps between planes.

The NAVM column is defined as follows. If the pair approach is used, the working set in the $k$th iteration is the union of two sets: $B^k$, the base set, and its extension, $\bar B^k$. We check the values of the variables in $\alpha^{(*)}_{\bar B^k}$ before and after solving the subproblem. NAVM is the total count of modified variables in $\alpha^{(*)}_{\bar B^k}$ over all iterations, so it is at most the sum of $|\bar B^k|$ over all iterations, which is roughly the number of iterations multiplied by the maximal size of the base set, $q$.

In Tables 3 and 4, the NAVM column is relatively small compared to the number of iterations multiplied by $q$. That means that in nearly all the optimization steps, only variables corresponding to the base set, rather than the extended set, are changed. In addition, from Tables 1 through 4, we find that the pair approach may not lead to fewer iterations, so it is not necessary to use the pair approach for solving SVR.

Table 3: Problem abalone (first 200 data points).

                     q = 10                              q = 20
Parameters           Iteration  Iteration  NAVM^a       Iteration  Iteration  NAVM^a
                     (base)     (pairs)                 (base)     (pairs)
C = 10,  ε = 0.1     132        173        77           100        114        111
C = 10,  ε = 1       49         46         3            42         40         10
C = 10,  ε = 10      1          1          0            1          1          0
C = 100, ε = 0.1     1641       1983       184          1168       1086       250
C = 100, ε = 1       807        552        29           258        310        46
C = 100, ε = 10      1          1          0            1          1          0

Note: ^a Number of added variables that are modified for the pair approach.
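The NAVM bookkeeping defined above amounts to counting, at each iteration, the extension-set variables whose values change when the subproblem is solved. A minimal sketch (the list representation, function name, and tolerance are hypothetical, not the paper's MATLAB code):

```python
def navm_increment(before, after, ext_idx, tol=1e-12):
    """Count how many extension-set variables changed in one iteration.

    before / after: alpha values before and after solving the subproblem;
    ext_idx: indices of the extension set (B-bar).
    """
    return sum(1 for t in ext_idx if abs(after[t] - before[t]) > tol)

# Summing the per-iteration counts gives the NAVM column.
navm = 0
for before, after, ext in [([0.0, 0.2, 0.0], [0.1, 0.2, 0.0], [1, 2]),
                           ([0.1, 0.2, 0.0], [0.1, 0.3, 0.4], [1, 2])]:
    navm += navm_increment(before, after, ext)
# first toy iteration changes no extension variable; the second changes two
```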
1280
Shuo-Peng Liao, Hsuan-Tien Lin, and Chih-Jen Lin
Table 4: Problem add10 (first 200 data points).

                     q = 10                              q = 20
Parameters           Iteration  Iteration  NAVM^a       Iteration  Iteration  NAVM^a
                     (base)     (pairs)                 (base)     (pairs)
C = 10,  ε = 0.1     59         50         13           27         26         33
C = 10,  ε = 1       55         60         0            18         18         0
C = 10,  ε = 10      2          2          0            2          2          0
C = 100, ε = 0.1     2519       2094       112          1317       1216       135
C = 100, ε = 1       944        956        12           236        278        28
C = 100, ε = 10      2          2          0            2          2          0

Note: ^a Number of added variables that are modified for the pair approach.
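As a sketch of the kernel setup used in these experiments (the random data below are stand-ins for abalone or add10, not the actual data sets), the RBF kernel matrix with the $\gamma = 1/n$ heuristic can be formed as:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 8))   # e.g., 200 points, 8 attributes, scaled to [-1, 1]
gamma = 1.0 / X.shape[1]                    # the gamma = 1/n choice from the text

# Q_ij = exp(-gamma * ||x_i - x_j||^2)
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Q = np.exp(-gamma * sq_dist)
```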
5 Conclusion and Discussion
From our theoretical proofs, we show that in the final iterations of the decomposition methods, the solution of the subproblem for the base approach is the same as that for the pair approach. This means that extending the base working set to the pair set does not help much, so it is not necessary to use the pair method. Experiments confirmed our analysis: the difference between the numbers of iterations of the two approaches is negligible. Moreover, the pair approach solves a larger optimization subproblem in each iteration, which costs more time, so a program using the base approach is more efficient.

We mentioned in equation 2.2 that the regression problem can be reformulated to have the same structure as the classification problem, so if we solve SVR using the base approach, it is possible to use the same program for classification with little modification; LIBSVM, for example, uses this strategy. However, if equation 2.2 is applied directly, without exploiting as many regression properties as possible, our experience shows that the performance may be a little worse than that of software specially designed for regression.

Acknowledgments
This work was supported in part by the National Science Council of Taiwan, grant NSC 89-2213-E-002-106.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases (Tech. Rep.). Irvine: University of California, Department of Information and Computer Science. Available on-line: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines [Computer software]. Available on-line: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Collobert, R., & Bengio, S. (2001). SVMTorch: A support vector machine for large-scale regression and classification problems. Journal of Machine Learning Research, 1, 143–160. Available on-line: http://www.idiap.ch/learning/SVMTorch.html.

Flake, G. W., & Lawrence, S. (2002). Efficient SVM regression training with SMO. Machine Learning, 46, 271–290.

Friedman, J. (1988). Multivariate adaptive regression splines (Tech. Rep. No. 102). Stanford, CA: Laboratory for Computational Statistics, Department of Statistics, Stanford University.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2000). Improvements to SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11, 1188–1193.

Keerthi, S. S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13, 637–649.

Laskov, P. (2002). An improved decomposition algorithm for regression support vector machines. Machine Learning, 46, 315–350.

Lin, C.-J. (2001a). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12, 1288–1298.

Lin, C.-J. (2001b). Stopping criteria of decomposition methods for support vector machines: A theoretical justification (Tech. Rep.). Taipei, Taiwan: Department of Computer Science and Information Engineering, National Taiwan University. To appear in IEEE Transactions on Neural Networks.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Rüping, S. (2000). mySVM: Another one of those support vector machines [Computer software]. Available on-line: http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/.

Smola, A. J., & Schölkopf, B. (1998). A tutorial on support vector regression (NeuroCOLT Tech. Rep. TR-1998-030). Egham, Surrey: Royal Holloway College.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Received January 4, 2001; accepted October 10, 2001.
LETTER
Communicated by Thomas Bartol
An Image Analysis Algorithm for Dendritic Spines

Ingrid Y. Y. Koh
[email protected]
Department of Applied Mathematics and Statistics, State University of New York at Stony Brook, Stony Brook, NY 11794-3600, U.S.A., and Howard Hughes Medical Institute and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.

W. Brent Lindquist
[email protected]
Department of Applied Mathematics and Statistics, State University of New York at Stony Brook, Stony Brook, NY 11794-3600, U.S.A.

Karen Zito
[email protected]
Howard Hughes Medical Institute and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.

Esther A. Nimchinsky
[email protected]
Howard Hughes Medical Institute and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.

Karel Svoboda
[email protected]
Howard Hughes Medical Institute and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.

Neural Computation 14, 1283–1310 (2002) © 2002 Massachusetts Institute of Technology

The structure of neuronal dendrites and their spines underlies the connectivity of neural networks. Dendrites, spines, and their dynamics are shaped by genetic programs as well as by sensory experience. Dendritic structures and dynamics may therefore be important predictors of the function of neural networks. Based on new imaging approaches and increases in the speed of computation, it has become possible to acquire large sets of high-resolution optical micrographs of neuron structure at length scales small enough to resolve spines. This advance in data acquisition has not been accompanied by comparable advances in data analysis techniques; the analysis of dendritic and spine morphology is still accomplished largely manually. In addition to being extremely time intensive, manual analysis also introduces systematic and hard-to-characterize biases. We present a geometric approach for automatically detecting and quantifying the three-dimensional structure of dendritic spines from stacks of image data acquired using laser scanning microscopy. We present results on the measurement of dendritic spine length, volume, density, and shape classification for both static and time-lapse images of dendrites of hippocampal pyramidal neurons. For spine length and density, the automated measurements in static images are compared with manual measurements. Comparisons are also made between automated and manual spine length measurements for a time-series data set. The algorithm performs well compared to a human analyzer, especially on time-series data. Automated analysis of dendritic spine morphology will enable objective analysis of large morphological data sets. The approaches presented here are generalizable to other aspects of neuronal morphology.

1 Introduction
Recent experiments have revealed that important aspects of cognitive function, such as experience-dependent plasticity (Engert & Bonhoeffer, 1999; Maletic-Savatic, Malinow, & Svoboda, 1999; Toni, Buchs, Nikonenko, Bron, & Muller, 1999; Lendvai, Stern, Chen, & Svoboda, 2000), neural integration (Yuste & Denk, 1995; Wearne, Straka, & Baker, 2000), and learning and memory (Moser, Trommald, & Anderson, 1994), are correlated with variations in dendritic branching morphology and with spine density and distribution. Similarly, age-related deficits in short-term memory (Duan, He, Wicinski, Morrison, & Hof, 2000), important forms of neural dysfunction (Horner, 1993), and mental retardation (Purpura, 1974) have been localized, in part, to dendrites and spines. These findings have motivated extensive efforts to obtain quantitative descriptions of dendritic and spine morphologies, both statically and dynamically.

Due to its superior resolution in revealing ultrastructures at synaptic junctions, serial section electron microscopy (SSEM) has been used (Fiala, Feinberg, Popov, & Harris, 1998; Toni et al., 1999) to quantify dendritic spine structures in three dimensions (3D). This is, however, a nonvital form of observation and an extremely labor-intensive histological approach, requiring manual or semiautomatic registration and outlining of the structures on each serial section (Carlbom, Terzopoulos, & Harris, 1994; Spacek, 1994). Modern fluorescence microscopy methods, such as confocal laser scanning microscopy (CLSM) (Pawley, 1995) and two-photon excitation laser scanning microscopy (2PLSM) (Denk, Strickler, & Webb, 1990; Denk & Svoboda, 1997; Svoboda, Denk, Kleinfeld, & Tank, 1997), offer many advantages over SSEM, at the expense of reduced resolution. Sectioning is achieved by limiting the detection (CLSM) or excitation (2PLSM) of fluorescence to a sub-femtoliter focal volume. Optical imaging is rapid and noninvasive. The exquisite selectivity of fluorescence allows the detection of even single molecules against a background of billions of others (Eigen & Rigler, 1994). Optical microscopy thus occupies a unique niche in biology due to its ability to perform observations in intact, living tissue at relatively high resolution.

The properties of fluorescence microscopy images are well understood. To image neuronal structure, neurons are labeled with fluorescent molecules that fill the cytoplasm homogeneously. Voxel values report the convolution of the density of fluorescent probes with the point-spread function (PSF) of the imaging system, which is essentially equal to the focal volume and is easily measured (Svoboda, Tank, & Denk, 1996). Studies of morphological plasticity based on CLSM (Moser et al., 1994; Hosokawa, Rusakov, Bliss, & Fine, 1995) and 2PLSM (Engert & Bonhoeffer, 1999; Maletic-Savatic et al., 1999; Lendvai et al., 2000) measurements of spine length and density have been described.

Despite these advances in modern imaging techniques, the analysis of neuronal structure has remained largely manual. The considerable time and effort required to extract spine measurements manually has precluded routine studies of large amounts of data. In addition, results are not precisely reproducible, as accuracy depends on the skill and habituation of the user. A few detection and estimation techniques (Rusakov & Stewart, 1995; Watzel, Braun, Hess, Scheich, & Zuschratter, 1995; Herzog et al., 1997; Kilborn & Potter, 1998) of varying degrees of automation have been suggested to overcome the tedium and improve the accuracy and reproducibility of the results, none of which has apparently been used and verified on large data sets.
Rusakov and Stewart (1995) applied a medial axis construction (skeletonization) to 2D dendritic images to obtain spine length measurements in 2D and estimated the corresponding 3D measurements using a stereological sampling procedure. As the medial axis is sensitive to surface features, manual screening of the medial axis was required to select among spine, dendrite, and irregular surface-induced features (i.e., artifacts). Since measurements were based solely on the medial axis, no volumetric estimates were obtainable. Watzel et al. (1995) also suggested the detection of dendritic spines using medial axis-based identification. Their 3D algorithm was restricted to images containing a single dendrite. The dendrite backbone ("centerline") was extracted from the medial axis, and the remaining medial axis spurs branching off the backbone were used to identify candidate spines. A length tolerance was employed to distinguish true spines from artifact spurs. No further analysis beyond that for a single dendritic image was presented. Herzog et al. (1997) employed a 3D reconstruction technique using a parametric model of cylinders with hemispherical ends to fit the shape of the dendrites and the spines. In this method, short spines or spines with thin necks were hard to detect and had to be manually added to the model. An approach using neural network recognition of spines (Kilborn & Potter, 1998) has also been suggested.

In this article, we present an automated dendritic spine detection and analysis procedure appropriate for 3D images obtained using laser scanning microscopy. It is assumed that the image to be analyzed is of a biphase medium, with one phase being the neuronal cytoplasm (dendritic phase) and the other being the background tissue. The method uses a geometric approach, detecting spines as protrusions; it is highly automatic and contains only a few parameter settings. It can be applied to static images as well as to time-series data. There is no limitation on the number or the structure of the dendrites in the image. In addition to spine length and density, volumetric measurements and spine classifications are obtained.

The article is divided into the following sections. The algorithms developed to analyze 3D scanning microscopy images are described in section 2. The analysis consists of the following steps. An image is first processed by deconvolution, and the dendritic phase is extracted (section 2.1). The dendrites are identified via their backbones (section 2.2), which are extracted from a medial axis construction. Unlike the work of Rusakov and Stewart (1995) and Watzel et al. (1995), spines are not detected using the medial axis branches emerging from the backbone, as it is difficult to distinguish true spines from artifacts by this procedure. Instead, spines are detected as geometric protrusions relative to the backbone. Our geometric model and its algorithmic implementation are presented in section 2.3. For time-series data, in which the same dendritic branch is imaged over a sequence of time intervals, translational effects in time are corrected for, and individual spines are then traced (section 2.4) through the time-ordered sequence of images.
Finally, morphological characterizations of the population of detected spines are extracted (section 2.5). The imaging setup and the biological preparation used to obtain data for testing and verification of these algorithms are summarized in section 3. Results of the application of these algorithms to the analysis of hippocampal CA1 neurons and a small number of hippocampal CA3 neurons are presented in section 4. The accuracy of the automatic approach is assessed by comparison with manual spine detection results.
2 Image Analysis

2.1 Image Deconvolution and Segmentation. The intrinsic spatial resolution limits of optical microscopy arise from the diffraction of light; light from a point source is ideally imaged to a larger spot characterized by the Airy function. The measured spread resulting from a given optical setup is referred to as the PSF. As a result, the intensity recorded in any voxel (volume element) of a digitized image is a convolution of intensities from its neighborhood.
Deconvolution (Shaw, 1995; Holmes et al., 1995) is typically used to correct aspects of the image degradation due to the PSF. Deconvolution techniques (Lagendijk & Biemond, 1991) employ either theoretical (Gibson & Lanni, 1991) or experimental measures of the PSF. In addition, blind deconvolution methods (Holmes et al., 1995) can be employed that, concurrently with the deconvolution, reconstruct an estimated PSF of the image. However, the presence of noise and the band-limited nature of the PSF limit the improvements obtainable by means of classical deconvolution techniques. Therefore, some blurring will remain even after deconvolution, due to a trade-off between sharpening of the image and noise amplification.

In addition, the photomultiplier tube (PMT) detectors used in most laser scanning microscopes are noisy. Even in darkness, PMTs produce spontaneous bright pixels (shot noise). We deal with shot noise by applying a median filter (Tukey, 1971) to the image. The median filter is a nonlinear, low-pass filter that replaces the gray-scale value of each voxel $v$ in the digitized image by the median gray-scale value of $v$ and its 26 neighbors. This effectively removes shot noise but not real spines, which, under the typical magnifications employed, have an effective width covering many voxels.

We use the iterative reblurring deconvolution algorithm of Kawata and Ichioka (1980), which requires a theoretical or experimentally measured PSF. Briefly, iterative reblurring proceeds as follows. Let $o^{(0)}(x, y, z)$ denote the experimental 3D image and $h(x, y, z)$ an appropriate PSF. Let $*$ and $\star$ denote the convolution and correlation operators (Kawata & Ichioka, 1980). The deconvolved image $\hat o^{(k)}(x, y, z)$ in the $k$th iteration is

$$\hat o^{(k)} = \hat o^{(k-1)} + \left\{ o^{(0)} \star h - \hat o^{(k-1)} * (h \star h) \right\}.$$

A nonnegativity constraint is applied to $\hat o^{(k)}$ at the end of each iteration. For the images analyzed in section 4, the PSF was measured by imaging a number of subresolution microspheres (100 nm diameter) and averaging their individual PSFs to reduce noise. Figure 1 demonstrates the result of iterative reblurring applied to an example image. We typically employ $k_{\max} = 5$ iterations of deblurring.

Segmentation is a generic imaging term for labeling each voxel in a gray-scale (or color) image with an integer identifier designating its "population type." For dendritic morphometry, this requires distinguishing neuron voxels from background tissue voxels. A large number of segmentation algorithms are available (Pal & Pal, 1993). As our dendritic images are processed first by median filtering and deconvolution, we choose to use simple thresholding for this final segmentation step; all voxels of intensity greater than a threshold value are identified as neuron, otherwise as background. In general, a trade-off between selection of dim spines and reduction of noise on the dendrite surface is made in selecting the threshold.
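A one-dimensional sketch of the iterative reblurring update, with the nonnegativity constraint, is below; `np.correlate` and `np.convolve` stand in for the correlation and convolution operators, and the signal and PSF are toy examples, not the paper's measured PSF.

```python
import numpy as np

def reblur(o0, h, iters=5):
    """Kawata-Ichioka iterative reblurring in 1D, with the nonnegativity
    constraint applied at the end of each iteration."""
    o_hat = o0.copy()
    hh = np.correlate(h, h, mode="full")        # h (star) h, the reblurred PSF
    for _ in range(iters):
        o_hat = o_hat + (np.correlate(o0, h, mode="same")
                         - np.convolve(o_hat, hh, mode="same"))
        o_hat = np.maximum(o_hat, 0.0)          # nonnegativity constraint
    return o_hat

h = np.array([0.25, 0.5, 0.25])                 # toy blur kernel (PSF)
truth = np.array([0.0, 0.0, 1.0, 0.0, 0.0])     # point source
blurred = np.convolve(truth, h, mode="same")
deblurred = reblur(blurred, h)                  # sharper estimate of the point source
```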
1288
Ingrid Y. Y. Koh, et al.
Figure 1: The (a) raw image and the deblurred image after (b) 5 and (c) 20 iterations.
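The shot-noise removal step of section 2.1 (the median over each voxel and its 26 neighbors) can be sketched in plain NumPy; clipping the neighborhood at the volume boundary is an assumption the text does not specify, and the function name is hypothetical.

```python
import numpy as np

def median27(vol):
    """Replace each voxel by the median of itself and its (up to) 26 neighbors."""
    out = np.empty_like(vol)
    Z, Y, X = vol.shape
    for z in range(Z):
        for y in range(Y):
            for x in range(X):
                nb = vol[max(z - 1, 0):z + 2,
                         max(y - 1, 0):y + 2,
                         max(x - 1, 0):x + 2]
                out[z, y, x] = np.median(nb)
    return out

vol = np.zeros((5, 5, 5))
vol[2, 2, 2] = 100.0        # an isolated shot-noise voxel
filtered = median27(vol)    # the isolated spike is removed
```

An isolated bright voxel is a minority within its 27-voxel neighborhood and is therefore removed, while a thick structure (many voxels wide, as real spines are at typical magnifications) survives the filter.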
An Image Analysis Algorithm
1289
2.2 Dendritic Backbone Extraction. Geometric analysis of a 3D irregularly shaped object is difficult; such analyses typically employ models based on geometrically simple unit objects, such as the cylinders and hemispheres used in the technique of Herzog et al. (1997). Our approach uses the medial axis algorithm of Lee, Kashyap, and Chu (1994) to provide a skeleton from which the backbone of each dendrite in an image can be extracted. Major success in the application of this backbone extraction procedure has been achieved in the analysis of pore space geometry in rock (Lindquist & Venkatarangan, 1999; Lindquist, Venkatarangan, Dunsmuir, & Wong, 2000) and fiber mats (Yang, 2000).

Intuitively, the medial axis captures a geometrically faithful skeleton (consisting of curve segments joining at vertices) of an object. In a digitized image, these curve segments consist of linked sequences of voxels, with the vertices being voxels at which these segments join together. An example of the medial axis of a portion of a segmented dendritic image is shown in Figure 2a. (This is a view perpendicular to the optical axis.) The medial axis obtained for the dendritic phase contains the backbone (centerline) of each dendrite as a subset. In addition to the backbone, the medial axis contains spurs and other features that correspond to spine-related or non-spine-related surface features (e.g., incipient dendritic branches); to surface artifacts resulting from digitization effects, segmentation errors, and boundary effects due to the finite imaged volume; or to spurious cell debris. Due to resolution limits, spines emerging near each other may appear to have overlapping tips in the digitized image, resulting in the appearance of small loops in the medial axis (top of the parent branch in Figure 2a). A separate skeleton for each disconnected component of the dendritic phase is also contained in the medial axis (featured in Figure 2a).
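As a minimal illustration of what trimming short spurs from such a skeleton involves, the sketch below prunes short dead-end chains from a graph under an assumed adjacency-set representation; the function name, representation, and tolerance are hypothetical, and the paper's actual medial axis structures are richer.

```python
def trim_spurs(adj, min_keep):
    """Remove leaf chains shorter than min_keep voxels from a skeleton.

    adj: dict mapping voxel id -> set of neighboring voxel ids.
    Returns a new adjacency dict with short spurs pruned.
    """
    adj = {u: set(vs) for u, vs in adj.items()}
    changed = True
    while changed:
        changed = False
        for leaf in [u for u, vs in adj.items() if len(vs) == 1]:
            if leaf not in adj or len(adj[leaf]) != 1:
                continue                      # may have been pruned already
            chain, prev = [leaf], leaf
            cur = next(iter(adj[leaf]))
            while cur in adj and len(adj[cur]) == 2:
                nxt = next(v for v in adj[cur] if v != prev)
                chain.append(cur)
                prev, cur = cur, nxt
            if len(chain) < min_keep:         # short spur: delete the whole chain
                for u in chain:
                    for v in adj.pop(u, set()):
                        if v in adj:
                            adj[v].discard(u)
                changed = True
    return adj

# backbone 0-1-2-3-4 with a one-voxel spur (node 9) hanging off node 2
skel = {0: {1}, 1: {0, 2}, 2: {1, 3, 9}, 3: {2, 4}, 4: {3}, 9: {2}}
trimmed = trim_spurs(skel, min_keep=2)        # spur removed, backbone intact
```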
From the medial axis, we extract the backbone of each dendrite (see Figure 2b) in two steps. The first step removes the medial axis segments corresponding to all disconnected dendritic components and trims short spurs and loops on the medial axis. Removal of long spurs is problematic, as they may correspond to either filopodia or incipient branches of the dendrite; these long spurs are dealt with in the second step. In the second step, a backbone is traced through each dendritic branch, employing a minimum-deviation-angle decision whenever a vertex on the trimmed skeleton is encountered. If necessary, the number n of dendrites in the image can be specified so that only the n longest backbones are retained. Any remaining medial axis segment that is not part of a traced backbone is removed. The final set of dendritic backbones extracted from Figure 2a is shown in Figure 2b.

Ingrid Y. Y. Koh, et al.

Figure 2: (a) The medial axis from the segmented image of a dendritic image, with arrows indicating a loop (L) formed by overlapping spines, separate skeletons for disconnected features (D), and spurious cell debris (S). (b) The two backbones extracted from this medial axis.

2.3 Spine Detection. Dendritic spines are short (≈ 1 μm) protrusions of variable shape that have been characterized (Harris, Jensen, & Tsao, 1992) as falling into one of three types. The type names—stubby, thin, and mushroom—are intended to be descriptive of the geometrical shape of the spines in each class. Spines, however, exist in a continuum of shape variation over this classification range, and the boundaries established between these three classes are somewhat arbitrary. Thus, in designing an automated spine detection algorithm, we prefer not to search for elements based on prototypes from each class; rather, we prefer a spine model based on a generic protrusion that allows for all three types but rejects nonspine protrusions, such as incipient dendritic branches, as well as surface irregularities on the dendrite. Spine candidates are selected using a protrusion criterion (see equation 2.1). Spine detection is complicated by the fact that in mushroom-shaped spines, very thin necks may be too weak to be detectable; such spines appear as a detached head-base pair. The protrusion criterion should therefore detect such bases; a separate algorithm is required to detect detached heads. Thus, once the dendritic backbones are isolated, the spine detection algorithm proceeds in three steps: detection of detached spine heads, detection of attached spines (including bases), and merging of spine components. Since a spine may comprise one or more detached pieces and possibly an attached base in the segmented image, the identification of a spine is not finalized until all three steps have been completed. An optional polygon-based trimming of the image volume allows for elimination of regions of no interest.

2.3.1 Detached Spine Component Detection. Dendritic phase components disconnected from the backbone-containing dendrites are detected and tentatively identified as detached spine heads. For each detached component, a record is kept of its center of mass, the closest dendritic backbone voxel, and the dendritic surface voxel lying on the line joining the center of mass to the backbone voxel.
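Detecting detached spine heads requires grouping dendritic-phase voxels into disconnected pieces. A minimal sketch of 26-connected component labeling over a sparse voxel set (function names are illustrative, not from the authors' code):

```python
from itertools import product
from collections import deque

def connected_components(voxels):
    """Group a set of (x, y, z) voxels into 26-connected components,
    e.g., to find candidate detached spine heads."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        comp, queue = [seed], deque([seed])
        while queue:
            x, y, z = queue.popleft()
            # Visit all 26 neighbors of the current voxel.
            for dx, dy, dz in product((-1, 0, 1), repeat=3):
                w = (x + dx, y + dy, z + dz)
                if w in remaining:
                    remaining.remove(w)
                    comp.append(w)
                    queue.append(w)
        components.append(comp)
    return components

def center_of_mass(comp):
    """Unweighted center of mass of a voxel component."""
    n = len(comp)
    return tuple(sum(c[i] for c in comp) / n for i in range(3))
```

Each resulting component can then be tested against the distance tolerance to the nearest dendrite surface described below.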
Detached dendritic phase components that are farther from the nearest dendrite surface voxel than a maximum distance tolerance are interpreted as false positives and ignored. For the images analyzed in section 4, the length tolerance is 6 μm.

2.3.2 Attached Spine Component Detection. Ignoring the detached spine components, every remaining dendrite-phase voxel v is labeled with a distance, d_b(v), to its closest backbone voxel. (This is done using a 26-connected "grass-fire" (Leymarie & Levine, 1992) propagating through the dendrite outward from the backbone; d_b(v) equals the integer time step at which voxel v ignites.) Thus, tips of protrusions on the dendrite surface are identified as those voxels assigned the locally largest distances. These tip voxels are then processed in descending order of d_b. For each tip voxel S, a sequence {C_i}, i = 1, ..., d_b(S), of candidate spines is generated. Candidate C_i consists of all voxels w whose distance d_S(w) from S satisfies d_S(w) ≤ i. Figure 3a shows a 2D projected view of the candidates (gray voxels) C_1, ..., C_12 for a tip S having d_b = 12. The smaller candidates clearly contain too few voxels to represent the spine correctly, whereas the larger candidates protrude too far below the dendrite surface.

Figure 3: (a) Projected view of 12 spine candidates (shaded gray) with d_b(S) = 12. To reduce figure size, the segment of the dendrite to which the spine is attached is shown only with the first candidate, C_1. (b, c) Sketches of a spine candidate C_i illustrating the definition of geometrical quantities used in the spine detection algorithm.

The optimal choice of spine candidate would terminate at the surface of the dendrite. This is achieved by estimating the local thickness of the dendrite. To estimate the local dendrite thickness and choose the optimal candidate, a ring of spine-surface boundary points is determined for each candidate. For clarity of explanation, we first illustrate the algorithm in 2D, assuming the image is projected onto its focal plane; the 3D algorithm is described afterward. We use Figures 3b and 3c, which sketch a candidate C_i for a spine of stubby type, to define the geometrical quantities used in the algorithm and to illustrate how they change with the orientation angle of the spine. For each candidate C_i, two surface boundary voxels P^1_{C_i} and P^2_{C_i} and a base voxel E_{C_i} having the furthest penetration into the dendrite are determined. The surface points are used to determine the best measure of the dendrite thickness as follows. A reference candidate C_R is selected as the candidate of smallest volume having the minimum surface-to-backbone distance, d_p ≡ min_i { d_b(P^1_{C_i}), d_b(P^2_{C_i}) }. The reference candidate C_R and the distance d_p are illustrated in Figure 3b. In an ideal case, 2 d_p is the width of the dendrite, and the best candidate for the spine would be C_j, where j = d_b(S) − d_p + 1. In practice, spines and dendrite surfaces in the images are quite irregular. Therefore, to be accepted as a true spine, we require the candidate to satisfy a heuristic protrusion criterion. Let D^{C_i}_{S→P1P2} denote the perpendicular distance from S to the line segment P^1_{C_i} P^2_{C_i} of C_i, and let D^{C_i}_{E→P1P2} denote the perpendicular distance from the base voxel E_{C_i} to the same segment, as illustrated in Figure 3c. The spine candidate C_j is required to satisfy

    D^{C_i}_{S→P1P2} ≥ D^{C_i}_{E→P1P2}.    (2.1)

Criterion 2.1 clearly requires that favorable spine candidates protrude farther out from the dendrite surface than they protrude into it. The following argument additionally shows that the criterion favors spine-type protrusions. Consider the spine candidate C_i sketched in Figure 3c. Approximate the base of C_i by the arc of a circle of radius R, and let 2W denote the distance from P^1_{C_i} to P^2_{C_i}. Simple trigonometry gives

    D^{C_i}_{E→P1P2} = R cos θ − √(R² − W²),    (2.2)

    D^{C_i}_{S→P1P2} = √(R² − W²),    (2.3)

where 0 ≤ θ ≤ 90°. Substitution of equations 2.2 and 2.3 into criterion 2.1 gives

    2W / R ≤ √(4 − cos²θ).    (2.4)
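In 2D, criterion 2.1 reduces to comparing two point-to-line distances. The following sketch (coordinates and function names are illustrative) evaluates the criterion for a candidate given its tip S, base voxel E, and surface boundary points P1 and P2:

```python
import math

def perp_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def satisfies_protrusion_criterion(S, E, P1, P2):
    """Criterion 2.1: the tip S must protrude farther out of the dendrite
    than the base voxel E protrudes into it."""
    return perp_distance(S, P1, P2) >= perp_distance(E, P1, P2)
```

A candidate with a tall tip and shallow base passes; a shallow bump over a deeply penetrating base is rejected.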
The right-hand side of equation 2.4 varies over the range [√3, 2]. Thus, if we assume that R approximates the spine candidate length and 2W is an approximate measure of the spine candidate neck width, we see qualitatively that the protrusion criterion "naturally" accepts protrusions having neck width less than spine length. Thus, with the possible exception of stubby spines, the criterion accepts protrusions of spine type. Basing the decision (as to whether a protrusion is indeed a spine) on the single candidate C_j is too fragile, given the irregularity of true spines and the digitized nature of the images. To increase the recognition capability, we apply the protrusion criterion to a range of candidates C_i for which i is close to j. Specifically, we consider candidates having values of i in the range d_b(S) − d_p ≤ i ≤ d_b(S) − (d_p + d_e − 1)/2, where d_e ≡ d_b(E_{C_R}). If one of the candidates C_i with i in this range satisfies criterion 2.1, the candidate C_j is accepted as a true spine; otherwise, the protrusion is rejected. The 3D algorithm proceeds in a similar fashion, except that instead of the projected pair of surface boundary points, the entire ring of surface boundary points in 3D is considered. The minimum surface-point-to-backbone distance is used to find the reference candidate C_R, correcting for the local thickness of the dendrite. The distance from S (or E) to the ring of voxels is calculated as the perpendicular distance of S (or E) to the plane that best fits the ring of voxels in 3D.

2.3.3 Component Elimination. Spines touching the boundary of the imaged region are ignored, as they are incomplete. This is also a useful technique for eliminating debris and other axons or dendrites in the background of the image that are near or touching the dendrites of interest. Our algorithms are written to allow for the imposition of one or more nonoverlapping polygonal areas on the plane of the image slices.
The interior of the union of these polygons is regarded as the region of interest for the spine detection algorithm; any structure exterior to the polygons is ignored. By setting the polygonal edges to cross through unwanted structures, those structures are also automatically ignored. As mentioned, detached components farther from the dendrite surface than a maximum distance are also eliminated.

2.3.4 Component Merging. As a spine may be identified from multiple detached head and attached base components, a final merging algorithm that accounts for the position and orientation of all possible spine pieces is performed. The merging algorithm considers every component, checking for possible merges with other components. Any merged entity is reconsidered as a new single component and rechecked for possible further merges. Merging can occur between two detached (DD) components (the merged entity is still considered a detached component) or between detached and attached (DA) components (the merged entity is then considered an attached component). We employ two criteria for DD or DA type merging.

Figure 4: Illustration in 2D of the orientation criterion used to determine whether detached (D) or attached (A) spine components should be merged. (Top) For DA merging, the tip S must lie within the triangle D P1 P2, where D is the center of mass of the detached component. (Bottom) Detached components D1 and D2 are merged if the angles θ1 and θ2 subtended by each component with the two surface voxels S1 and S2 satisfy (θ1 + θ2)/2 ≤ 30°.

The first criterion is maximum separation: the two components to be merged are required to be close enough (a center-of-mass to center-of-mass separation ≤ 3 μm). The second criterion requires an appropriate relative orientation of the two components, as illustrated in 2D in Figure 4. For DA-type merging, the tip S of the attached component A is required to lie within the triangle D P1 P2. (In 3D, the tip S is required to lie within the cone determined by D and the ring of spine-surface boundary points.) For DD-type merging, the average angle subtended by the center of mass of each spine with the surface voxel locations of both spines is required to be less than 30 degrees.
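The 2D DA-merge orientation test amounts to a point-in-triangle check. A minimal sketch using signed cross products (a standard geometric technique; the point names mirror Figure 4 but the implementation is our own illustration):

```python
def cross(o, a, b):
    """z-component of the 2D cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def tip_inside_triangle(S, D, P1, P2):
    """DA-merge orientation test in 2D: the tip S of the attached component
    must lie inside the triangle formed by the detached component's center
    of mass D and the surface boundary points P1, P2."""
    d1, d2, d3 = cross(D, P1, S), cross(P1, P2, S), cross(P2, D, S)
    has_neg = min(d1, d2, d3) < 0
    has_pos = max(d1, d2, d3) > 0
    # S is inside (or on an edge) iff the signs do not mix.
    return not (has_neg and has_pos)
```

In 3D, the analogous test replaces the triangle with the cone determined by D and the ring of boundary points.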
2.4 Image Registration and Spine Tracing. A time sequence of 3D images must be registered to correct for possible translational movement of the specimen. After registration, individual spines are traced and identified through the image sequence. Each consecutive pair of images F_i and F_{i+1} is co-registered using the spines separately identified in each image. The offset vector o = (o_x, o_y, o_z) of F_{i+1} with respect to F_i is allowed to vary within a window |o_x| ≤ w_x, |o_y| ≤ w_y, |o_z| ≤ w_z. Only integer voxel offsets are considered. A common registration method (Pratt, 1991, pp. 651–673) maximizes the cross-correlation of the two images; no decision can then be made until the correlation arrays have been computed for all offsets. Instead, we use an efficient sequential search method (Barnea & Silverman, 1972) that computes the l1-norm (absolute-value sum) image difference

    e(o) = Σ_{i,j,k} |F_i(i, j, k) − F_{i+1}(i − o_x, j − o_y, k − o_z)|

over all offsets o in the window for which e(o) is less than a predetermined threshold value T. The offset o with minimum e(o) provides the optimal registration. In practice, w_x = w_y = w_z = 5 voxels, and T is the average of the total number of spine voxels in F_i and F_{i+1}. Individual spines are traced through the time series. Two spines at different times are considered to be the same if their overlap (measured in voxels) is larger than 25% of the volume of at least one of them.

2.5 Morphological Characterization. Currently, spine length, density, and volume are computed. Spines are also classified according to their shape. For a detached spine (one without any attached component), the spine length is determined by the distance from the recorded associated dendrite surface voxel to the spine voxel farthest from the dendrite. For spines that are fully or partially attached to the dendrite (consisting of a base and possibly one or more detached components), the spine length is determined by the distance from the center of mass of the base boundary points to the farthest spine voxel (possibly detached from the dendrite). For the images analyzed in section 4, the automated spine length measurement is calculated from a 2D projection, because the manual spine analysis measurements against which the automatic results are compared are performed in 2D by projecting the 3D stack of image slices along the optical direction. Spine density is computed as the number of spines per unit length of dendritic backbone. For purposes of comparison with manually analyzed images, which are analyzed in 2D projection only, backbone length is also measured from a 2D projection onto the slice plane.
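The integer-offset l1-norm search of section 2.4 can be sketched as follows. This is a simplification, not the authors' implementation: shifts are periodic via np.roll (a real implementation would handle borders explicitly), and the early-rejection threshold T of Barnea and Silverman is omitted for brevity.

```python
import numpy as np

def register_l1(F0, F1, w=5):
    """Exhaustive integer-offset search minimizing the l1-norm image
    difference e(o) = sum_v |F0(v) - F1(v - o)| over a (2w+1)^3 window.
    Periodic shifts (np.roll) are a simplifying assumption."""
    best_e, best_off = np.inf, (0, 0, 0)
    for ox in range(-w, w + 1):
        for oy in range(-w, w + 1):
            for oz in range(-w, w + 1):
                # np.roll(F1, o)[i, j, k] == F1[i - ox, j - oy, k - oz]
                shifted = np.roll(F1, (ox, oy, oz), axis=(0, 1, 2))
                e = np.abs(F0 - shifted).sum()
                if e < best_e:
                    best_e, best_off = e, (ox, oy, oz)
    return best_off
```

With the sequential rejection threshold restored, each candidate's running sum would be abandoned as soon as it exceeds T, which is where the method gains its speed over full cross-correlation.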
Table 1: Ratio Criteria for the Classification of Stubby, Thin, and Mushroom Spines.

L/d_n      d_h/d_n ∈ [0, 1.3)   d_h/d_n ∈ [1.3, 3)   d_h/d_n ∈ [3, ∞)
[0, 2/3)   Stubby               Mushroom             Mushroom
[2/3, 2)   Stubby               Stubby               Stubby
[2, 3)     Stubby               Mushroom             Mushroom
[3, 5)     Thin                 Mushroom             Mushroom
[5, ∞)     Thin                 Thin                 Thin
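Reading Table 1 as rows of L/d_n bins against columns of d_h/d_n bins (our assumed orientation of the table), the lookup can be sketched as:

```python
import bisect

# Half-open bins [a, b) as in Table 1; rows index L/d_n, columns d_h/d_n.
ROW_EDGES = [2 / 3, 2, 3, 5]    # L/d_n bin boundaries
COL_EDGES = [1.3, 3]            # d_h/d_n bin boundaries
TABLE = [
    ["Stubby", "Mushroom", "Mushroom"],
    ["Stubby", "Stubby",   "Stubby"],
    ["Stubby", "Mushroom", "Mushroom"],
    ["Thin",   "Mushroom", "Mushroom"],
    ["Thin",   "Thin",     "Thin"],
]

def classify_spine(L, dn, dh):
    """Classify a spine from its length L, neck diameter dn, and head
    diameter dh using the ratio criteria of Table 1."""
    row = bisect.bisect_right(ROW_EDGES, L / dn)
    col = bisect.bisect_right(COL_EDGES, dh / dn)
    return TABLE[row][col]
```

bisect_right maps a ratio to its half-open bin, so boundary values fall into the upper bin, matching the [a, b) convention of the table.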
Spine volume is measured from the intensity values of the deconvolved gray-scale image. For 2PLSM, the excitation of fluorescence is limited to a sub-femtoliter focal volume (≈ 0.5 × 0.5 × 1.5 μm³) larger than that of individual spines. The intensity value recorded for each voxel in a spine is a sum of the fluorescence from all dye molecules excited within the focal volume; the maximum-intensity voxel near the center of a spine is therefore a measure of the volume of the spine. As the cross-sectional area of a dendrite is typically larger than the maximum cross-sectional area of the focal region, the maximum voxel intensity recorded along the dendrite backbone is a measure of the size of the focal volume, assuming the fluorescence saturates near the center of the dendrite (Svoboda et al., 1996; Sabatini & Svoboda, 2000). We define the spine volume as the ratio of the maximum spine intensity to the maximum dendrite intensity, multiplied by an empirically determined focal volume:

    Spine volume = (maximum spine intensity / maximum dendrite intensity) × focal volume.
We use the classification of spine shapes (stubby, thin, mushroom) given in Harris et al. (1992; see Figure 3). Spine shape is decided based on spine length (L), head diameter (d_h), and neck diameter (d_n). In general terms, for thin spines, spine length should be much greater than the neck diameter (L ≫ d_n), and the head diameter should not exceed the neck diameter by too much. For mushroom spines, spine length should not be excessive, and the head diameter should be much greater than that of the neck (d_h ≫ d_n). For stubby spines, the neck diameter is similar to the length of the spine (d_n ≈ L). The specific criteria adopted in our classification use the ratios L/d_n and d_h/d_n, as summarized in Table 1.

3 Image Acquisition
From a data analysis standpoint, CLSM and 2PLSM provide essentially equivalent challenges. However, to gain an understanding of the dynamics
of neuronal circuits, neurons have to be studied in preparations that are as intact as possible. For many questions of subcellular physiology, the living brain slice offers an attractive compromise between the obvious limitations of cultured dissociated neurons and the experimental difficulties encountered when working with intact animals. One problem with brain slice physiology has been that scattering of light makes traditional optical microscopies, including CLSM, difficult in living tissues. For these reasons, the data analyzed in this article were collected using 2PLSM, which allows high-resolution fluorescence imaging in brain slices up to several hundred microns deep with minimal photodamage (Yuste & Denk, 1995; Svoboda et al., 1996; Denk & Svoboda, 1997; Lendvai et al., 2000). Cultured hippocampal brain slices were prepared as described in Stoppini, Buchs, and Muller (1991) from seven-day-old rats. After five days in vitro, a small subset of neurons was transfected with a plasmid carrying the gene for enhanced green fluorescent protein (GFP). At least two days after transfection, slices were transferred to a perfusion chamber for imaging. Labeled neurons were identified and imaged using a custom-made 2PLSM (Mainen et al., 1999) based on an Olympus Fluoview laser scanning microscope. The light source was a Ti:sapphire laser (Tsunami, Spectra-Physics, Mountain View, CA) running at a wavelength of ≈ 990 nm (repetition frequency 80 MHz; pulse length 150 fs). The average power delivered to the back focal plane of the objective (40×, NA 0.8) varied with imaging depth (range 30–150 mW). Fluorescence was detected in whole-field detection mode with a photomultiplier tube (Hamamatsu Corp., Bridgewater, NJ).

4 Results

4.1 Static Analysis. To validate the automatic spine detection algorithms, an experiment, E1, involving ≈ 200 spine measurements over 15 dendrites of hippocampal CA1 and a small number of CA3 neurons was performed.
The same imaged regions were subjected to both automatic and manual analysis. A total of 174 spines were identified by both methods; an additional 10 spines were identified only by the manual method, and a further 28 spines only by the automatic method. The results of the manual and automated analysis for one of the dendrites in this experiment are illustrated in Figure 5. Twenty-one spines were detected by both methods; three additional, relatively short spines were detected by the automatic method. Figure 6 compares the individual spine lengths, average spine length, and spine density measured for this particular dendrite. Spines 8 and 22 demonstrate the difficulties encountered by both detection methods when two spines appear to overlap. For spine 8, the manual detection identified only the shorter of two overlapping spines, whereas the automatic method identified only the longer. For spine 22, again two spines appear to overlap; they are considered a single spine by the automatic method, whereas the manual method failed to identify either of them.

Figure 5: Comparison of (a) manual and (b) automatic spine detection on a segment of a hippocampal CA1 dendrite (projected view). The automatic procedure detects three more spines (p22, p23, and p24) than the manual procedure. Spines are shaded differently according to their shape classification.

Table 2 compares the mean spine length measured by each method for the population of 174 spines detected in common; the mean lengths for the spines detected by only one of the methods are also presented. For the commonly detected spine population, the manual and automatic spine length measurements agree to within one standard deviation, although the standard deviations are large. (The large standard deviation is partly due to averaging over spines of different shape classifications.) A paired-samples Student's t-test of whether the difference in measurements by the two methods is significant provides a stronger test of the agreement between the two methods of length measurement. Table 3 summarizes the results of the paired t-test; there is no significant difference between the two methods of length measurement.
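The paired-samples t statistic used here has a simple closed form; a minimal self-contained sketch (in practice a library routine such as scipy.stats.ttest_rel would be used, and the data below are illustrative, not from the experiment):

```python
import math

def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom for two
    equal-length lists of paired measurements (e.g., manual vs.
    automatic lengths of the same spines)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)   # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1
```

The resulting t value is compared against the Student t distribution with n − 1 degrees of freedom to obtain the two-sided p-value.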
Figure 6: Comparison of manual and automatic measurements of individual spine length (left), average spine length (center), and spine density (right) for the dendrite shown in Figure 5.

Table 2: Measured Mean Spine Lengths (± Standard Deviation) for the Spines in Experiment E1.

Population Size   Method      Mean Spine Length (μm)
174               Manual      1.05 ± 0.62
174               Automatic   1.08 ± 0.63
10                Manual      0.75 ± 0.57
28                Automatic   0.38 ± 0.28
A one-way ANOVA was used to test whether the observed differences in measured spine length depend on the dendrite of origin. The test produces an F statistic of 0.95 (df_num = 14, df_den = 159) with a p-value of 0.51. Thus, the differences in spine-length measurements between the two methods are uniform across the different dendrites.

Table 3: Paired-Samples t-Test Results for Measured Spine Lengths and Densities, Experiment E1.

                     Spine Length       Spine Density
Degrees of freedom   173                14
t statistic          −1.24              −0.87
p-value              0.22, two-sided    0.40, two-sided
Table 4: Independent-Samples t-Test Results, Experiment E1.

                          Manual (n = 174)           Automatic (n = 174)
Automatic only (n = 28)   df = 200, t = 5.63,        df = 200, t = 5.82,
                          p = 6 × 10⁻⁸, two-sided    p = 2 × 10⁻⁸, two-sided
Manual only (n = 10)      df = 182, t = 1.51,        df = 182, t = 1.65,
                          p = 0.13, two-sided        p = 0.10, two-sided
The mean spine length for the 28 spines detected only by the automatic method indicates a population of shorter spines. The independent-samples t-test results in Table 4 for these 28 spines show that their lengths differ significantly from those obtained by either measurement method for the population of 174 commonly detected spines. The results indicate that the automatic method detects short spines more consistently than the manual method. For the 10 spines detected only by the manual method, the mean length lies midway between those obtained for the common and automatic-only populations. An independent-samples t-test (see Table 4) shows no significant differences from the measurements obtained by either method for the 174 commonly detected spines. The results (df = 36, t = 2.65, p = 0.01, two-sided) of an independent-samples t-test between these 10 spines and the 28 detected only by the automatic method indicate a significant difference between the spine lengths of these two populations. Visual observation of these 10 spines reveals that 7 of them touched the boundary of the image region and were consequently rejected by the automatic method. The remaining 3 were not resolved by the automatic method, as each touched some neighboring spine (which was detected). This reflects the combined effectiveness of the median filter, deconvolution, and simple thresholding algorithms in segmenting the images. Based on visual investigation of the images, we estimate that no more than 2% of the spines were unresolved by the automatic method due to segmentation-related effects. Table 5 compares the mean spine densities separately measured by each method for the 15 dendrites. (For the manual method, this is a total population of 184 spines; for the automatic method, 202 spines.) The density measurements by the two methods agree to within one standard deviation.
For the paired sample of 15 dendrites, a Kolmogorov-Smirnov test shows that the dendrite-by-dendrite difference between the automatic and manual measured densities is very close to normal, so that a paired-samples t-test can be applied. Column 2 of Table 3 summarizes the result; the automatic spine density measurement is not significantly different from the manual one.
Table 5: Measured Mean Spine Density (± Standard Deviation) for the Dendrites in Experiment E1.

Population Size   Method      Mean Spine Density (μm⁻¹)
15                Manual      0.45 ± 0.09
15                Automatic   0.47 ± 0.15
For spine volume measurement and shape classification, we report only automated results, because no manual determination is available. A second experiment, E2, was performed under the same experimental conditions to increase the sample size (E1 + E2) to ≈ 700 spines. The spine volumes were calculated from the ratio of the maximum intensity values of the spine to the dendrite, as described in section 2.5, using an empirically determined focal volume of 0.5 × 0.5 × 1.5 μm³. Figure 7 shows the volume-length correlation plot for the spines measured in experiments E1 and E2, according to their determined classification. The mushroom-shaped spines occupy the widest spectrum of lengths and volumes. The ratio of stubby:mushroom:thin spines is 0.54:0.36:0.10. Table 6 summarizes the average spine volume and length measurements in each shape category and presents a comparison with the SSEM results (Harris et al., 1992, Table 4) on rat hippocampus CA1 cells for postnatal day (PND) 15 animals. All automatic measurements are within 1.5 standard deviations of the SSEM results, though our volume results are generally smaller. Note, however, that the automatic results come from cultured neurons and younger animals. In addition, no corrections for any fixation-induced changes were performed in the SSEM study.
Figure 7: Spine volume-length scatter plots according to determined spine type for all spines in experiments E1 and E2.
Table 6: Comparison of Automated Spine Length and Volume Measurements of Hippocampus Cells in PND 7 Cultured Neurons with the PND 15 SSEM Results of Harris et al. (1992, Table 4).

Measurement    Method      Stubby        Mushroom      Thin
Volume (μm³)   Automatic   0.07 ± 0.04   0.06 ± 0.05   0.06 ± 0.04
               SSEM        0.11 ± 0.07   0.18 ± 0.09   0.05 ± 0.03
Length (μm)    Automatic   0.65 ± 0.37   1.35 ± 0.55   1.38 ± 0.54
               SSEM        0.65 ± 0.38   0.95 ± 0.30   1.40 ± 0.39
4.2 Time-Series Analysis. Time-series data provide the ability to capture dynamic changes in dendritic spine morphology. A series of 50 3D images was taken at 30-second intervals, spanning a time period of 25 minutes. Figure 8a shows the number of spines detected in the images as a function of time using the automatic method. On average, 27.5 ± 2.3 spines were detected per image. We are interested in how frequently any particular spine is observed over time. In total, 52 spines were detected and traced through the time series.

Figure 8: (a) Number of spines detected in the image at each time step. (b) Detection history of 52 spines followed for 25 minutes; for example, spine 1 was detected in each image taken over the 25-minute period; spine 14 was seen sporadically over the entire period.

Figure 9: (a) Measured distributions of spine length and volume as a function of time for the population of 52 spines followed in a time series of images. For each time point, the solid circle indicates the median value; the horizontal hash marks indicate (from bottom to top) the minimum, first and third quartiles, and maximum values. (b) Measured distribution of the spine motility index fitted to an exponential decay function n(m) = n(0) e^(−3.69 m), where n is the number of spines and m is the motility.

Figure 8b shows how the observations of the 52 spines were distributed in time. The spines were indexed (1 to 52) according to the first time at which they appeared. Thus, 27 spines were observed over the full 25 minutes of image taking; among those, 16 were present at all time points. Figure 9a summarizes the distribution of spine length and volume as a function of time. Both length and volume distributions are skewed, with
smaller lengths and volumes dominating, consistent with the stubby:mushroom:thin spine ratios noted above. The dynamics of the spines are measured using a spine motility index, defined as the summed difference in the length of a spine over time, divided by the total number of time steps. Figure 9b plots the distribution of motilities for the 52 spines traced in this series. For this limited data set, the number of spines n decreases with motility m approximately as n(m) = n(0) e^(−3.69 m). A comparison between automated and manual length measurements was made for a limited subset of these time-series data; manual length measurements were made on 10 of the 52 spines. Figure 10 presents comparisons between the automated and manual spine length measurements as a function of time for 5 of these spines, chosen to represent different average lengths. Consistently longer lengths were measured by the manual method for the longer spines (1 and 2). For the medium-length spines (3 and 4), the manual and automated results are very similar. For the short spine (5), some deviations are observed; occasionally, the spine was not detected by the automatic method. A paired t-test was performed on this subset of 10 spines to determine whether the average spine length determined by the automated measurement differs significantly from that determined manually. A Kolmogorov-Smirnov test shows that this sample of 10 difference measurements (−0.32 ± 0.17 μm) is very close to normal, so that a paired-samples t-test can be applied. The observed t statistic is −2.79, with p = 0.02, two-sided, revealing a significant average difference. While this is contrary to the results of experiment E1, we note that the manual measurements here were made by a different user than in E1. Pearson correlation was therefore used to test whether the automatic and manual measurements are correlated in time.
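Assuming "summed difference in length" means the summed absolute frame-to-frame change (our reading of the definition), the motility index can be sketched as:

```python
def motility_index(lengths):
    """Spine motility sketch: summed absolute change in spine length
    across a time series, divided by the number of time steps.
    `lengths` is one spine's length at each time point."""
    steps = len(lengths) - 1
    return sum(abs(b - a) for a, b in zip(lengths, lengths[1:])) / steps
```

A perfectly static spine scores 0, and spines that grow and retract repeatedly score higher, matching the exponential fall-off of the distribution in Figure 9b.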
The correlation values (r) obtained for the 10 spines for n D 50 time points range from 0.92 to 0.29, with two-sided p-values ranging from 0.00 to 0.04. Thus, for these 10 spines, signicant correlations between the automatic measurements and the manual measurements in time are observed. We therefore conclude the existence of a systematic bias between the manual and automated measurements for this set of data, which we attribute to a change in the user ’s making the manual measurements. The signicant Pearson correlation, however, indicates that the manual measurements are duplicating the trends found by the automatic measurements. 5 Discussion
The automatic procedure presented here offers an objective and consistent analysis that requires a minimal amount of supervision and makes accessible 3D morphological characterizations of spine length, volume, shape classication, and spine density. Comparison of results on spine length and density between the manual and automatic approach on a large number
Ingrid Y. Y. Koh, et al.
[Figure 10: Lengths of five spines plotted as a function of time, showing the comparison between manual (closed circles) and automated (open circles) measurements; length (microns) versus time (minutes). Pearson correlations (r) for each pair of measurements in time are indicated: Spine 1 (r = 0.92), Spine 2 (r = 0.48), Spine 3 (r = 0.44), Spine 4 (r = 0.51), Spine 5 (r = 0.75).]
of samples has validated the automatic procedure for both static and time-lapse images. The approach is highly automatic. At present, the parameters that require routine adjustment are the region of interest and the segmentation threshold. Other parameters, used in the deconvolution, backbone extraction, spine component elimination, and tracing algorithms, are empirically determined; they remained the same throughout all of the experiments described. Segmentation is crucial to the analysis due to the relatively low intensity associated with small spines and the high intensity of the dendrites. Choosing a critical threshold is important; we find simple thresholding adequate for most images that have been preprocessed by median filtering and deconvolution.
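As a minimal illustration of this preprocessing-plus-thresholding step (a sketch only, not the paper's actual pipeline, and omitting the deconvolution stage), a 3×3 median filter followed by a global threshold can be written as:

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter; edges are handled by reflection padding."""
    padded = np.pad(img, 1, mode="reflect")
    rows, cols = img.shape
    stacked = np.stack([padded[r:r + rows, c:c + cols]
                        for r in range(3) for c in range(3)])
    return np.median(stacked, axis=0)

def segment(img, threshold):
    """Binary segmentation: median filtering followed by a simple
    global threshold, the combination the text finds adequate for
    preprocessed images."""
    return median_filter3(img) >= threshold
```

The median filter suppresses isolated noise voxels while preserving the interior of bright, extended structures, which is why a single global threshold then suffices.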
An Image Analysis Algorithm
The automated analysis greatly enhances speed, consistency, and objectivity. Our timing results show that automated analysis of time-lapse data consisting of 50 images tracking a total of 30 to 50 spines takes about 4 hours of CPU time on a Pentium III 500 MHz processor, whereas an experienced user will take 12 to 16 hours for manual analysis. For static images, the time savings compared to manual analysis may not be as significant if experimental conditions vary so much that new parameters have to be determined for each image to be analyzed. However, automated analysis still provides more detailed, complete, and objective quantification than manual analysis. The automatic procedure assumes that the spines are simply connected to the dendrites. Small looping structures in the medial axis indicate pairs of spines that are too close to be resolved by imaging or segmentation. If it satisfies the protrusion criterion, each such structure will be detected as a single entity. Resolution of such structures as paired spines is currently not implemented. We estimate these occurrences to affect less than 2% of the spine population. Our program enables calculation of spine volume, previously not possible with manual analysis. Volumetric measurements offer insight into the electrical capacity of the spines and the structural and electrophysiological properties of neuronal dendrites. Spine volume is therefore an important parameter for characterizing dendritic spines, yet it has not been reported previously in any of the automatic methods cited in section 1. Our volume measurements agree reasonably well with an SSEM analysis of similarly aged animals, even though the preparation techniques for the specimens are entirely different. The various morphology-based measurement capabilities presented here enable the investigation of the functional significance of dendritic spines and their plasticity under a wide spectrum of experimental and pathological conditions.
Automatic morphometry will significantly improve the scale and accuracy of such studies.

Acknowledgments
We thank Bernardo L. Sabatini for valuable comments and suggestions for the interface of the software. This work is supported by the Pew and Whitaker Foundations and NIH. K. Z. is a Merck Fellow of the Helen Hay Whitney Foundation.

References

Barnea, D. I., & Silverman, H. F. (1972). A class of algorithms for fast image registration. IEEE Trans. Computers, C-21, 179–186.
Carlbom, I., Terzopoulos, D., & Harris, K. M. (1994). Computer-assisted registration, segmentation, and 3-D reconstruction from images of neuronal tissue sections. IEEE Trans. Med. Imaging, 13, 351–362.
Denk, W., Strickler, J. H., & Webb, W. W. (1990). Two-photon laser scanning microscopy. Science, 248, 73–76.
Denk, W., & Svoboda, K. (1997). Photon upsmanship: Why multiphoton imaging is more than a gimmick. Neuron, 18, 351–357.
Duan, H., He, Y., Wicinski, B., Morrison, J. H., & Hof, P. R. (2000). Age-related dendrite and spine changes in corticocortically projecting neurons in macaque monkeys. Soc. Neurosci. Abstr., 26, 1237.
Eigen, M., & Rigler, R. (1994). Sorting single molecules: Applications to diagnostics and evolutionary biotechnology. Proc. Natl. Acad. Sci. USA, 91, 5740–5747.
Engert, F., & Bonhoeffer, T. (1999). Dendritic spine changes associated with hippocampal long-term synaptic plasticity. Nature, 399, 66–70.
Fiala, J. C., Feinberg, M., Popov, V., & Harris, K. M. (1998). Synaptogenesis via dendritic filopodia in developing hippocampal area CA1. J. Neurosci., 18, 8900–8911.
Gibson, S. F., & Lanni, F. (1991). Experimental test of an analytical model of aberration in an oil-immersion objective lens used in three-dimensional light microscopy. J. Opt. Soc. Am. A, 8, 1601–1613.
Harris, K. M., Jensen, F. E., & Tsao, B. (1992). Three-dimensional structure of dendritic spines and synapses in rat hippocampus (CA1) at postnatal day 15 and adult ages: Implications for the maturation of synaptic physiology and long-term potentiation. J. Neurosci., 12, 2685–2705.
Herzog, A., Krell, G., Michaelis, B., Wang, J., Zuschratter, W., & Braun, K. (1997). Restoration of three-dimensional quasi-binary images from confocal microscopy and its application to dendritic trees. In C. J. Cogswell, J. Conchello, & T. Wilson (Eds.), Proc. SPIE, Three-Dimensional Microscopy: Image Acquisition and Processing IV (pp. 146–157).
Holmes, T. J., Bhattacharyya, S., Cooper, J. A., Hanzel, D., Krishnamurthi, V., Lin, W., Roysam, B., Szarowski, D. H., & Turner, J. N. (1995). Light microscopic images reconstructed by maximum likelihood deconvolution. In J. B. Pawley (Ed.), Handbook of biological confocal microscopy (2nd ed., pp. 389–402). New York: Plenum Press.
Horner, C. H. (1993). Plasticity of the dendritic spine. Prog. Neurobiol., 41, 281–321.
Hosokawa, T., Rusakov, D. A., Bliss, T. V. P., & Fine, A. (1995). Repeated confocal imaging of individual dendritic spines in the living hippocampal slice: Evidence for changes in length and orientation associated with chemically induced LTP. J. Neurosci., 15, 5560–5573.
Kawata, S., & Ichioka, Y. (1980). Iterative image restoration for linearly degraded images. I. Basis and II. Reblurring procedure. J. Opt. Soc. Am., 70, 762–772.
Kilborn, K., & Potter, S. M. (1998). Delineating and tracking hippocampal dendritic spine plasticity using neural network analysis of two-photon microscopy. Soc. Neurosci. Abstr., 24, 422–425.
Lagendijk, R. L., & Biemond, J. (1991). Iterative identification and restoration of images. Norwell, MA: Kluwer Academic.
Lee, T. C., Kashyap, R. L., & Chu, C. N. (1994). Building skeleton models via 3-D medial surface/axis thinning algorithms. CVGIP: Graph. Models Image Process., 56, 462–478.
Lendvai, B., Stern, E. A., Chen, B., & Svoboda, K. (2000). Experience-dependent plasticity of dendritic spines in the developing rat barrel cortex in vivo. Nature, 404, 876–881.
Leymarie, R., & Levine, M. D. (1992). Simulating the grassfire transform using an active contour model. IEEE Trans. Pattern Anal. Mach. Intell., 14, 56–75.
Lindquist, W. B., & Venkatarangan, A. (1999). Investigating 3-D geometry of porous media from high resolution images. Phys. Chem. Earth (A), 25, 593–599.
Lindquist, W. B., Venkatarangan, A., Dunsmuir, J., & Wong, T. (2000). Pore and throat size distributions measured from synchrotron X-ray tomographic images of Fontainebleau sandstones. J. Geophys. Research, 105B, 21508–21528.
Mainen, Z. F., Maletic-Savatic, M., Shi, S. H., Hayashi, Y., Malinow, R., & Svoboda, K. (1999). Two-photon imaging in living brain slices. Methods, 18, 231–239.
Maletic-Savatic, M., Malinow, R., & Svoboda, K. (1999). Rapid dendritic morphogenesis in CA1 hippocampal dendrites induced by synaptic activity. Science, 283, 1923–1927.
Moser, M. B., Trommald, M., & Anderson, P. (1994). An increase in dendritic spine density on hippocampal CA1 pyramidal cells following spatial learning in adult rats suggests the formation of new synapses. Proc. Natl. Acad. Sci. USA, 91, 12673–12675.
Pal, N. R., & Pal, S. K. (1993). A review on image segmentation techniques. Pattern Recognition, 26, 1277–1294.
Pawley, J. B. (Ed.). (1995). Handbook of biological confocal microscopy (2nd ed.). New York: Plenum Press.
Pratt, W. K. (1991). Digital image processing (2nd ed.). New York: Wiley-Interscience.
Purpura, D. P. (1974). Dendritic spine dysgenesis and mental retardation. Science, 186, 1126–1128.
Rusakov, D. A., & Stewart, M. G. (1995). Quantification of dendritic spine populations using image analysis and a tilting disector. J. Neurosci. Methods, 60, 11–21.
Sabatini, B. L., & Svoboda, K. (2000). Analysis of calcium channels in single spines using optical fluctuation analysis. Nature, 408, 589–593.
Shaw, P. J. (1995). Comparison of wide-field/deconvolution and confocal microscopy for 3-D imaging. In J. B. Pawley (Ed.), Handbook of biological confocal microscopy (2nd ed., pp. 373–387). New York: Plenum Press.
Spacek, J. (1994). Design CAD-3D: A useful tool for 3-dimensional reconstructions in biology. J. Neurosci. Methods, 53, 123–124.
Stoppini, L., Buchs, P. A., & Muller, D. A. (1991). A simple method of organotypic cultures of nervous tissue. J. Neurosci. Methods, 37, 173–182.
Svoboda, K., Denk, W., Kleinfeld, D., & Tank, D. W. (1997). In vivo dendritic calcium dynamics in neocortical pyramidal neurons. Nature, 385, 161–165.
Svoboda, K., Tank, D. W., & Denk, W. (1996). Direct measurement of coupling between dendritic spines and shafts. Science, 272, 716–719.
Toni, N., Buchs, P. A., Nikonenko, I., Bron, C. R., & Muller, D. (1999). LTP promotes formation of multiple spine synapses between a single axon terminal and a dendrite. Nature, 402, 421–425.
Tukey, J. W. (1971). Exploratory data analysis. Reading, MA: Addison-Wesley.
Watzel, R., Braun, K., Hess, A., Scheich, H., & Zuschratter, W. (1995). Detection of dendritic spines in 3-dimensional images. DAGM-Symposium Bielefeld, 160–167.
Wearne, S. L., Straka, H., & Baker, R. (2000). Signal processing in goldfish precerebellar neurons exhibits eye velocity and storage. Soc. Neurosci. Abstr., 26, 7.
Yang, H. (2000). A geometric analysis on 3D fiber networks from high resolution images. In Proc. of the 2000 Int. Nonwovens Tech. Conf. Cary, NC: International Nonwovens and Disposables Association.
Yuste, R., & Denk, W. (1995). Dendritic spines as basic functional units of neuronal integration. Nature, 375, 682–684.

Received January 17, 2001; accepted October 1, 2001.
LETTER
Communicated by Geoffrey Goodhill
Multiplicative Synaptic Normalization and a Nonlinear Hebb Rule Underlie a Neurotrophic Model of Competitive Synaptic Plasticity

T. Elliott
[email protected]
N. R. Shadbolt
[email protected]
Department of Electronics and Computer Science, University of Southampton, Highfield, Southampton, SO17 1BJ, U.K.

Synaptic normalization is used to enforce competitive dynamics in many models of developmental synaptic plasticity. In linear and semilinear Hebbian models, multiplicative synaptic normalization fails to segregate afferents whose activity patterns are positively correlated. To achieve this, the biologically problematic device of subtractive synaptic normalization must be used instead. Our own model of competition for neurotrophic support, which can segregate positively correlated afferents, was developed in part in an attempt to overcome these problems by removing the need for synaptic normalization altogether. However, we now show that the dynamics of our model decompose into two decoupled subspaces, with competitive dynamics being implemented in one of them through a nonlinear Hebb rule and multiplicative synaptic normalization. This normalization is "emergent" rather than imposed. We argue that these observations permit biologically plausible forms of synaptic normalization to be viewed as abstract and general descriptions of the underlying biology in certain scaleless models of synaptic plasticity.
1 Introduction
Activity-dependent competition between afferent neurons for control of target neurons is a ubiquitous feature of mammalian neuronal development (Purves, 1994). These competitive interactions are thought to lead, for example, to the development of ocular dominance columns (ODCs)—interdigitated domains of control by the left and right eyes—in the primary visual cortex of higher mammals such as Old World monkeys and cats (Hubel & Wiesel, 1962; LeVay, Stryker, & Shatz, 1978; LeVay, Wiesel, & Hubel, 1980). Understanding the mechanisms underlying competition in the nervous system is of central importance to developmental neuroscience, both experimentally and theoretically.

Neural Computation 14, 1311–1322 (2002) © 2002 Massachusetts Institute of Technology
Theoretically, several approaches to competition have been developed (for reviews, see Swindale, 1996; van Ooyen, 2001), including the Bienenstock-Cooper-Munro model (Bienenstock, Cooper, & Munro, 1982), neurotrophic models (e.g., Harris, Ermentrout, & Small, 1997; Elliott & Shadbolt, 1998a, 1998b), covariance models (Sejnowski, 1977; Linsker, 1986a, 1986b, 1986c), and various other approaches (e.g., Swindale, 1980; Fraser & Perkel, 1989; Montague, Gally, & Edelman, 1991; Tanaka, 1991). But perhaps the most popular models are based on linear or semilinear Hebbian rules coupled with various forms of synaptic normalization as a means of enforcing competitive dynamics (von der Malsburg, 1973; Miller, Keller, & Stryker, 1989; Goodhill, 1993). Although multiplicative synaptic normalization (von der Malsburg, 1973) is not biologically implausible, it does not lead to afferent segregation in the presence of positive correlations in the activity patterns between afferent cells. To segregate afferents in the presence of positive correlations, subtractive synaptic normalization must be used instead (Goodhill & Barrow, 1994; Miller & MacKay, 1994). Whether models actually need to be able to segregate positively correlated afferents in order to be biologically relevant is currently a moot question. ODCs in Old World monkeys develop prior to birth and are adultlike at birth (Horton & Hocking, 1996). Recent data have also questioned whether, as previously assumed, ODCs develop in the ferret after eye opening (Crowley & Katz, 1999, 2000). In the cat, however, despite data revealing non-Hebbian developmental processes (Crair, Gillespie, & Stryker, 1998), it is still tenable to assume that ODCs develop after eye opening, and therefore in the presence of positively correlated inter-ocular images. Thus, models of at least cat ODC development must be able to segregate positively correlated afferents in a plausible manner. 
In previous work, we have criticized the use of synaptic normalization on two grounds. First, we argued (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott, Maddison, & Shadbolt, 2001) that synaptic normalization simply describes rather than seeks to explain the nature of competition in the nervous system. Second, we have argued (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott et al., 2001) that subtractive synaptic normalization is biologically implausible, for reasons explained later. Inspired by experimental results implicating neurotrophic factors in activity-dependent synaptic competition (reviewed in McAllister, Katz, & Lo, 1999), and in an attempt to overcome the difficulties of synaptic normalization, we have built a mathematical model based on activity-dependent competition for neurotrophic factors involving anatomical plasticity, and later we extended the model to include simultaneous physiological plasticity (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott et al., 2001). Critically, our neurotrophic model segregates positively correlated afferents, as required for application to the development of ODCs (Elliott & Shadbolt, 1998b).

In this article, we present the results of further analysis of our neurotrophic model. We have found that the model's dynamics decouple into two essentially independent subspaces, with the competitive dynamics residing exclusively in one subspace. In this latter subspace, we show that a nonlinear Hebb (or Hebb-like) rule governs synaptic growth. To our surprise, and contradicting widely held beliefs about the capacity of multiplicative synaptic normalization to segregate positively correlated afferents, we find that competition is implemented through multiplicative synaptic normalization. This normalization, however, is not imposed, but, in a sense that we shall explain, is "emergent." We then argue that the key feature of our model that allows this Hebb-like, synaptic normalization description is that the dynamics in the competitive subspace are independent of an overall synaptic scale (i.e., the absolute number of synapses). We suggest that many such models may satisfy this property, thus perhaps sanctioning the use of synaptic normalization as an acceptable, abstract characterization of the underlying competitive process.

2 Reformulations of the Model
In this section, we first write down our basic model of anatomical, competitive developmental synaptic plasticity based on competition for neurotrophic support and then state a number of key results that, in the nonlinear Hebb formulation, will be seen to be rather more transparent. We then introduce a change of variables, from which an energy function reformulation of the model will reveal two basically decoupled sets of dynamics: one competitive set leading to afferent segregation and one noncompetitive set determining an overall synaptic scale. Finally, we derive the nonlinear Hebb rule formulation, using the decoupled dynamics of the energy function formulation to extract an "emergent" normalization process.

2.1 The Basic Model. Let afferent cells be labeled by letters such as i and j and target cells be labeled by letters such as x and y. Let the number of synapses between afferent cell i and target cell x be s_xi, and let the activity of afferent cell i be a_i ∈ [0, 1]. Then the basic equation governing the time evolution of s_xi is given by

$$\frac{ds_{xi}}{dt} = \epsilon s_{xi} \left[ \frac{a + a_i}{\sum_j s_{xj}(a + a_j)} \sum_y \Delta_{xy} \left( T_0 + T_1 \frac{\sum_j s_{yj} a_j}{\sum_j s_{yj}} \right) - 1 \right] \tag{2.1}$$

(see Elliott & Shadbolt, 1998a, for a detailed derivation and justification). The quantities T_0 and T_1 represent, respectively, an activity-independent and maximum activity-dependent release of NTFs by target cells; a represents a resting uptake term of NTFs by afferents; and ε is an overall learning rate. The function Δ_xy embodies lateral interactions between target cells x and y. In previous work, we have considered Δ_xy to arise only through the diffusion of NTFs between target cells, so that Δ_xy ≥ 0 ∀x, y (Elliott & Shadbolt, 1998a, 1998b, 1999). However, we can also consider Δ_xy to arise from both excitatory (Δ_xy > 0) and inhibitory (Δ_xy < 0) lateral synaptic interactions between target cells, either enhancing or reducing the release
of NTFs by target cells. In this case, we can ignore NTF receptor dynamics, which in previous work we have also considered (Elliott & Shadbolt, 1998a). For convenience, and without much loss of generality, we will assume that Σ_y Δ_xy = 1 ∀x.

The quantity c = T_0/(aT_1) is a critical parameter in our model. Previous work has shown that for Δ_xy = δ_xy (the Kronecker delta), when c < 1 afferent segregation occurs (that is, all but one of the s_xi go to zero at each target cell x), while for c > 1, afferent segregation breaks down (Elliott & Shadbolt, 1998a). For c < 1, segregation occurs for all but perfectly correlated afferent activity patterns (Elliott & Shadbolt, 1998a). In the presence of a general Δ_xy, the critical value of c is reduced below unity and also becomes a function of afferent correlations, so that too-extensive (positive) lateral interactions or too-strong (but not perfect) afferent activity correlations can lead to a breakdown of afferent segregation (unpublished observations). The parameter a also plays a direct role in segregation. As a → ∞, the rate of segregation tends to zero. In the limit, the ratio of the number of synapses supported by any pair of afferents on a given target cell remains fixed (Elliott & Shadbolt, 1998a).

2.2 Energy Function Formulation. Introducing the new variables s_x^+ = Σ_i s_xi and v_xi = s_xi/s_x^+, so that Σ_i v_xi ≡ 1, equation 2.1 can be rewritten as

$$s_x^+ \frac{dv_{xi}}{dt} = \epsilon T_1 v_{xi} \frac{\sum_j (a_i - a_j) v_{xj}}{\sum_j (a + a_j) v_{xj}} \sum_{yj} \Delta_{xy} (ac + a_j) v_{yj} \tag{2.2}$$

and

$$\frac{ds_x^+}{dt} + \epsilon s_x^+ = \epsilon T_1 \sum_{yj} \Delta_{xy} (ac + a_j) v_{yj}. \tag{2.3}$$

We now average over the ensemble of afferent activity patterns. To achieve this, we assume, for tractability, that for n distinct afferents, i = 1, ..., n, there are just n distinct activity patterns, with pattern number i being defined by a_i = 1 and a_j = p ∀j ≠ i, with p ∈ [0, 1]. Defining m as the average activity of an afferent, so that nm = 1 + (n − 1)p, and introducing the parameter r = (1 − p)/(a + p), we obtain after some algebra

$$n s_x^+ \frac{dv_{xi}}{dt} = \epsilon T_1 v_{xi} \left[ a(c - 1) \left( \sum_j \frac{1 + r\delta_{ij}}{1 + r v_{xj}} - n \right) + (1 - p) \sum_j \frac{1 + r\delta_{ij}}{1 + r v_{xj}} \sum_y \Delta_{xy} (v_{yj} - v_{xj}) \right] \tag{2.4}$$
and

$$\frac{ds_x^+}{dt} + \epsilon s_x^+ = \epsilon T_1 (ac + m), \tag{2.5}$$

where, in these two equations, we have dropped for notational convenience the ⟨⟩ brackets on the variables v_xi and s_x^+, these brackets denoting ensemble averaging. Restricting to n = 2 afferents for increased tractability and writing v_x = 2v_xi − 1 for any one of the two afferents i, we then obtain

$$s_x^+ \frac{dv_x}{dt} = \epsilon T_1 r^2 \frac{1 - v_x^2}{(2 + r)^2 - r^2 v_x^2} \sum_y \hat{D}_{xy} v_y, \tag{2.6}$$

$$\frac{ds_x^+}{dt} = \epsilon [T_1 (ac + m) - s_x^+], \tag{2.7}$$

where

$$\hat{D}_{xy} = (a + m) \Delta_{xy} - (ac + m) \delta_{xy}. \tag{2.8}$$

From equations 2.6 and 2.7, we easily obtain a Lyapunov or energy function E, such that dE/dt ≤ 0 always, where

$$E = E_S + E_C, \tag{2.9}$$

with

$$E_S = \frac{1}{2} \sum_x [T_1 (ac + m) - s_x^+]^2 \tag{2.10}$$

and

$$E_C = -\frac{1}{2} \sum_{xy} v_x \hat{D}_{xy} v_y. \tag{2.11}$$

In fact, this form for E generalizes rigorously to n > 2 afferents, as we shall demonstrate elsewhere. The stable solutions of equations 2.6 and 2.7, and thus the solutions of the unaveraged equation 2.1 when ε is sufficiently small that the s_xi change slowly compared to the afferent activities a_i, correspond to the minima of E. E cleanly decomposes into two pieces that may be minimized independently. E_S merely sets the overall scale for s_x^+ and is minimized directly by setting s_x^+ = T_1(ac + m) ∀x. The dynamics embodied in this minimization (and therefore the solutions of equations 2.5 or 2.7) are therefore trivial and uninteresting. Minimization of E_C, on the other hand, with v_x ∈ [−1, 1] corresponding to a target cell "spin" variable denoting control by one afferent (negative v_x) or the other (positive v_x), encapsulates the competitive dynamics of the model. The eigenvalues of the matrix D̂ entirely determine the character of these dynamics, and E_C is, of course, exactly the Hamiltonian of a spin glass (with v_x ∈ {−1, +1}).
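The decoupling can be illustrated numerically. The sketch below (illustrative code with arbitrarily chosen parameter values, not from the paper) integrates equations 2.6 and 2.7 for a single target cell with Δ_xy = δ_xy, so that D̂ reduces to a(1 − c); it checks that E never increases along the trajectory and that, for c < 1, the cell segregates (|v_x| → 1) while the scale settles at s_x^+ = T_1(ac + m):

```python
import numpy as np

# Illustrative parameter values (not from the paper): c < 1 so that
# segregation should occur; single target cell, Delta_xy = delta_xy.
a, c, p, n = 1.0, 0.5, 0.3, 2
T1, eps = 1.0, 0.1
m = (1.0 + (n - 1) * p) / n                 # mean afferent activity: n*m = 1 + (n-1)*p
r = (1.0 - p) / (a + p)
D_hat = (a + m) - (a * c + m)               # eq. 2.8 with Delta = delta: a*(1 - c)

v, s = 0.2, 0.5                             # initial "spin" and synaptic scale

def energy(v, s):
    ES = 0.5 * (T1 * (a * c + m) - s) ** 2  # eq. 2.10 (one target cell)
    EC = -0.5 * D_hat * v * v               # eq. 2.11
    return ES + EC

E_prev = energy(v, s)
for _ in range(5000):                       # forward-Euler steps (unit time step)
    dv = eps * T1 * r**2 * (1 - v**2) / ((2 + r)**2 - r**2 * v**2) * D_hat * v / s
    ds = eps * (T1 * (a * c + m) - s)       # eq. 2.7
    v, s = v + dv, s + ds
    E = energy(v, s)
    assert E <= E_prev + 1e-12              # Lyapunov property: E never increases
    E_prev = E

print(round(abs(v), 3), round(abs(s - T1 * (a * c + m)), 3))   # -> 1.0 0.0
```

The run confirms the two decoupled behaviors: the scale variable relaxes trivially to T_1(ac + m), while the spin variable is driven to a corner of its interval, that is, to full control by one afferent.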
2.3 Nonlinear Hebb Rule Formulation. We now return to n afferents in the unaveraged system defined by equations 2.2 and 2.3 and consider one target cell only, so that we examine competition on a pointwise basis. We then may set Δ_xy = δ_xy and drop the x subscript on the s_x^+ and v_xi (and s_xi) variables. Our results, however, easily generalize to multiple target cells with a general Δ_xy. The energy function analysis above reveals that the n independent degrees of freedom in the n s_i variables decompose into n − 1 degrees of freedom in the n scaleless v_i variables (Σ_i v_i = 1, by construction), which entirely capture the competitive dynamics of the model, and one degree of freedom in the s^+ variable, which entirely captures the scaling dynamics of the model. As argued above, these latter scaling dynamics are quite trivial and uninteresting, and we may discard them without any important loss of generality, restricting attention to the (n − 1)-dimensional competitive subspace only. Hence, our basic equation is just

$$\frac{dv_i}{dt} = v_i \left( \frac{ac + \sum_j v_j a_j}{a + \sum_j v_j a_j} \right) \sum_j (a_i - a_j) v_j, \tag{2.12}$$

with Σ_i v_i = 1, where we have absorbed a factor of εT_1 into a redefinition of time. In this subspace, we have a synaptic growth rule, equation 2.12, which automatically normalizes the v_i such that Σ_i v_i = 1 always. We stress, however, that this normalization is the result of the definition of the v_i variables (v_i = s_i/s^+) and the fact that these variables are key in capturing the competitive dynamics, with the scaling dynamics decoupling. Nevertheless, from a purely mathematical point of view, we may ask what the form of the growth rule underlying equation 2.12 is and how the "emergent" normalization Σ_i v_i = 1 is maintained. Defining

$$P = \frac{ac + \sum_j v_j a_j}{a + \sum_j v_j a_j}, \tag{2.13}$$

which is a purely postsynaptic although nonlinear term, and

$$p_i = a_i v_i, \tag{2.14}$$

which is a presynaptic term, we may rewrite equation 2.12 as

$$\frac{dv_i}{dt} = p_i P - v_i \left( \sum_j p_j \right) P. \tag{2.15}$$

The first term on the right-hand side (RHS) is a nonlinear Hebb growth rule, where, for the moment, we define a Hebb growth rule as any synaptic
growth rule that is expressible as the product of a presynaptic term and a purely postsynaptic term. Were this the only term on the RHS of equation 2.15, it would induce the unconstrained synaptic growth characteristic of Hebb rules. How, then, is this unconstrained growth forced to remain in the (n − 1)-dimensional subspace in which Σ_i v_i = 1? Were we to impose this through multiplicative synaptic normalization, we would modify the Hebb rule by subtracting from the unconstrained growth term a term proportional to v_i and such that this additional term forces Σ_i dv_i/dt = 0. In our case, this term would be v_i(Σ_j p_j)P, which is exactly the second term on the RHS of equation 2.15. Equation 2.15 therefore represents a nonlinear Hebb rule, with a nonlinear postsynaptic term P, together with multiplicative synaptic normalization. Yet this model segregates afferents for all but perfectly correlated afferent activity patterns (when Δ_xy = δ_xy), explicitly contradicting the widely held view that in order to segregate positively correlated afferents, multiplicative normalization fails and subtractive normalization must be used (Goodhill & Barrow, 1994; Miller & MacKay, 1994). Furthermore, it contradicts, or at least dramatically reduces, the significance of claims that nonlinear Hebbian dynamics are reducible to linear Hebbian dynamics (Miller, 1990), for linear dynamics will not segregate positively correlated afferents under multiplicative normalization (Goodhill & Barrow, 1994; Miller & MacKay, 1994). We discuss these issues more fully later.

The form of P allows us to see more readily the model's behavior in certain limits. When c = 1, we have that P ≡ 1. In this case, equation 2.15 is purely presynaptic, and so we would expect, as in fact observed (Elliott & Shadbolt, 1998a), chaotic oscillations in the v_i as they grow or decay independently of each other (except through Σ_i v_i = 1). In the limit a → ∞ with ac held fixed (because ac = T_0/T_1, with T_0 and T_1 being parameters independent of a), P → 0. Hence, as a grows, the v_i evolve more slowly, so that segregation slows down, and, in the limit, the v_i are constant in time. When c < 1, P < 1 and is a monotone increasing function of Σ_j v_j a_j on [0, ∞) (although the actual range of this sum is just [0, 1]), while for c > 1, P > 1 and is a monotone decreasing function of Σ_j v_j a_j. P should therefore not in general be regarded as a postsynaptic firing rate, but rather as a postsynaptic "plasticity rate," which, of course, depends on the summed postsynaptic response, Σ_j v_j a_j. While we have defined a Hebb rule as any synaptic growth rule that is expressible as the product of a presynaptic and a postsynaptic term, many Hebb rules are often stated in more restricted forms involving only correlations between presynaptic and postsynaptic activity. For c < 1, because P is a monotone increasing function of Σ_j v_j a_j, our Hebb rule definition for our model is equivalent to the more restricted definition. Hence, for c < 1, equation 2.15 is Hebbian in both senses, and because our model segregates afferents only for c < 1, this justifies our use
of our more general definition of a Hebbian model. However, for c > 1, P is monotone decreasing. Therefore, the growth rule in equation 2.15, although Hebbian by our definition, is not Hebbian according to the more restricted definition. Indeed, for c > 1, equation 2.15 would often be regarded as an anti-Hebbian rule, rewarding anticorrelations between presynaptic and postsynaptic activity. We thus see the dynamical importance of the point c = 1: it corresponds to the transition in the model between a (classical) Hebbian rule and a (classical) anti-Hebbian rule.

3 Discussion
We have extended our earlier analysis of our neurotrophic model of anatomical, competitive synaptic plasticity, which can segregate afferents in the presence of positively correlated afferent activity patterns, and have shown that a change of variables reveals two essentially decoupled sets of dynamics. One set determines an overall synaptic scale and can be discarded without any important loss of generality. The other set completely determines the competitive dynamics underlying the model, independent of the synaptic scale. These competitive dynamics can be viewed in two contrasting ways: as the dynamics of a spin glass, which we have not pursued at any length here, or, formulated precisely, as a nonlinear Hebbian model with synaptic normalization implemented multiplicatively (cf. Wiskott & Sejnowski, 1998). In contrast to many other models of synaptic competition, this normalization is not imposed, but rather is "emergent," in the sense that the interesting, competitive interactions of the model restrict themselves dynamically to a lower-dimensional subspace in which a set of variables can be found that fully capture these dynamics and satisfy a normalization constraint. Critically, although synaptic normalization is implemented multiplicatively in this lower-dimensional subspace, our model can segregate positively correlated afferents. Indeed, in the absence of lateral interactions between target cells (so that Δ_xy = δ_xy), the model can segregate all but perfectly correlated afferents. The fact that a nonlinear Hebb rule and multiplicative synaptic normalization can segregate even strongly positively correlated afferents is intriguing.

It is well known that multiplicative normalization together with a linear Hebb rule, or a linear Hebb rule coupled with nonlinearities such as a winner-take-all mechanism, cannot segregate positively correlated afferents (von der Malsburg, 1973; Goodhill & Barrow, 1994; Miller & MacKay, 1994) and that subtractive rather than multiplicative normalization must be used (Miller et al., 1989; Goodhill, 1993). The fact that certain nonlinear Hebbian models are reducible to linear Hebbian models (Miller, 1990) has led to the widespread belief that, in general, no Hebbian model, linear or nonlinear, can segregate positively correlated afferents under multiplicative normalization. Our results constitute an explicit and constructive counterexample to these beliefs.
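The counterexample is easy to reproduce numerically. The sketch below (illustrative code with arbitrarily chosen parameters, not from the paper) integrates equation 2.12 averaged over the two-pattern ensemble of section 2.2, with a strong positive correlation (p = 0.8) and c < 1: despite the positive correlations and the underlying multiplicative normalization, one afferent takes over the target cell.

```python
import numpy as np

# Illustrative parameters: c < 1, strongly positively correlated patterns (p = 0.8).
a, c, p = 1.0, 0.5, 0.8
dt, steps = 0.1, 100000
patterns = np.array([[1.0, p], [p, 1.0]])   # pattern i: afferent i fully active

v = np.array([0.55, 0.45])                  # slight initial bias toward afferent 1
for _ in range(steps):
    dv = np.zeros(2)
    for act in patterns:                    # average equation 2.12 over the ensemble
        s = np.dot(v, act)                  # summed postsynaptic response
        P = (a * c + s) / (a + s)           # nonlinear postsynaptic term (eq. 2.13)
        dv += v * P * (act - s)             # eq. 2.12, using sum_j (a_i - a_j)v_j = a_i - s
    v += dt * dv / len(patterns)

print(np.round(v, 3))                       # afferent 1 wins; afferent 2 is eliminated
```

Note that the update conserves Σ_i v_i exactly, since Σ_i v_i P(a_i − s) = P(s − s) = 0 on the simplex: the multiplicative normalization is built into the dynamics rather than applied as a separate renormalization step.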
We have frequently criticized synaptic normalization for being a mathematical device that simply imposes rather than seeks to illuminate synaptic competition (Elliott & Shadbolt, 1998a, 1998b, 1999; Elliott et al., 2001). Even if we accept that synaptic normalization underlies competition in the nervous system and accept that normalization could arise from decay or homeostatic mechanisms controlling synaptic efcacy (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998), it appears to us implausible to assume, as required by subtractive normalization, that these mechanisms should regulate synaptic efcacy in a fashion that is independent of the concentrations of any of the presynaptic or postsynaptic components that together determine synaptic efcacy. If a model of synaptic plasticity requires an arguably implausible form of synaptic normalization to segregate positively correlated afferents, then perhaps that model should be regarded as implausible. Prior to our results above, such a conclusion would have been unpalatable, as it would have left a vacuum in the space of conventional competitive, synaptic normalization-based Hebbian models that can segregate positively correlated afferents. A few other models will segregate positively correlated afferents (e.g., Bienenstock et al., 1982; Harris et al., 1997), but these do not enforce hard constraints via synaptic normalization on summed synaptic efcacy. Our present model, in its nonlinear Hebb reformulation, lls this vacuum by being an explicit, constructive example of a simple Hebbian model that uses multiplicative normalization to achieve competitive dynamics. This model is almost certainly not unique: presumably an innite number of nonlinear Hebbian models exist that are capable of segregating positively correlated afferents using various forms of synaptic normalization that do not require biologically problematic assumptions. 
Our model should therefore be construed, from a mathematical point of view, as merely an existence proof of one such model.

How do we reconcile our previous criticisms of synaptic normalization (of any form) as a mathematical device, and our presentation of our neurotrophic model as a competing alternative, with the results above, showing an underlying multiplicative synaptic normalization in the competitive dynamics of our model? There are two related aspects to our reply. First, the synaptic normalization that we have found was in no way imposed from the outset. After changing variables so as to eliminate an overall synaptic scale, we found that the competitive dynamics reside entirely in the scaleless variables (the v_{xi}), which by definition satisfy a normalization equation, and, moreover, the dynamics of the scaleless variables are basically independent of the dynamics of the scaling variables (the s_{xC}); at least, the coupling is completely trivial and uninteresting and can be ignored without any important loss of generality. In this sense, then, we have referred to this normalization as "emergent." Second, mathematically, it will almost always be possible to perform a change of variables and thereby introduce a set of scaleless variables that satisfy a normalization equation. The only case in which this may not be
T. Elliott and N. R. Shadbolt
possible is when the transformation would introduce possibly singular variables, but we shall ignore this possibility. Whether or not the scaling and scaleless dynamics decouple, it will also always be possible to ask whether the scaleless dynamics can be mathematically separated into a general growth term (a Hebb or Hebb-like term, for example) and a term that maintains the normalization equation (the normalization term that sets the derivative of the summed scaleless variables to zero). If the competitive dynamics reside entirely in the scaleless variables and if the scaleless and scaling variables do not interact in any important fashion, then any such model will possess an underlying synaptic growth rule with competition implemented by some form of synaptic normalization. Given that neurons exhibit such properties as homeostasis and gain control, which can be thought of as mechanisms to eliminate dependence on, or to adjust to, synaptic scale, it is possible that many models of synaptic plasticity can be so reformulated. In this view, in those classes of model of synaptic plasticity in which the underlying competitive dynamics are scale independent, "emergent" synaptic normalization is seen to be inevitable, fully capturing, mathematically speaking, the competitive dynamics. Thus, when rooted in a biologically plausible model of synaptic plasticity, synaptic normalization, provided that its emergent form is not implausible, would appear to be an acceptable, abstract, and general description of the underlying biology.

Can we reverse this statement and argue that we may impose competitive dynamics on any given synaptic growth rule by enforcing a hard normalization constraint? The answer is probably in the affirmative, provided that such a model is regarded as tentative, awaiting a derivation from an underlying model whose scaleless dynamics correspond to the given synaptic growth rule.
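The change of variables described in this passage can be made concrete with a generic sketch (the symbols w_i, s, v_i, and f_i are illustrative and are not the specific variables of the neurotrophic model). For any synaptic growth rule dw_i/dt = f_i(w), define a scale variable and scaleless variables:

```latex
w_i = s\, v_i, \qquad s = \sum_j w_j, \qquad \sum_i v_i = 1,
\]
\[
\dot{v}_i \;=\; \frac{d}{dt}\!\left(\frac{w_i}{s}\right)
        \;=\; \frac{\dot{w}_i}{s} - \frac{w_i\,\dot{s}}{s^2}
        \;=\; \frac{1}{s}\Big( f_i(w) \;-\; v_i \sum_j f_j(w) \Big).
```

Summing over i gives \(\sum_i \dot{v}_i = 0\), so the constraint \(\sum_i v_i = 1\) is preserved automatically, and the conserving term \(-v_i \sum_j f_j\) scales multiplicatively with \(v_i\): a multiplicative normalization that "emerges" from any scale-free reformulation rather than being imposed.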
Of course, if it is found that the synaptic growth rule requires an implausible form of synaptic normalization to achieve afferent segregation in the presence of positive correlations, as we believe that linear and semilinear Hebb rules do, then that particular synaptic growth rule should be discarded.
Acknowledgments
T. E. thanks the Royal Society for the support of a University Research Fellowship.

References

Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Crair, M. C., Gillespie, D. C., & Stryker, M. P. (1998). The role of visual experience in the development of columns in cat striate cortex. Science, 279, 566–570.
Crowley, J. C., & Katz, L. C. (1999). Development of ocular dominance columns in the absence of retinal input. Nature Neurosci., 2, 1124–1130.
Crowley, J. C., & Katz, L. C. (2000). Early development of ocular dominance columns. Science, 290, 1321–1324.
Elliott, T., & Shadbolt, N. R. (1998a). Competition for neurotrophic factors: Mathematical analysis. Neural Comp., 10, 1939–1981.
Elliott, T., & Shadbolt, N. R. (1998b). Competition for neurotrophic factors: Ocular dominance columns. J. Neurosci., 18, 5850–5858.
Elliott, T., & Shadbolt, N. R. (1999). A neurotrophic model of the development of the retinogeniculocortical pathway induced by spontaneous retinal waves. J. Neurosci., 19, 7951–7970.
Elliott, T., Maddison, A. C., & Shadbolt, N. R. (2001). Competitive anatomical and physiological plasticity: A neurotrophic bridge. Biol. Cybern., 84, 13–22.
Fraser, S. E., & Perkel, D. H. (1989). Competitive and positional cues in the patterning of nerve connections. J. Neurobiol., 21, 51–72.
Goodhill, G. J. (1993). Topography and ocular dominance: A model exploring positive correlations. Biol. Cybern., 69, 109–118.
Goodhill, G. J., & Barrow, H. G. (1994). The role of weight normalisation in competitive learning. Neural Comp., 6, 255–269.
Harris, A. E., Ermentrout, G. B., & Small, S. L. (1997). A model of ocular dominance column development by competition for trophic support. Proc. Natl. Acad. Sci. U.S.A., 94, 9944–9949.
Horton, J. C., & Hocking, D. R. (1996). An adult-like pattern of ocular dominance columns in striate cortex of newborn monkeys prior to visual experience. J. Neurosci., 16, 1791–1807.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
LeVay, S., Stryker, M. P., & Shatz, C. J. (1978). Ocular dominance columns and their development in layer IV of the cat's visual cortex: A quantitative study. J. Comp. Neurol., 179, 223–244.
LeVay, S., Wiesel, T. N., & Hubel, D. H. (1980). The development of ocular dominance columns in normal and visually deprived monkeys. J. Comp. Neurol., 191, 1–51.
Linsker, R. (1986a). From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A., 83, 7508–7512.
Linsker, R. (1986b). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. U.S.A., 83, 8390–8394.
Linsker, R. (1986c). From basic network principles to neural architecture: Emergence of orientation columns. Proc. Natl. Acad. Sci. U.S.A., 83, 8779–8783.
McAllister, A. K., Katz, L. C., & Lo, D. C. (1999). Neurotrophins and synaptic plasticity. Annu. Rev. Neurosci., 22, 295–318.
Miller, K. D. (1990). Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Comp., 2, 321–333.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comp., 6, 100–126.
Montague, P. R., Gally, J. A., & Edelman, G. M. (1991). Spatial signaling in the development and function of neural connections. Cereb. Cortex, 1, 199–220.
Purves, D. (1994). Neural activity and the growth of the brain. Cambridge: Cambridge University Press.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.
Swindale, N. V. (1980). A model for the formation of ocular dominance stripes. Proc. Roy. Soc. Lond. Ser. B, 208, 243–264.
Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network, 7, 161–247.
Tanaka, S. (1991). Theory of ocular dominance column formation—mathematical basis and computer simulation. Biol. Cybern., 64, 263–272.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Ooyen, A. (2001). Competition in the development of nerve connections: A review of models. Network, 12, R1–R47.
von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Wiskott, L., & Sejnowski, T. J. (1998). Constrained optimization for neural map formation: A unifying framework for weight growth and normalization. Neural Comp., 10, 671–716.

Received May 11, 2001; accepted September 28, 2001.
LETTER
Communicated by Andreas Andreou
Energy-Efficient Coding with Discrete Stochastic Events

Susanne Schreiber
[email protected]
Institute of Biology, Humboldt-University Berlin, 10115 Berlin, Germany, and Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, U.K.

Christian K. Machens
[email protected]

Andreas V. M. Herz
[email protected]
Institute of Biology, Humboldt-University Berlin, 10115 Berlin, Germany

Simon B. Laughlin
[email protected]
Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, U.K.
We investigate the energy efficiency of signaling mechanisms that transfer information by means of discrete stochastic events, such as the opening or closing of an ion channel. Using a simple model for the generation of graded electrical signals by sodium and potassium channels, we find optimum numbers of channels that maximize energy efficiency. The optima depend on several factors: the relative magnitudes of the signaling cost (current flow through channels), the fixed cost of maintaining the system, the reliability of the input, additional sources of noise, and the relative costs of upstream and downstream mechanisms. We also analyze how the statistics of input signals influence energy efficiency. We find that energy-efficient signal ensembles favor a bimodal distribution of channel activations and contain only a very small fraction of large inputs when energy is scarce. We conclude that when energy use is a significant constraint, trade-offs between information transfer and energy can strongly influence the number of signaling molecules and synapses used by neurons and the manner in which these mechanisms represent information.
1 Introduction
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1323–1346 (2002)

Energy and information are intimately related in all forms of signaling. Cellular signaling involves local movements of ions and molecules, shifts in their concentration, and changes in molecular conformation, all of which require energy. Nervous systems have highly evolved cell signaling mechanisms to gather, process, and transmit information, and the quantities of energy consumed by neural signaling can be significant. In the blowfly retina, the transmission of a single bit of information across one chemical synapse requires the hydrolysis of more than 100,000 ATP molecules (Laughlin, de Ruyter van Steveninck, & Anderson, 1998). The adult human brain accounts for approximately 20% of resting energy consumption (Rolfe & Brown, 1997). Recent calculations suggest that the high rate of energy consumption in cortical gray matter results mainly from the transmission of electrical signals along axons and across synapses (Attwell & Laughlin, 2001). Given that high levels of energy consumption constrain function, it is advantageous for nervous systems to use energy-efficient neural mechanisms and neural codes (Levy & Baxter, 1996; Baddeley et al., 1997; Sarpeshkar, 1998; Laughlin, Anderson, O'Carroll, & de Ruyter van Steveninck, 2000; Schreiber, 2000; Balasubramanian, Kimber, & Berry, 2001; de Polavieja, in press).

We set out to investigate the relationship between energy and information at the level of the discrete molecular events that generate cell signals. Ultimately, information is transmitted by the activation and deactivation of signaling molecules. These are generally single proteins or small complexes that respond to changes in electrical, chemical, or mechanical potential. Familiar neural examples are the opening of an ion channel, the binding of a ligand to a receptor, the activation of a G-protein, and vesicle exocytosis. These events involve the expenditure of energy—for example, to restore ions that flow across the membrane, restore G-proteins to the inactive (GGDP) state, and remove and recycle neurotransmitter. The ability of these events to transmit information is limited by their stochasticity (Laughlin, 1989; White, Rubinstein, & Kay, 2000).
This uncertainty reduces reliability and hence the quantity of transmitted information. To increase information, one must increase the number of events used to transmit the signal; this, in turn, increases the consumption of energy. We investigate this fundamental relationship between information and energy in molecular signaling systems by developing a simple model within the context of neural information processing: a population of ion channels that responds to an input by changing their open probability. We derive the quantity of information transmitted by the population of channels and demonstrate how information varies as a function of the properties of the input and the number of channels in the population. We identify optima that maximize the ratio between transmitted information and cost. These optima depend on the input statistics and, as with spike codes (Levy & Baxter, 1996), the ratio between the costs of generating signals (the signaling cost) and the cost of constructing the system, maintaining it in a state of readiness, and providing it with an input (the fixed costs).

The article is organized as follows. Section 2 introduces the model for a system that transmits information with discrete stochastic signaling events.
In section 3, we define measures of information transfer, energy consumption, and metabolic efficiency for such a system. In section 4, we analyze the dependence of energy efficiency on the number of stochastic units for gaussian input distributions. The influence of additional noise sources is studied in section 5. In section 6, we look at the energy efficiency of combinations of systems, and in section 7, we derive optimal input distributions and show how energy efficiency depends on the number of stochastic units when the distribution of inputs is optimal. Finally, in section 8 we conclude our investigation with an extensive discussion.

2 The Model
We consider an information transmission system with input x and output k. The system has N identical, independent units that are activated and deactivated stochastically. The input x directly controls the probability that a unit is activated. The number of activated units, k, constitutes the output of the system. Because realistic physical inputs are bounded in magnitude, any given distribution of inputs can be mapped in a one-to-one fashion onto the activation probabilities of units, within the interval [0, 1]. We therefore assume, without loss of generality, that x ∈ [0, 1] is equivalent to the probability of being in an activated state. Consequently, the conditional probability that a given input x activates k units is given by a binomial distribution,

$$p(k \mid x) = \binom{N}{k} x^k (1 - x)^{N-k}. \tag{2.1}$$

The variance $\sigma_{k|x}^2$ of this binomial probability distribution,

$$\sigma_{k|x}^2 = N x (1 - x), \tag{2.2}$$

is a measure of the system's transduction accuracy, defining the magnitude of the noise. Note that $\sigma_{k|x}^2$ depends on both the number of available units N and the input x. In an equivalent interpretation, the model can also be considered as a linear input-output system,

$$k = N x + \eta(N, x), \tag{2.3}$$

where Nx is the "deterministic" component of the output and $\eta(N, x)$ represents the noise due to the stochasticity of the units. The noise distribution corresponds to $p(k \mid x)$ shifted to have zero mean; its variance is therefore given by $\sigma_{k|x}^2$. Thus, we see that the input x specifies a noise-free output N·x, to which the noise $\eta$ is added, yielding the integer-valued output k.
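The stochastic-unit model of equations 2.1 through 2.3 is straightforward to simulate directly. The sketch below (plain Python; the function names are illustrative and not from the paper) draws the output k for a given input x and checks that the empirical output variance behaves as equation 2.2 predicts.

```python
# A minimal simulation of the stochastic-unit model of equations 2.1-2.3
# (illustrative sketch; names are not from the paper).
import random
import statistics

def channel_output(N, x, rng):
    """Number of activated units k for an input x in [0, 1]:
    each of the N units opens independently with probability x."""
    return sum(rng.random() < x for _ in range(N))

def empirical_noise_variance(N, x, trials=20000, seed=42):
    """Empirical variance of k, which should approach N*x*(1-x) (eq. 2.2)."""
    rng = random.Random(seed)
    return statistics.pvariance(channel_output(N, x, rng) for _ in range(trials))

N, x = 100, 0.3
print(empirical_noise_variance(N, x), N * x * (1 - x))  # empirical vs. theoretical
```

For N = 100 and x = 0.3, the empirical variance settles near the theoretical value N·x·(1 − x) = 21.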
Figure 1: (A) Schematic view of the membrane model. The input x directly determines the probability that sodium channels are open. In contrast to the stochastic sodium channels, potassium channels are considered nonstochastic at this stage of analysis; independent of the input, a constant fraction p_K is open. (B) Schematic view of two model signaling systems with N = 5 and N = 10 sodium channels, respectively. Note that the ratio of sodium to potassium channels is kept constant (here N_K/N = 1).
2.1 Implementing the Model. For concreteness, we have chosen to implement the model in the context of a basic neural signaling mechanism. A membrane contains two populations of voltage-insensitive channels: one controlling inward current and the other outward current (see Figure 1). For convenience, we refer to these as sodium and potassium channels, respectively. There are N sodium channels and N_K potassium channels. In our analysis, an input, x, produces a voltage response by changing the open probability of the set of sodium channels, which take the role of the stochastic signaling units. The input x could be derived from a variety of sources, both external, such as light, and internal, such as synaptically released transmitter. But regardless of its origins, the input is assumed to be unambiguously represented by the open probability of sodium channels. For simplicity, the second set of ion channels—the potassium channels—is considered to be input independent and noise free. Thus, a fixed proportion of potassium channels is kept open, regardless of the size of the input. The input signal specifies the voltage output signal by directly determining the probability that sodium channels are open. Thus, a given input value x, presented in a small time interval Δt, will result in k open sodium channels. Note that because of channel noise, the number of open sodium channels k will vary with different presentations of the same input. The state of the model system, on the other hand, is given by the number of open sodium channels k and translates uniquely into a voltage output if we neglect the influence of membrane capacitance. Conversely, if we know the
output voltage V, we can directly infer the number of open sodium channels k. Both variables are therefore equivalent from an information-theoretic point of view. By working in the channel domain (i.e., by taking the number of open sodium channels k as a measure of the system's output), we can simplify our analysis and avoid the nonlinear relationship of conductance, current, and voltage. Note that to achieve the linear relationship between the input x and the (average) output k, the channels are not voltage sensitive. For simplicity, we do not study the effects of membrane capacitance on the time course of the voltage response, assuming that the signal variation is slow in comparison to the timescale of the effects introduced by membrane capacitance. Nor do we analyze the effects of channel kinetics. By working at a fundamental level, the mapping of the input onto discrete molecular events, we can investigate a simple model of general validity.

3 Calculating the Metabolic Efficiency
We define the metabolic efficiency of a signaling system as the ratio between the amount of transmitted information I and the metabolic energy E required. Efficiency I/E thus can be expressed in bits per energy unit. (For other efficiency measures, see section A.3.) Both information transfer I and metabolic energy (or cost) E depend on the number of available signaling units—the number of channels, N. To investigate the relationship between efficiency and the number of stochastic units, we drive the model system with a fixed distribution of inputs, p_x(x), and vary N, the size of the population of sodium channels used for signaling. This is equivalent to changing the channel density of our model membrane while maintaining the same input. To ensure that systems with different values of N, the number of sodium channels, produce the same mean voltage output in response to a given input, x, the population of potassium channels, N_K, is set to a constant proportion of N. Under these conditions, we can now calculate how the information transmitted, I, and the energy consumed, E, vary for different numbers of channels.

3.1 Information Transfer. Consider the transmission of a single signal. The model system receives an input, x, drawn from the input distribution, p_x(x), and produces a single output, k. According to Shannon, the information transmitted by the system is given by

$$I[N; p_x(\cdot)] = \sum_{k=0}^{N} \int_0^1 dx\, p(k \mid x)\, p_x(x) \log_2 \left[ \frac{p(k \mid x)}{p_k(k)} \right], \tag{3.1}$$

and depends on the input distribution p_x(x) and, via the binomial distribution of channel activation p(k | x) (see equation 2.1), also on the number
of available units, N. For a continuously signaling neuron, this is equivalent to the response during a discrete time bin of duration Δt, and I is the rate of transmission in bits/Δt. In this article, we study two different scenarios—gaussian input distributions and input distributions that maximize the information transfer—both in the presence of noise caused by stochastic units (e.g., channel noise).

3.2 Energy. The total amount of energy required for the maintenance of a signaling system and the transmission of a signal is given by

$$E[N; p_x(\cdot)] = b + \int_0^1 p_x(x)\, e(x, N)\, dx, \tag{3.2}$$
where b is the fixed cost, and e(x, N) characterizes the required energy as a function of the input x and the number of stochastic signaling units N. Thus, we classify the metabolic costs into two groups: costs that are independent of N and costs that depend on the total number of channels, N. For simplicity, we assume that the latter costs are dominated by the energy used to generate signals (in this case, restoring the current that flows through ion channels), and we neglect the energy required to maintain channels. The first group of costs, b, relates to costs that have to be met in the absence of signals, such as the synthesis of proteins and lipids. These costs are therefore called fixed costs and are constant with respect to x and N. Because we have set up our systems to produce identical mean values of the voltage output signal given x (by fixing the ratio N/N_K), the function e(x, N) is separable into the variables N and x (see section A.1),

$$e(x, N) = N \tilde{e}(x), \tag{3.3}$$

so that the signaling-related total energy consumption rises linearly with N. The function $\tilde{e}(x)$ increases monotonically between $\tilde{e}(0) = 0$ and $\tilde{e}(1) = e_{ch}$, where $e_{ch}$ denotes the energy cost associated with a single open sodium channel. (The precise form of $\tilde{e}(x)$ is derived in section A.1.) Rescaling the measurement units of energy, we will from now on set $e_{ch} = 1$. Altogether, the total energy reads

$$E[N; p_x(\cdot)] = b + \overline{\tilde{e}(x)}\, N, \tag{3.4}$$

where $\overline{\tilde{e}(x)}$ is the average signal-related energy requirement of one stochastic unit and the average is taken with respect to the input distribution p_x(x). In the first part of the analysis, where we analyze energy-efficient numbers of channels, we make the simplifying assumption that the average cost per channel, $\overline{\tilde{e}(x)}$, is approximately equal to the mean of the input, $\bar{x}$. Note that the energy E is defined as a measure of cost for one time unit Δt, just as I is the measure of information transfer in Δt.
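Equations 3.1 and 3.4 can be evaluated numerically by replacing the integral over x with a sum over a discrete grid. The sketch below is an illustrative Python implementation; the grid resolution, the truncated-gaussian weights, and the simplified per-channel cost ⟨ẽ(x)⟩ ≈ x̄ from the text are assumptions of this example, not prescriptions of the paper.

```python
# Numerical sketch of equations 3.1 (information) and 3.4 (energy),
# with the integral over x replaced by a sum over a discrete grid.
import math

def binom_pmf(N, k, x):
    """p(k|x) of equation 2.1."""
    return math.comb(N, k) * x**k * (1 - x)**(N - k)

def mutual_information(N, xs, px):
    """Equation 3.1: I[N; p_x] in bits, for input values xs with weights px."""
    pk = [sum(binom_pmf(N, k, x) * p for x, p in zip(xs, px))
          for k in range(N + 1)]
    I = 0.0
    for k in range(N + 1):
        for x, p in zip(xs, px):
            pkx = binom_pmf(N, k, x)
            if pkx > 0.0 and p > 0.0:
                I += pkx * p * math.log2(pkx / pk[k])
    return I

def energy(N, xs, px, b):
    """Equation 3.4 under the simplifying assumption: mean cost per channel = x̄."""
    xbar = sum(x * p for x, p in zip(xs, px))
    return b + xbar * N

# Discretized gaussian-like input: mean 0.5, s.d. 0.16 (values used in Figure 2).
M = 41
xs = [i / (M - 1) for i in range(M)]
w = [math.exp(-0.5 * ((x - 0.5) / 0.16) ** 2) for x in xs]
px = [wi / sum(w) for wi in w]

N, b = 200, 500
I = mutual_information(N, xs, px)
print(I, I / energy(N, xs, px, b))  # information (bits) and efficiency I/E
```

For these parameters, the exact binomial computation lands close to the gaussian approximation of equation 4.2 below, around two bits per time bin.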
4 Gaussian Input Distributions

4.1 Information Transfer. We focus on gaussian inputs first, because according to the central limit theorem, they are a reasonable description of signals composed of many independent subsignals, and they also allow an analytic expression of information transfer. To confine the bulk of the gaussian distributions within the input interval [0, 1], the mean, $\bar{x}$, and the variance, $\sigma_x^2$, are chosen such that the distance from the mean to the interval borders is always larger than $3\sigma_x$. Values falling outside the interval [0, 1] are cut off, and the distribution is then normalized to unity. Numerical simulations (see section A.2) show that the effects of this procedure on the results are negligible. The information transfer I (Shannon & Weaver, 1949) per unit time for a linear system with additive gaussian noise and gaussian inputs is given by

$$I = \frac{1}{2} \log_2 (1 + \mathrm{SNR}), \tag{4.1}$$

where SNR denotes the signal-to-noise ratio. It is defined as the ratio between the signal variance, $\sigma_x^2$, and the effective noise variance, $\sigma_{k|x}^2 / N^2$. If two criteria are met—first, the binomial noise $\eta(N, x)$ can be approximated by a gaussian and, second, within the regime of most likely inputs x, changes in the noise variance $\sigma_{k|x}^2$ are negligible—the following equation gives a reasonably good approximation of the information transfer of our model system:

$$I = \frac{1}{2} \log_2 \left( 1 + \frac{N \sigma_x^2}{\bar{x}(1 - \bar{x})} \right). \tag{4.2}$$

This is the case for large N and a restriction to the gaussian inputs described above. Numerical tests (see section A.2) show that the deviation between the real information transfer with N stochastic signaling units and the information transfer given by equation 4.2 is very small.

4.2 Efficiency. For the efficiency, defined as I/E, we obtain the following expression:

$$\frac{I}{E} = \frac{1}{2(\bar{x}N + b)} \log_2 \left( 1 + \frac{N \sigma_x^2}{\bar{x}(1 - \bar{x})} \right). \tag{4.3}$$
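Equation 4.3 can be evaluated directly. The short scan below is an illustrative Python sketch; the parameter values follow Figure 2, and the closed-form gaussian approximation (rather than the exact binomial information) is assumed.

```python
# Scan equation 4.3 over N to find the energy-efficient number of channels.
import math

def efficiency_gaussian(N, b, xbar=0.5, sx=0.16):
    """I/E of equation 4.3 for a gaussian input with mean xbar and s.d. sx."""
    snr = N * sx**2 / (xbar * (1 - xbar))
    return 0.5 * math.log2(1 + snr) / (xbar * N + b)

def optimal_N(b, Nmax=20000):
    """Brute-force argmax of I/E; adequate for a one-dimensional scan."""
    return max(range(1, Nmax), key=lambda N: efficiency_gaussian(N, b))

for b in (500, 2000, 5000):
    print(b, optimal_N(b))  # the optimum shifts to larger N as b grows
```

Running the scan reproduces the qualitative behavior described below: the most efficient N grows with the fixed cost b, and the efficiency curve falls off steeply below the optimum but only gradually above it.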
In Figure 2, efficiency curves I/E are depicted as a function of N for three different values of fixed costs b. The energy efficiency exhibits an optimum for all curves. Signal transmission with numbers of channels N within the range of the optimum is especially energy efficient. The position of the optimum depends strongly on the size of the fixed cost b relative to the average cost of a single channel, $\overline{\tilde{e}(x)} \approx \bar{x}$. Figure 2 displays the dependence of the
Figure 2: (A) Efficiency I/E as a function of N for three different fixed costs. From top to bottom, the fixed cost b is equivalent to the cost of 500, 2000, and 5000 open channels. The optimal N shifts to larger N with rising fixed cost b (see also the inset), and the efficiency curves become wider. (B) Efficiency curves for fixed costs of b = 50, b = 500, and b = 5000 rescaled, so that the maximum corresponds to (1,1). The inset shows the same data on a logarithmic scale. These rescaled curves, (I/E)_n, are similar in shape, independent of the size of b. For all graphs, the parameters of the input distribution p_x(x) are $\sigma_x = 0.16$ and $\bar{x} = 0.5$.
optimal number of channels N on the fixed cost b. The most efficient number of channels increases approximately linearly with the size of the fixed cost, although close inspection reveals that the slope is steeper for smaller fixed costs and shallower in the range of higher b. If b = 0, the most efficient system capable of transmitting information uses one channel. The average costs of the most energy-efficient population of channels, employing $N_{opt}$ channels, are given by $N_{opt}\bar{x}$. Therefore, the ratio between these signaling costs at $\bar{x} = 0.5$ and the fixed cost b is approximately 1:4 in the example depicted. This ratio, however, which can be derived from the slope of the
b–$N_{opt}$ curve in the inset of Figure 2A, strongly depends on the input distribution. The analysis for gaussian input distributions shows that the ratio of $N_{opt}$ to b increases with decreasing input variance, $\sigma_x^2$ (results not shown). Remarkably, the efficiency curves rise very steeply for small N and, after passing the optimum, decrease only gradually. This characteristic does not strongly depend on the size of b, as shown in Figure 2B. It is thus very uneconomical to use a number of channels sufficiently far below the optimum, whereas above, there is a broad range of high efficiency. In this range, a cell could adjust its number of channels to the amount of information needed (e.g., add more channels), without losing much efficiency. However, as the inset to Figure 2B indicates, increasing the number of channels by a given factor has a similar effect on efficiency as decreasing them by the same factor.

5 Additional Noise Sources
The most energy-efficient number of channels is influenced by the size of additional noise. This might be noise added to the input (additive input noise) or internal noise generated within the signaling system independent of the activation of sodium channels (input-independent internal noise).

5.1 Additive Input Noise. If the input contains gaussian noise of a fixed variance $\langle \eta_x^2 \rangle$ that is not correlated with the signal, the variances of the signal and the noise add, yielding the modified SNR at the output

$$\mathrm{SNR} = \frac{N \sigma_x^2}{N \langle \eta_x^2 \rangle + \bar{x}(1 - \bar{x})}. \tag{5.1}$$

The additional noise $\eta_x$ decreases the SNR and, consequently, the information transfer. For N → ∞, the SNR converges to $\sigma_x^2 / \langle \eta_x^2 \rangle$, which thus sets an upper limit to the information transfer. Figure 3 shows that it is more efficient to operate at lower numbers of channels in the presence of additional signal noise.

5.2 Input-Independent Internal Noise. To exemplify internal noise, we consider an additional population of sodium channels that is not influenced by the input x but rather has a fixed open probability $p_0$, though it contributes to the noise. Assuming a fixed ratio $N/N'$ between the total number of the original input-dependent sodium channels N and the total number of these new input-independent sodium channels $N'$, the voltage output is determined by the sum of open channels from both populations, $k + k'$. Because noise from both populations is uncorrelated, the SNR reads

$$\mathrm{SNR} = \frac{N^2 \sigma_x^2}{N^2 \langle \eta_x^2 \rangle + N \bar{x}(1 - \bar{x}) + N' p_0 (1 - p_0)}, \tag{5.2}$$
Figure 3: Efficiency I/E as a function of the number of signaling units, N, for the case of a noisy input signal (solid line). For comparison, we also replot the case where there is no input noise (dotted line), as in Figure 2A. For both curves, the variance of the input distribution equals $\sigma_x^2 = 0.16^2$, $\bar{x} = 0.5$, and b = 5000. The noise variance of the signal is $\langle \eta_x^2 \rangle = 0.04^2$.
where $N' p_0 (1 - p_0)$ represents the noise variance of the input-independent population of channels. The efficiency of the signaling system is thus further decreased for all N, and the efficiency optimum, $N_{opt}$, is shifted to lower values. Other noise sources, such as leak channels and additional synaptic inputs, will also lower the SNR. Therefore, they will reduce efficiency and influence the optimum number of channels.

6 Efficiency in Systems Combining Several Mechanisms
Signaling mechanisms do not act in isolation; they are usually organized into systems in which one mechanism drives another, either within a cell or between cells. The relationship between information transfer and cost of each mechanism determines optimal patterns of investment in signaling units across the whole system, as we will demonstrate with some simple examples. First, consider two signaling mechanisms in series (see Figure 4). Cell 1 uses $N_1$ channels to convert the input x into the output $k_1$, which, in turn, drives cell 2, which uses $N_2$ channels, to produce the output $k_2$. From equation 2.3, we know that $k_1 = N_1 x + \eta_1$, where $N_1 x$ is the signal and $\eta_1$ is the noise generated by the random activity of channels. Because we define an input in terms of a probability distribution of signals, ranging from 0 to 1, the output $k_1$ of cell 1 should be normalized by $N_1$, so that the input to cell 2 is $k_1 / N_1$. Note that, for simplicity, we are neglecting nonlinearities in signal transfer within a cell, as, for example, in neurotransmitter release. As a consequence, the mean open probability in both cells is the same, but its variance differs. The output of cell 2 is given by $k_2 = N_2 x_2 + \eta_2$, where $\eta_2$
Energy-Efficient Coding
1333
[Figure 4 schematic: input x → Cell 1 ($N_1$ channels, channel noise $g_1$) → output $k_1$; $x_2 = k_1 / N_1$ → Cell 2 ($N_2$ channels, channel noise $g_2$) → output $k_2$; $x_3 = k_2 / N_2$ → ...]

Figure 4: Schematic view of the cell arrangement. The normalized output of cell 1 serves as input for cell 2. Both cells are subject to channel noise $g_1$ and $g_2$, respectively.
is the additive channel noise of cell 2. Therefore, the information transfer from an input signal x, with mean $\bar{x}$ and variance $\sigma_x^2$, to the output $k_2$ can be approximated by Shannon's formula as

$$I = \frac{1}{2} \log_2\left(1 + \frac{N_1 N_2 \, \sigma_x^2}{(N_1 + N_2)\, \bar{x}(1 - \bar{x})}\right). \qquad (6.1)$$
The cost of information transfer through the two-cell system is

$$E = e_{ch1}\,\bar{x} N_1 + e_{ch2}\,\bar{x} N_2 + b_1 + b_2, \qquad (6.2)$$
where $b_1$ and $b_2$ are the fixed metabolic costs of the cells and $e_{ch1/2}\,\bar{x}$ the costs per open channel. If we introduce an effective number of channels, $N_{\mathrm{eff}}^{i} = N_1 N_2 / (N_1 + N_2)$ for the information transfer and $N_{\mathrm{eff}}^{E} = N_1 + N_2$ for the metabolic cost, the equations for I and E correspond to those of the single-cell case (equations 4.2 and 3.4, respectively). For simplicity, the cost per open channel is set to unity for both cells. Because $N_{\mathrm{eff}}^{E} \ge N_{\mathrm{eff}}^{i}$ for all nonnegative $N_1$ and $N_2$, the information transfer increases more slowly with the number of channels than the cost, cutting down efficiency. Thus, a two-cell system requires more channels to transmit the same amount of information and is therefore less efficient than a single cell, even if the fixed cost of a cell in the two-cell model is only half the cost of the single cell. Consequently, in a metabolically efficient nervous system, serial connections should be made only when signal processing is required.
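This comparison can be illustrated with a short numerical sketch (the function names and example budgets are ours; the information estimate follows the Shannon approximation of equation 6.1 and the costs follow equation 6.2, with unit channel cost):

```python
import math

def info_bits(n_eff, var_x=0.16**2, xbar=0.5):
    # Shannon approximation: effective channel noise variance is xbar*(1-xbar)/n_eff
    return 0.5 * math.log2(1 + n_eff * var_x / (xbar * (1 - xbar)))

def efficiency_single(n, b=100, xbar=0.5):
    # single cell: I / E with signaling cost xbar*n and fixed cost b
    return info_bits(n) / (xbar * n + b)

def efficiency_serial(n1, n2, b1=50, b2=50, xbar=0.5):
    # two cells in series: N_eff^i = n1*n2/(n1+n2), N_eff^E = n1+n2 (eqs. 6.1, 6.2)
    n_eff = n1 * n2 / (n1 + n2)
    return info_bits(n_eff) / (xbar * (n1 + n2) + b1 + b2)

# Even with the fixed cost split between the two cells, the serial chain is
# less efficient than a single cell using the same total number of channels.
print(efficiency_single(200) > efficiency_serial(100, 100))  # True
```

With 200 channels in total, the serial chain's effective N for information is only 50, so its efficiency is well below that of the single cell, matching the argument above.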
1334
S. Schreiber, C. K. Machens, A. V. M. Herz, and S. B. Laughlin
Furthermore, in an energy-efficient system, the cost of one mechanism influences the amount of energy that should be consumed by another. Such considerations are important when one set of signaling events is more costly than another. This can be demonstrated by incorporating the cost of an upstream mechanism into the fixed cost of a downstream mechanism (i.e., the cost of a mechanism includes the cost of providing it with a signal). Here we can define E as

$$E = e_{ch2}\,\bar{x} N_2 + b_2^{*} \quad\text{with}\quad b_2^{*} = e_{ch1}\,\bar{x} N_1 + b_1 + b_2. \qquad (6.3)$$

As we have seen with one mechanism, an increase in fixed cost raises the optimal number of channels. Therefore, when the cost of an upstream mechanism is high, one improves energy efficiency by using more units downstream. The precise pattern of investment required for optimal performance will depend on the relative costs (fixed and signaling) of every mechanism (e.g., channels) and the characteristics of the input signal.

7 Limits to the Achievable Efficiency
Information transfer and efficiency depend on the distribution of input signals. In the previous sections, we have considered gaussian inputs. We now calculate the input distribution that maximizes information transfer I given a limited energy budget E and a particular number of channels N. The efficiency $I/E$ reached gives an upper bound on the efficiency the system can achieve for given E and N. Although the nervous system has less influence on the distribution of external signals, it is able to shape its internal signals to optimize information transfer (Laughlin, 1981). The optimal input distribution, $p_x^{\mathrm{opt}}(x)$, and the maximum information transfer (the information capacity $C_N$) of a system with N stochastic units can be obtained by the Blahut-Arimoto algorithm (Arimoto, 1972; Blahut, 1972), which is described in further detail in section A.4. Given the noise distribution $p(k|x)$, the algorithm yields a numerical solution to the optimization of the input distribution $p_x(x)$, maximizing the information transfer and minimizing the metabolic cost. This algorithm has been applied by Balasubramanian et al. (2001) and de Polavieja (in press) to study the metabolic efficiency of spike coding.

7.1 Optimal Input Distributions. For a given number of channels N, a given fixed cost b, and a given cost function depending on the input, $\tilde{e}(x)$, the energy $E = N\tilde{e}(x) + b$ used by the system depends exclusively on the input distribution $p_x(x)$. If the available energy budget is sufficiently large, energy constraints do not influence the shape of the optimal input distribution, and the information capacity $C_N$ of the system reaches its maximum (see Figure 5E, point A). The optimum input distribution turns out to be symmetrical, with inputs from the midregion around $x = 0.5$ drawn less often, whereas inputs at the boundaries 0 and 1 are preferred (see Figure 5A).

Figure 5: (A)–(D) Optimal input distributions for different energy budgets E with $b = 0$ and $N = 100$. The distributions were calculated numerically and are discretized to a resolution $\Delta x = 0.01$. All distributions show small values for inputs around $x = 0.5$, where the noise variance $\sigma_{k|x}^2$ is highest. With decreasing energy budget, the distributions become less symmetrical, preferring low inputs. (E) Information capacity $C_{N=100}$ depending on the energy budget (for $b = 0$). The points mark the location of the input distributions shown in A–D in the energy-capacity space.

This result is very intuitive when we take into account the dependence of the noise variance, $\sigma_{k|x}^2$, as defined in equation 2.2, on the input x. The noise variance is symmetrical as well, showing a maximum at $x = 0.5$ and falling off toward $x = 0$ and $x = 1$. Thus, an input distribution that is optimal from the point of view of information transfer, leaving metabolic considerations aside for a moment, favors less noisy input values over noisier ones.
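The cost-penalized Blahut-Arimoto iteration for this binomial channel can be sketched in a few lines. This is a minimal illustration, not the authors' code: the per-use cost is simplified to $\tilde{e}(x) = x$ rather than equation A.8, and the grid size and iteration count are arbitrary choices of ours.

```python
import numpy as np
from math import lgamma

def blahut_arimoto(N=50, s=0.0, n_x=101, iters=300):
    """Input distribution maximizing I - s*E for a binomial channel (sketch)."""
    x = np.linspace(1e-4, 1 - 1e-4, n_x)
    k = np.arange(N + 1)
    log_binom = np.array([lgamma(N + 1) - lgamma(j + 1) - lgamma(N - j + 1) for j in k])
    log_pkx = log_binom + np.outer(np.log(x), k) + np.outer(np.log1p(-x), N - k)
    pkx = np.exp(log_pkx)                 # channel p(k|x); rows sum to 1
    cost = x                              # simplified per-use cost e(x) = x
    px = np.full(n_x, 1.0 / n_x)          # start from a uniform input distribution
    for _ in range(iters):
        qk = px @ pkx                     # current output distribution q(k)
        dkl = np.sum(pkx * (log_pkx - np.log(qk)), axis=1)  # D(p(.|x)||q), nats
        px *= np.exp(dkl - s * cost)      # cost-penalized Blahut-Arimoto update
        px /= px.sum()
    qk = px @ pkx
    bits = np.sum(px * np.sum(pkx * (log_pkx - np.log(qk)), axis=1)) / np.log(2)
    return x, px, bits

# Without an energy constraint (s = 0) the optimal input distribution avoids the
# noisy midregion around x = 0.5 (cf. Figure 5A); raising s shifts probability
# mass toward the cheap low-x inputs (cf. Figures 5B-5D).
x, px, cap = blahut_arimoto(s=0.0)
```

Running the sketch with increasing s reproduces the qualitative trend of Figures 5A to 5D: the distribution loses its symmetry and concentrates near low open probabilities.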
Figure 6: (A) Capacity $C_{N=100}$ (of Figure 5E) as a function of energy E for a fixed cost $b = 20$ (solid line). The capacity curve is simply shifted along the energy axis by the value of b. The efficiency $C_N / E$ as a function of E is shown as a dash-dotted line. The maximum efficiency $(C_N / E)_{\max}$ is given at the point of the $C_N(E)$ curve whose tangent (dashed line) intersects with the origin. (B) Achievable efficiency $(C_N / E)_{\max}$ as a function of N for a fixed cost $b = 200$ (solid line). For comparison, the efficiency $I/E$ for a gaussian input distribution ($\bar{x} = 0.5$, $\sigma_x = 0.16$) is also shown (dashed line). The inset depicts the optimal number of channels $N_{\mathrm{opt}}$ as a function of b.
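The trend shown in the Figure 6B inset, a growing $N_{\mathrm{opt}}$ with rising fixed cost, can be illustrated with a toy calculation. This sketch uses the gaussian-input approximation of equation 4.2 rather than the Blahut-Arimoto optimal inputs, and the parameter values are ours:

```python
import math

def efficiency(n, b, var_x=0.16**2, xbar=0.5):
    # I / E: Shannon approximation of eq. 4.2 over cost with signaling term xbar*n
    info = 0.5 * math.log2(1 + n * var_x / (xbar * (1 - xbar)))
    return info / (xbar * n + b)

def n_opt(b, n_max=5000):
    # brute-force search for the most efficient channel number at fixed cost b
    return max(range(1, n_max), key=lambda n: efficiency(n, b))

# a system with a higher fixed cost is best served by more channels
print(n_opt(50), n_opt(500))
```

The brute-force search is adequate here because the efficiency curve is unimodal in N over the range considered.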
Limiting the energy that can be used by a system of N units, however, changes the optimal input distribution, $p_x(x)$, and destroys the symmetry. As we reduce the energy budget, values neighboring $x = 0$ are increasingly preferred and the costly values approaching $x = 1$ are avoided (see Figures 5B–5D). This asymmetry reduces the information capacity $C_N$. Thus, efficient use of a restricted budget requires a cell to keep most of its units deactivated. For our simple model, this is equivalent to maintaining the membrane close to resting potential by keeping most of its sodium channels closed. The fixed cost, b, is a metabolic cost independent of the value of the input x and cannot be avoided. Consequently, it does not influence the shape of the energy-capacity curve of a system. Adding the fixed cost b results merely in a horizontal translation of the energy-capacity curve (see Figure 6A). Here as well, the shape of the input distribution changes with the value of E.

7.2 Efficiency. Having obtained the dependence of the information capacity on the energy used for a given N, we can also derive energy efficiency $C_N / E$ as a function of the energy E. Of particular interest is the maximum value $(C_N / E)_{\max} = \max_E \{C_N / E\}$, giving the optimal efficiency for a fixed value of N, which is achieved by a specific input distribution $p_x(x)$. Note that this efficiency gives the upper bound to the achievable efficiency in our system and therefore cannot be surpassed by any other input distribution $p_x(x)$.
At the maximum of $C_N / E$, the first derivative with respect to E is zero,

$$\frac{\partial}{\partial E}\left(\frac{C_N}{E}\right) = 0, \qquad (7.1)$$

which can be transformed to give

$$\frac{\partial C_N}{\partial E} \cdot E = C_N. \qquad (7.2)$$
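The tangent condition can be checked numerically on an illustrative concave capacity curve. Here we assume the toy curve $C(E) = \ln(1 + E - b)$ for $E \ge b$, which is our choice for the sketch, not the model's actual $C_N(E)$:

```python
import numpy as np

b = 20.0                                  # fixed cost shifts the curve (cf. Fig. 6A)
E = np.linspace(b + 0.1, 100.0, 100_000)
C = np.log(1.0 + E - b)                   # toy concave capacity curve
i = int(np.argmax(C / E))                 # point of maximum efficiency C/E
dCdE = 1.0 / (1.0 + E[i] - b)             # analytic slope dC/dE at that point
# at the optimum the tangent passes through the origin: dC/dE * E equals C
print(round(dCdE * E[i], 3), round(C[i], 3))
```

The two printed values agree to within the grid resolution, confirming equation 7.2 for this curve.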
Thus, geometrically, the maximum value $(C_N / E)_{\max}$ corresponds to the point on the capacity graph whose tangent intersects the origin (see Figure 6A). The slope of the tangent is given by $(C_N / E)_{\max}$ itself, so that the optimal efficiency $(C_N / E)_{\max}$ decreases with increasing b, as can be inferred from Figure 6A by shifting the $C_N(E)$ curve to the right. Figure 6B shows the optimal efficiency $(C_N / E)_{\max}$ as a function of the number of channels N for a fixed cost $b = 200$. For comparison, we also show the efficiency $I/E$ obtained for a gaussian input distribution. The optimal input distribution surpasses the gaussian distribution roughly by a factor of two in this case. In conclusion, as with gaussian input distributions, we can derive the most efficient number of channels $N_{\mathrm{opt}}$ as a function of the fixed cost. The use of optimal input distributions reduces $N_{\mathrm{opt}}$ but, as with gaussian inputs, $N_{\mathrm{opt}}$ rises approximately linearly with the basic cost, b, as illustrated in the inset of Figure 6B.

8 Discussion
A growing number of studies of neural coding suggest that the consumption of metabolic energy constrains neural function (Laughlin, 2001). Comparative studies indicate that the mammalian brain is an expensive tissue whose evolution, development, and function have been shaped by the availability of metabolic energy (Aiello & Wheeler, 1995; Martin, 1996). The human brain accounts for about 20% of an adult's resting metabolic rate. In children, this proportion can reach 50%, and in electric fish 60% (Rolfe & Brown, 1997; Nilsson, 1996). Because much of this energy is used to generate and transmit signals (Siesjö, 1978; Ames, 2000), these levels of consumption have the potential to constrain neural computation by placing an upper bound on synaptic drive and spike rate (Attwell & Laughlin, 2001). Current evidence suggests that the advantages of an energy-efficient nervous system are not confined to larger animals with highly evolved brains. In general, the specific metabolic rate of brains is 10 to 30 times the average for the whole animal at rest (Lutz & Nilsson, 1994). In addition, for both vertebrates and insects, the metabolic demands of the brain are more acute in smaller animals because the ratio of brain mass to total body mass decreases as body mass increases (Martin, 1996; Kern, 1985). Moreover, insect species with similar body masses exhibit large differences in the total mass of the brain and in the masses or relative volumes of different brain areas, and these differences correlate with behavior and ecology (Kern, 1985; Gronenberg & Liebig, 1999; Laughlin et al., 2000). Significant changes have been observed among individuals of a single species. In the ant Harpegnathos, the workers are usually visually guided foragers. However, when young workers are inseminated and begin to lay eggs following the death of their queen, their optic lobes are reduced by 20% (Gronenberg & Liebig, 1999). These observations suggest that the reduction of neural energy consumption is also a significant factor in the evolution of small brains. The relationship between energy consumption and the ability of neurons to transmit information suggests that nervous systems have evolved a number of ways of increasing energy efficiency. These methods include redundancy reduction (Laughlin et al., 1998), the mix of analog and digital operations found in cortical neurons (Sarpeshkar, 1998), appropriate distributions of interspike intervals (Baddeley et al., 1997; Balasubramanian et al., 2001; de Polavieja, 2001), and distributed spike codes (Levy & Baxter, 1996). We have extended these previous theoretical investigations to a level that is more elementary than the analysis of signaling with spikes: the representation of information by populations of ion channels. Thus, as independently advocated by Abshire and Andreou (2001), we have analyzed the energy efficiency of information transmission at the level of its implementation by molecular mechanisms. We have estimated the amount of information transmitted by a population of voltage-insensitive channels when their open probability is determined by a defined input. These channels typify the general case of signaling with stochastic events.
Consequently, our analysis is also applicable to many other forms of molecular signaling (e.g., the binding of a ligand to a receptor) and to synaptic transmission (e.g., the release of a synaptic vesicle according to Poisson or binomial statistics). Our theoretical results verify a well-known trend: the amount of information carried by a population of channels increases with the size of the population because random fluctuations are averaged out, as observed in blowfly photoreceptors and their synapses (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Laughlin et al., 1998; Abshire & Andreou, 2000) and demonstrated by models of synaptic transmission to cortical neurons (Zador, 1998). However, increasing the number of channels in the population increases both the level of redundancy and the energy used for transmission, leading to changes in metabolic efficiency (Laughlin et al., 1998). Following Levy and Baxter (1996), we have chosen to discuss efficiency as the ratio between the number of bits of information transmitted and the energy consumed. However, our analysis also provides a mathematical framework to describe energy efficiency from the more general point of view of maximizing information transfer and minimizing the metabolic cost
(maximizing $I - sE$, where s describes the importance of energy minimization over information maximization), as briefly outlined in section A.3. We distinguish two energy costs: the cost of generating signals and the fixed cost of keeping a signaling system in a state of readiness. The signaling cost is derived from the current flow through ion channels. Under the assumptions of the model, the signaling cost increases with the number of channels. This simple linear relationship can be easily applied to other forms of signaling, such as protein phosphorylation or the turnover of transmitter and second messenger. In the absence of data on the costs of constructing and maintaining a population of channels in a membrane, we again follow Levy and Baxter (1996). The fixed cost is expressed in the same units as the signaling cost and is varied to establish its effect on efficiency. The analysis demonstrates that energy efficiency is maximized by using a specific number of channels. These optima depend on a number of important factors: the fixed cost of the system, the cost of signaling, the reliability of the input, the amount of noise generated by other intrinsic mechanisms, the cost of upstream and downstream signaling mechanisms, and the distribution of the input signals provided by upstream mechanisms. Each of these factors is involved in neural processing. The fixed cost of building and maintaining the cell in a physiological state within which the ion channels operate is a dominant factor. When the fixed cost increases, the optimum system increases the energy invested in signaling by increasing the number of channels (see Figure 2A). Levy and Baxter (1996) discovered the same effect in their analysis of energy-efficient spike trains.
This may well be a general property of energy-efficient systems because when a signaling system is expensive to make and maintain (i.e., the ratio between fixed and signaling costs is high), it pays to increase the return on the fixed investment by transmitting more bits. This takes more channels and more signaling energy. For the example shown, which operates with a broad input distribution and a mean open probability of 50%, the optimum population of channels has a peak energy consumption (all channels open) that is approximately half the fixed cost (see Figure 2A). The relationship between the energy spent on signaling by the channels and the fixed cost varies with the distribution of inputs. For input distributions that make a reasonably broad use of possible open probabilities, the ratio between signaling costs and fixed costs lies approximately in the range between 1:4 and 1:1. It is difficult to judge whether populations of neuronal ion channels follow this pattern because data about the ratio of fixed costs to signaling costs in single cells are not available. However, in more complicated systems, the proportion of energy devoted to signaling is in the predicted range. For the whole mammalian brain, signaling has been linked to approximately 50% of the metabolic rate (Siesjö, 1978), and in cortical gray matter this rises to 75% (Attwell & Laughlin, 2001). We are aware that there are a number of additional factors, not accounted for by our model, that will influence the ratio of signaling costs to fixed costs
in nervous systems. For example, our analysis underestimates the total energy usage by the brain because it is confined to a single operation: the generation of a voltage signal by channels. Within neural systems, signals must be transmitted over considerable distances, and various computations must be performed. These extra operations take extra energy. Along these lines, the transmission of signals along axons, in the form of action potentials, accounts for over 40% of the signaling cost of cortical gray matter (Attwell & Laughlin, 2001). Noise at the input reduces the optimum number of channels (see Figure 3) because it reduces the effect of channel fluctuations on the total noise power. There is some evidence that nervous systems reduce the number of signaling events in response to a less reliable input. In the blowfly retina, the SNR of chromatic signals is lower than that of achromatic signals, and a class of interneurons involved in the chromatic pathway uses fewer synapses than comparable achromatic interneurons (Anderson & Laughlin, 2000). Considering a chain of signaling mechanisms allows us to study networks where the output from one population of channels defines the SNR at the input of the next. As a result, when analog signals are transferred from one mechanism to another, the noise accumulates stage by stage (Sarpeshkar, 1998). The chain describes how this buildup of noise reduces metabolic efficiency. Given this reduction, an energy-efficient system should connect one population of channels (or synapses) to another in a serial manner only when information is actually processed, not when it is merely transmitted. Where signals must be repeatedly amplified to avoid attenuation, pulses should be used to resist the buildup of analog noise (Sarpeshkar, 1998). These design strategies are hypothetical.
The energy savings that are made by restricting the number of serial or convergent analog processes and converting analog signals to spikes (Sarpeshkar, 1998; Laughlin et al., 2000) have yet to be demonstrated in a neural circuit. Our analysis suggests that when transmission and processing involve several types of signaling events (e.g., calcium channels, synaptic vesicles, and ligand-gated channels at a synapse), it is advantageous to use more of the less expensive events and fewer of the more expensive. This distribution is analogous to the pattern of investment in the bones of a mammal's leg. Proximal bones are thicker than distal bones because they are less costly per unit mass (they move less during a stride). The thickening of distal bones is adjusted to optimize the ratio between the probability of fracture and cost for the limb as a whole (Alexander, 1997). Finally, the probability distribution of the input signal has a large effect on efficiency. On an evolutionary timescale, as well as on the timescale of physiological adaptation processes, the way an external signal with specific statistical properties is transmitted could therefore be optimized by mapping the external signal distribution onto an efficient distribution of probabilities of channels to be open (which we call the input distribution). More
importantly, the internal signals passed on from one mechanism to the next could be shaped such that the signal distributions employed will enhance the efficiency of information transfer. The Blahut-Arimoto algorithm yields input distributions that optimize the amount of information transferred under a cost constraint and has been successfully applied to spike codes (Balasubramanian et al., 2001; de Polavieja, 2001). Our application shows how inputs can be mapped onto the probabilities of activating signaling molecules to maximize the metabolic efficiency of analog signaling. The improvement over gaussian inputs is greater than 50% and is achieved by two means. First, signaling avoids the midregion of activation probabilities where, according to binomial statistics, the noise variance is high. Second, signaling avoids the expensive situation of having a high probability of opening channels in favor of the energetically cheaper low-probability condition, similar to the way a metabolically efficient spike code avoids high rates (Baddeley et al., 1997; Balasubramanian et al., 2001). Our analysis suggests that an efficient population of sodium and potassium channels usually operates close to resting potential, with most of its sodium channels closed, but infrequently switches to opening most of its sodium channels. In other words, there is a tendency toward using a combination of numerous small signals close to a low resting potential and less frequent voltage peaks. In conclusion, we have analyzed the energy efficiency of a simple biological system that represents analog signals with stochastic signaling events, such as ion channel activation. Optimum configurations depend on the basic physics that connects information to energy (the dependency of noise, redundancy, and cost on the number of signaling events) and basic economics (the role played by fixed costs in determining optimum levels of expenditure on throughput).
Given this fundamental basis, the principles that we have demonstrated are likely to apply to other systems. In particular, we have shown that energy efficiency is a property of both the component mechanism and the system in which it operates. To assess a single population of ion channels, we had to consider the fixed cost, the distribution of signal and noise in the input, and additional noise sources. After connecting two populations of ion channels in series, we had to add the relative costs of the two mechanisms and the noise fed from one population to the next to our list of factors. Energy efficiency is achieved by matching the mechanism to its signal or, for optimum input distributions, the signal to the mechanism. Matching is a characteristic of designs that maximize the quantity of information coded by single cells, regardless of cost. To achieve this form of efficiency, neurons exploit the adaptability of their cellular and molecular mechanisms (Laughlin, 1994). The extent to which the numbers of channels and synapses used by neurons, and their transfer functions, are regulated for metabolic efficiency remains to be seen. The analysis presented here provides a starting point that can guide further experimental and theoretical work.
Appendix

A.1 Energetic Cost of Inputs. For the channel model, the average energetic cost per unit time $e(x, N)$ is a function of the input x and the number of sodium channels N. The sodium and potassium currents, $i_{Na}$ and $i_K$, depend on the reversal potentials $E_{Na}$ and $E_K$, the membrane potential V, as well as on the conductances of the sodium and potassium channels, respectively:

$$i_{Na}(x) = N g_{Na0}\, x\, (V(x) - E_{Na}), \qquad (A.1)$$
$$i_K(x) = N_K g_{K0}\, p_K\, (V(x) - E_K). \qquad (A.2)$$

The electrogenic pump extrudes three sodium ions and takes up two potassium ions for every ATP molecule hydrolyzed. The pump current $i_{pump}$ equals $i_K / 2$, assuming that the pump maintains the internal potassium concentration by accumulating potassium ions at a rate equal to the outward potassium current, $i_K$. Equating all currents across the membrane gives

$$i_{Na}(x) + i_K(x) + i_{pump}(x) = i_{Na}(x) + \frac{3}{2}\, i_K(x) = 0. \qquad (A.3)$$

The energetic cost $e(x, N)$ is proportional to the pump current, $i_{pump}$, so that we define

$$e(x, N) = c \cdot i_{pump}(x) \qquad (A.4)$$
$$= cN \cdot \frac{g_{K0}\, p_K \left(\frac{N_K}{N}\right) g_{Na0}\, (E_{Na} - E_K)\, x}{3 g_{K0}\, p_K \left(\frac{N_K}{N}\right) + 2 g_{Na0}\, x}, \qquad (A.5)$$

where c is the factor of proportionality. Because $(N_K / N)$ is constant, we can separate $e(x, N)$ into the variables N and x:

$$e(x, N) = N \tilde{e}(x). \qquad (A.6)$$

The energy function $\tilde{e}(x)$ can be written as

$$\tilde{e}(x) = C\, \frac{A B x}{A x + B}, \qquad (A.7)$$

with the constants $A = 2 g_{Na0}$, $B = 3 g_{K0}\, p_K (N_K / N)$, and $C = c\, (E_{Na} - E_K) / 6$. Because we define units of energy in this study such that $\tilde{e}(1) = e_{ch} = 1$, the rescaled energy function that is implemented in the Blahut-Arimoto algorithm reads

$$\tilde{e}(x) = \frac{(A + B)\, x}{A x + B}. \qquad (A.8)$$

It does not depend on the values of the reversal potentials $E_{Na}$ and $E_K$.
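A quick numerical sketch of the rescaled cost function of equation A.8, using the parameter values later adopted in section A.4 (the helper name and the printed check are ours, for illustration only):

```python
# parameters as in section A.4: g_Na0 = g_K0 = 20 pS, p_K = 0.5, N_K/N = 0.5
g_na0, g_k0, p_k, nk_over_n = 20e-12, 20e-12, 0.5, 0.5
A = 2 * g_na0                        # A = 2 g_Na0
B = 3 * g_k0 * p_k * nk_over_n       # B = 3 g_K0 p_K (N_K / N)

def e_tilde(x):
    # rescaled per-channel energy, eq. A.8, normalized so that e_tilde(1) = 1
    return (A + B) * x / (A * x + B)

assert abs(e_tilde(1.0) - 1.0) < 1e-12
# e_tilde(x) >= x on [0, 1]: the per-unit cost (A + B) / (A*x + B) is highest
# at low open probabilities and falls as x grows, so the function is concave
print(e_tilde(0.5))
```

As the comment notes, the normalization cancels the proportionality constant c, which is why the reversal potentials drop out of equation A.8.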
A.2 Numerical Validation. The signaling system with stochastic units operates under additive binomial noise (caused by the random activation and deactivation of the units), whose variance depends on the input x (see equation 2.2). In order to validate the approximation of the information transfer in such a model by equation 4.2, which, strictly speaking, applies only to systems with additive gaussian noise of a fixed, input-independent variance, we performed numerical tests. To this end, the input x was discretized into 1000 equispaced values between zero and one. The information transfer was then calculated according to equation 3.1. The noise distribution $p(k|x)$ was assumed to be binomial as in equation 2.1, and the output distribution $p_k(k)$ was calculated using $p_k(k) = \sum_x p(k|x)\, p_x(x)$. We found that the information transfer in the stochastic unit system is well approximated by equation 4.2, as shown in Figure 7. For very broad input distributions (e.g., $\sigma_x = 0.16$), where the approximation error is biggest, the deviations are well below 4% for $N > 100$. The relative error is significantly smaller for narrower distributions. The largest deviation below $N = 100$ occurs for $N = 1$ and is less than 11% for broad input distributions $p_x(x)$. A finer discretization into 10,000 input values gives similar results.

A.3 Efficiency. The most general approach to treat energy as a constraint is the maximization of $I - sE$, where s is a trade-off parameter between energy and information. The higher s, the tighter is the energy budget. This is a Lagrangian multiplier problem, where for a given, fixed energy, the optimal information transfer I and number of channels N have to be determined. However, both the information transfer I and the metabolic cost E depend on only one variable: the number of channels N. So if the energy is fixed, N and I are fixed too.
Thus, each point on the curve depicting energy efficiency as a function of the number of channels N (as shown in Figure 2) corresponds to the solution of the general $I - sE$ optimization problem for a particular trade-off parameter s. From these solutions, we have chosen to pay special attention to the ratio $I/E$. This approach becomes more obvious for the optimization of input distributions with the Blahut-Arimoto algorithm, where $I - sE$ is optimized explicitly before concentrating on the subset of distributions maximizing the ratio $I/E$.

A.4 Information Capacity and Blahut-Arimoto Algorithm. The Blahut-Arimoto algorithm maximizes the function

$$C[N; s] = \max_{p_x(\cdot)} \left[\, I[N; p_x(\cdot)] - s E[N; p_x(\cdot)] \,\right], \qquad (A.9)$$

where s is a trade-off parameter that gives the relative importance of energy minimization versus information maximization. The input distribution $p_x(x)$ needs to be discretized. Given an initial input distribution $p_x^0(x)$, the set of conditional probabilities $p(k|x)$, and the expense for each input
Figure 7: Numerically derived information transfer as a function of N for the model of stochastic units (filled diamonds) in comparison to the information transfer specified by Shannon's formula (gray stars) in equation 4.2. The largest relative difference $\Delta I$ is depicted for each set of inputs. The input distributions $p_x(x)$ are shown in the insets. (A) Information transfer for a broad gaussian input distribution $p_x(x)$ with $\sigma_x = 0.16$ and $\bar{x} = 0.5$ for very small numbers of channels N. (B) The same for a narrow input distribution $p_x(x)$ with $\sigma_x = 0.01$ and $\bar{x} = 0.9$. (C) Information transfer for large N and a broad input distribution with parameters as in A. (D) Information transfer for large N and the narrow input distribution used in B. Shannon's formula gives a very good approximation of the information in the channel model. The quality of the approximation increases with rising N and decreasing variance of the input distributions.
symbol, the algorithm iteratively calculates the distribution $p_x(x)$ that maximizes $C[N; s]$. In every step of the iterative algorithm, upper and lower bounds on the information capacity can be derived (Blahut, 1972), which help to estimate the quality of the current probability set. For the numerical calculations, we discretized the input ($x_j = j / 200$ with $j = 0, \ldots, 200$, $x_j \in [0; 1]$). The input distributions maximizing information transfer were calculated for $N = 10, 20, \ldots, 1000$ and 26 values of the parameter s, ranging from 0 (no energy restriction) to 20 (high energy constraint). The cost $\tilde{e}(x)$ was assumed to depend on x according to equation A.8, with $g_{Na0} = 20\,\mathrm{pS}$, $g_{K0} = 20\,\mathrm{pS}$, $p_K = 0.5$, and $N_K / N = 0.5$. The resulting 26 points in C-E space for a given N were subject to a cubic spline interpolation (Figure 5E shows four points and the respective interpolation curve for the
whole set of s; $N = 100$ here). Afterward, the energy was scaled by multiplication with N according to equation A.6. For the whole study, the fixed cost is always stated in units of the cost of one open channel (i.e., $b = 500$ corresponds to the cost of 500 open channels). The optimal number of stochastic units, $N_{\mathrm{opt}}$, for a given fixed cost b was determined by choosing, of all calculated N, the N where the efficiency $(C_N / E)_{\max}$ for that b was maximal. Since those values of N are multiples of 10, the data were subsequently fitted with a function f of the form $f(x) = a_1 x^{a_2}$ ($a_1$ and $a_2$ being parameters) in order to obtain a smooth graphical representation.

Acknowledgments
We thank Gonzalo Garcia de Polavieja for help and advice, as well as John White and Aldo Faisal for comments on the manuscript. This work is supported by the Daimler-Benz Foundation, the DFG (Graduiertenkolleg 120, Graduiertenkolleg 268, and ITB), the BBSRC, and the Rank Prize Fund.

References

Abshire, P., & Andreou, A. G. (2000). Relating information capacity to a biophysical model for blowfly photoreceptors. Neurocomputing, 32, 9–16.
Abshire, P., & Andreou, A. G. (2001). Capacity and energy cost of information in biological and silicon photoreceptors. Proceedings of the IEEE, 89(7), 1052–1064.
Aiello, L. C., & Wheeler, P. (1995). The expensive tissue hypothesis: The brain and the digestive system in human and primate evolution. Curr. Anthropol., 36, 199–221.
Alexander, R. M. (1997). A theory of mixed chains applied to safety factors in biological systems. J. Theor. Biol., 184, 247–252.
Ames, A. (2000). CNS energy metabolism as related to function. Brain Res. Rev., 34, 42–68.
Anderson, J. C., & Laughlin, S. B. (2000). Photoreceptor performance and the coordination of achromatic and chromatic inputs in the fly visual system. Vision Research, 40, 13–31.
Arimoto, S. (1972). An algorithm for computing the capacity of an arbitrary discrete memoryless channel. IEEE Trans. on Info. Theory, IT-18, 14–20.
Attwell, D., & Laughlin, S. B. (2001). An energy budget for signalling in the grey matter of the brain. J. Cereb. Blood Flow Metab., 21, 1133–1145.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B, 264, 1775–1783.
Balasubramanian, V., Kimber, D., & Berry, M. J. I. (2001). Metabolically efficient information processing. Neural Comput., 13, 799–816.
Blahut, R. E. (1972). Computation of channel capacity and rate distortion functions. IEEE Trans. on Info. Theory, IT-18, 460–473.
S. Schreiber, C. K. Machens, A. V. M. Herz, and S. B. Laughlin
Received July 23, 2001; accepted October 2, 2001.
LETTER
Communicated by Andrew Barto
Multiple Model-Based Reinforcement Learning Kenji Doya
[email protected] Human Information Science Laboratories, ATR International, Seika, Soraku, Kyoto 619-0288, Japan; CREST, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan; Kawato Dynamic Brain Project, ERATO, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan; and Nara Institute of Science and Technology, Ikoma, Nara 630-0101, Japan Kazuyuki Samejima
[email protected] Human Information Science Laboratories, ATR International, Seika, Soraku, Kyoto 619-0288, Japan, and Kawato Dynamic Brain Project, ERATO, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan Ken-ichi Katagiri
[email protected] ATR Human Information Processing Research Laboratories, Seika, Soraku, Kyoto 619-0288, Japan, and Nara Institute of Science and Technology, Ikoma, Nara 630-0101, Japan Mitsuo Kawato
[email protected]
Human Information Science Laboratories, ATR International, Seika, Soraku, Kyoto 619-0288, Japan; Kawato Dynamic Brain Project, ERATO, Japan Science and Technology Corporation, Seika, Soraku, Kyoto 619-0288, Japan; and Nara Institute of Science and Technology, Ikoma, Nara 630-0101, Japan

Neural Computation 14, 1347–1369 (2002) © 2002 Massachusetts Institute of Technology

We propose a modular reinforcement learning architecture for nonlinear, nonstationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state prediction model and a reinforcement learning controller. The "responsibility signal," which is given by the softmax function of the prediction errors, is used to weight the outputs of multiple modules, as well as to gate the learning of the prediction models and the reinforcement learning controllers. We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL was demonstrated for the discrete case
in a nonstationary hunting task in a grid world and for the continuous case in a nonlinear, nonstationary control task of swinging up a pendulum with variable physical parameters.
1 Introduction
A big issue in the application of reinforcement learning (RL) to real-world control problems is how to deal with nonlinearity and nonstationarity. For a nonlinear, high-dimensional system, the conventional discretizing approach necessitates a huge number of states, which makes learning very slow. Standard RL algorithms can perform badly when the environment is nonstationary or has hidden states. These problems have motivated the introduction of modular or hierarchical RL architectures (Singh, 1992; Dayan & Hinton, 1993; Littman, Cassandra, & Kaelbling, 1995; Wiering & Schmidhuber, 1998; Parr & Russell, 1998; Sutton, Precup, & Singh, 1999; Morimoto & Doya, 2001). The basic problem in modular or hierarchical RL is how to decompose a complex task into simpler subtasks. This article presents a new RL architecture based on multiple modules, each composed of a state prediction model and an RL controller. With this architecture, a nonlinear or nonstationary control task, or both, is decomposed in space and time based on the local predictability of the environmental dynamics. The mixture of experts architecture (Jacobs, Jordan, Nowlan, & Hinton, 1991) has been applied to nonlinear or nonstationary control tasks (Gomi & Kawato, 1993; Cacciatore & Nowlan, 1994). However, the success of such a modular architecture depends strongly on the capability of the gating network to decide which of the given modules should be recruited at any particular moment. An alternative approach is to provide each of the experts with a prediction model of the environment and to use the prediction errors for selecting the controllers. In Narendra, Balakrishnan, and Ciliz (1995), the model that makes the smallest prediction error among a fixed set of prediction models is selected, and its associated single controller is used for control. However, when the prediction models are to be trained with little prior knowledge, task decomposition is initially far from optimal.
Thus, the use of "hard" competition can lead to suboptimal task decomposition. Based on the Bayesian statistical framework, Pawelzik, Kohlmorgen, and Müller (1996) proposed the use of annealing in a "soft" competition network for time-series prediction and segmentation. Tani and Nolfi (1999) used a similar mechanism for hierarchical sequence prediction. The use of the softmax function for module selection and combination was originally proposed for a tracking control paradigm as the multiple paired forward-inverse models (MPFIM) (Wolpert & Kawato, 1998; Wolpert, Miall, & Kawato, 1998; Haruno, Wolpert, & Kawato, 1999). It was recently
Figure 1: Schematic diagram of the MMRL architecture.
reformulated as modular selection and identification for control (MOSAIC) (Wolpert & Ghahramani, 2000; Haruno, Wolpert, & Kawato, 2001). In this article, we apply the idea of a softmax selection of modules to the paradigm of reinforcement learning. The resulting learning architecture, which we call multiple model-based reinforcement learning (MMRL), learns to decompose a nonlinear or nonstationary task through the competition and cooperation of multiple prediction models and reinforcement learning controllers. In section 2, we formulate the basic MMRL architecture, and in section 3 we describe its implementation in the discrete-time and continuous-time cases, including multiple linear quadratic controllers (MLQC). We first test the performance of the MMRL architecture for the discrete case in a hunting task with multiple prey in a grid world (section 4). We also demonstrate the performance of MMRL for the continuous case in a nonlinear, nonstationary control task of swinging up a pendulum with variable physical parameters (section 5).

2 Multiple Model-Based Reinforcement Learning
Figure 1 shows the overall organization of the MMRL architecture. It is composed of n modules, each of which consists of a state prediction model and a reinforcement learning controller. The basic idea of this modular architecture is to decompose a nonlinear or nonstationary task into multiple domains in space and time so that within each of the domains, the environmental dynamics is predictable. The
action output of the RL controllers, as well as the learning rates of both the predictors and the controllers, are weighted by the "responsibility signal," which is a gaussian softmax function of the errors in the outputs of the prediction models. The advantage of this module selection mechanism is that the areas of specialization of the modules are determined in a bottom-up fashion based on the nature of the environment. Furthermore, for each area of module specialization, the design of the control strategy is facilitated by the availability of the local model of the environmental dynamics. In the following, we consider a discrete-time, finite-state environment,

$$P(x(t) \mid x(t-1), u(t-1)) = F(x(t), x(t-1), u(t-1)), \quad (t = 1, 2, \ldots), \tag{2.1}$$

where $x \in \{1, \ldots, N\}$ and $u \in \{1, \ldots, M\}$ are discrete states and actions, and a continuous-time, continuous-state environment,

$$\dot{x}(t) = f(x(t), u(t)) + \nu(t), \quad (t \in [0, \infty)), \tag{2.2}$$
where $x \in \mathbb{R}^N$ and $u \in \mathbb{R}^M$ are state and action vectors, and $\nu \in \mathbb{R}^N$ is noise. Actions are given by a policy, either a stochastic one,

$$P(u(t) \mid x(t)) = G(u(t), x(t)), \tag{2.3}$$

or a deterministic one,

$$u(t) = g(x(t)). \tag{2.4}$$
The reward $r(t)$ is given as a function of the state $x(t)$ and the action $u(t)$. The goal of reinforcement learning is to improve the policy so that more rewards are acquired in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as the "value function" $V(x)$ for each state and then to improve the policy based on the value function. We define the value function of the state $x(t)$ under the current policy as

$$V(x(t)) = E\left[\sum_{k=0}^{\infty} \gamma^k r(t+k)\right] \tag{2.5}$$

in the discrete case (Sutton & Barto, 1998) and

$$V(x(t)) = E\left[\int_0^{\infty} e^{-s/\tau} r(t+s)\, ds\right] \tag{2.6}$$

in the continuous case (Doya, 2000), where $0 \le \gamma \le 1$ and $0 < \tau$ are the parameters for discounting future reward.
2.1 Responsibility Signal. The purpose of the prediction model in each module is to predict the next state (discrete time) or the temporal derivative of the state (continuous time) based on the observation of the state and the action. The responsibility signal $\lambda_i(t)$ (Wolpert & Kawato, 1998; Haruno et al., 1999, 2001) is given by the relative goodness of predictions of multiple prediction models. For a unified description, we denote the new state in the discrete case as

$$y(t) = x(t) \tag{2.7}$$

and the temporal derivative of the state in the continuous case as

$$y(t) = \dot{x}(t). \tag{2.8}$$

The basic formula for the responsibility signal is given by Bayes' rule,

$$\lambda_i(t) = P(i \mid y(t)) = \frac{P(i)\, P(y(t) \mid i)}{\sum_{j=1}^{n} P(j)\, P(y(t) \mid j)}, \tag{2.9}$$
where $P(i)$ is the prior probability of selecting module $i$ and $P(y(t) \mid i)$ is the likelihood of model $i$ given the observation $y(t)$. In the discrete case, the prediction model gives the probability distribution of the new state $\hat{x}(t)$ based on the previous state $x(t-1)$ and the action $u(t-1)$ as

$$P(\hat{x}(t) \mid x(t-1), u(t-1)) = F_i(\hat{x}(t), x(t-1), u(t-1)) \quad (i = 1, \ldots, n). \tag{2.10}$$

If there is no prior knowledge of module selection, we take the priors as uniform ($P(i) = 1/n$), and then the responsibility signal is given by

$$\lambda_i(t) = \frac{F_i(x(t), x(t-1), u(t-1))}{\sum_{j=1}^{n} F_j(x(t), x(t-1), u(t-1))}, \tag{2.11}$$
where $x(t)$ is the newly observed state. In the continuous case, the prediction model gives the temporal derivative of the state:

$$\hat{\dot{x}}_i(t) = f_i(x(t), u(t)). \tag{2.12}$$

By assuming that the prediction error is gaussian with variance $\sigma^2$, the responsibility signal is given by the gaussian softmax function,

$$\lambda_i(t) = \frac{e^{-\frac{1}{2\sigma^2} \|\dot{x}(t) - \hat{\dot{x}}_i(t)\|^2}}{\sum_{j=1}^{n} e^{-\frac{1}{2\sigma^2} \|\dot{x}(t) - \hat{\dot{x}}_j(t)\|^2}}, \tag{2.13}$$

where $\dot{x}(t)$ is the observed state change.
2.2 Module Weighting by Responsibility Signal. In the MMRL architecture, the responsibility signal $\lambda_i(t)$ is used for four purposes: weighting the state prediction outputs, gating the learning of the prediction models, weighting the action outputs, and gating the learning of the reinforcement learning controllers:

State prediction: The outputs of the prediction models are weighted by the responsibility signal $\lambda_i(t)$. In the discrete case, the prediction of the next state is given by

$$P(\hat{x}(t)) = \sum_{i=1}^{n} \lambda_i(t)\, F_i(\hat{x}(t), x(t-1), u(t-1)). \tag{2.14}$$

In the continuous case, the predicted state derivative is given by

$$\hat{\dot{x}}(t) = \sum_{i=1}^{n} \lambda_i(t)\, \hat{\dot{x}}_i(t). \tag{2.15}$$

These predictions are used in model-based RL algorithms and also for the annealing of $\sigma$, as described later.

Prediction model learning: The responsibility signal $\lambda_i(t)$ is also used for weighting the parameter update of the prediction models. In general, this is realized by scaling the error signal of prediction model learning by $\lambda_i(t)$.

Action output: The outputs of the reinforcement learning controllers are linearly weighted by $\lambda_i(t)$ to make the action output. In the discrete case, the probability of taking an action $u(t)$ is given by

$$P(u(t)) = \sum_{i=1}^{n} \lambda_i(t)\, G_i(u(t), x(t)). \tag{2.16}$$

In the continuous case, the output is given by the interpolation of the modular outputs,

$$u(t) = \sum_{i=1}^{n} \lambda_i(t)\, u_i(t) = \sum_{i=1}^{n} \lambda_i(t)\, g_i(x(t)). \tag{2.17}$$
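In code, the responsibility weighting of equations 2.15 and 2.17 is simply a convex combination of the module outputs. A minimal sketch (the array shapes are our own convention, not from the paper):

```python
import numpy as np

def weighted_prediction(lam, xdot_hat):
    """Eq. 2.15: responsibility-weighted state-derivative prediction.

    lam      : responsibilities, shape (n,), nonnegative and summing to 1
    xdot_hat : module predictions, shape (n, d)
    """
    return lam @ xdot_hat            # (n,) @ (n, d) -> (d,)

def mmrl_action(lam, module_actions):
    """Eq. 2.17: responsibility-weighted interpolation of the
    module controller outputs u_i(t)."""
    return lam @ module_actions      # (n,) @ (n, m) -> (m,)
```

With responsibilities (0.25, 0.75) and module actions 0 and 4, the blended action is 3, i.e., the better-predicting module dominates the output.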
Reinforcement learning: $\lambda_i(t)$ is also used for weighting the learning of the RL controllers. The actual equation for the parameter update varies with the choice of the RL algorithm, as detailed in the next section. When a temporal difference (TD) algorithm (Barto, Sutton, & Anderson, 1983; Sutton, 1988; Doya, 2000) is used, the TD error,

$$\delta(t) = r(t) + \gamma V(x(t+1)) - V(x(t)) \tag{2.18}$$

in the discrete case and

$$\delta(t) = r(t) - \frac{1}{\tau} V(t) + \dot{V}(t) \tag{2.19}$$

in the continuous case, is weighted by the responsibility signal,

$$\delta_i(t) = \lambda_i(t)\, \delta(t), \tag{2.20}$$
for the learning of the $i$th RL controller. Using the same weighting factor $\lambda_i(t)$ for training the prediction models and the RL controllers helps each RL controller learn an appropriate policy and its value function for the context under which its paired prediction model makes valid predictions.

2.3 Responsibility Predictors. When there is some prior knowledge or belief about module selection, we incorporate the "responsibility predictors" (Wolpert & Kawato, 1998; Haruno et al., 1999, 2001). By assuming that their outputs $\hat{\lambda}_i(t)$ are proportional to the prior probability of module selection, from equation 2.9, the responsibility signal is given by

$$\lambda_i(t) = \frac{\hat{\lambda}_i(t)\, P(y(t) \mid i)}{\sum_{j=1}^{n} \hat{\lambda}_j(t)\, P(y(t) \mid j)}. \tag{2.21}$$
In the modular decomposition of a task, it is desired that modules do not switch too frequently. This can be enforced by incorporating responsibility priors based on the assumption of temporal continuity and spatial locality of module activation.

2.3.1 Temporal Continuity. The continuity of module selection is incorporated by taking the previous responsibility signal as the responsibility prediction signal. In the discrete case, we take the responsibility prediction based on the previous responsibility,

$$\hat{\lambda}_i(t) = \lambda_i(t-1)^{\alpha}, \tag{2.22}$$

where $0 < \alpha < 1$ is a parameter that controls the strength of the memory effect. From equations 2.21 and 2.22, the responsibility signal at time $t$ is given by the product of the likelihoods of past module selection,

$$\lambda_i(t) = \frac{1}{Z(t)} \prod_{k=0}^{t} P(x(t-k) \mid i)^{\alpha^k}, \tag{2.23}$$

where $Z(t)$ denotes the normalizing factor, that is, $Z(t) = \sum_{j=1}^{n} \prod_{k=0}^{t} P(x(t-k) \mid j)^{\alpha^k}$.
In the continuous case, we choose the prior

$$\hat{\lambda}_i(t) = \lambda_i(t - \Delta t)^{\alpha^{\Delta t}}, \tag{2.24}$$

where $\Delta t$ is an arbitrarily small time difference (note that 2.24 coincides with 2.22 when $\Delta t = 1$). Since the likelihood of module $i$ is given by the gaussian $P(\dot{x}(t) \mid i) = e^{-\frac{1}{2\sigma^2} \|\dot{x}(t) - \hat{\dot{x}}_i(t)\|^2}$, from recursion as in equation 2.23, the responsibility signal at time $t$ is given by

$$\lambda_i(t) = \frac{1}{Z(t)} \prod_{k=0}^{t/\Delta t} P(\dot{x}(t - k\Delta t) \mid i)^{\Delta t\, \alpha^{k\Delta t}} = \frac{1}{Z(t)}\, e^{-\frac{1}{2\sigma^2} \Delta t \sum_{k=0}^{t/\Delta t} \|\dot{x}(t - k\Delta t) - \hat{\dot{x}}_i(t - k\Delta t)\|^2\, \alpha^{k\Delta t}}, \tag{2.25}$$

that is, a gaussian softmax function of temporally weighted squared errors. In the limit of $\Delta t \to 0$, equation 2.25 can be represented as

$$\lambda_i(t) = \frac{e^{-\frac{1}{2\sigma^2} E_i(t)}}{\sum_{j=1}^{n} e^{-\frac{1}{2\sigma^2} E_j(t)}}, \tag{2.26}$$

where $E_i(t)$ is a low-pass filtered prediction error,

$$\dot{E}_i(t) = \log \alpha\, E_i(t) + \|\dot{x}(t) - \hat{\dot{x}}_i(t)\|^2. \tag{2.27}$$
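Discretized with a forward-Euler step (our choice of integrator; the paper states only the continuous-time filter), equations 2.26 and 2.27 can be sketched as:

```python
import numpy as np

def filtered_error_step(E, err_sq, alpha, dt):
    """Euler step of eq. 2.27: dE_i/dt = log(alpha)*E_i + ||xdot - xdot_hat_i||^2.
    Since 0 < alpha < 1, log(alpha) < 0, so old errors decay away."""
    return E + dt * (np.log(alpha) * E + err_sq)

def responsibility_from_filtered(E, sigma):
    """Eq. 2.26: gaussian softmax of the low-pass filtered errors E_i."""
    z = -E / (2.0 * sigma ** 2)
    z -= z.max()                      # numerical stability
    lam = np.exp(z)
    return lam / lam.sum()
```

Using the filtered error rather than the instantaneous one keeps the responsibilities from chattering when the instantaneous prediction errors fluctuate.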
The use of this low-pass filtered prediction error for responsibility prediction is helpful in avoiding chattering of the responsibility signal (Pawelzik et al., 1996).

2.3.2 Spatial Locality. In the continuous case, we consider a gaussian spatial prior,

$$\hat{\lambda}_i(t) = \frac{e^{-\frac{1}{2}(x(t) - c_i)' M_i^{-1} (x(t) - c_i)}}{\sum_{j=1}^{n} e^{-\frac{1}{2}(x(t) - c_j)' M_j^{-1} (x(t) - c_j)}}, \tag{2.28}$$

where $c_i$ is the center of the area of specialization, $M_i$ is a covariance matrix that specifies its shape, and $'$ denotes transpose. These parameters are updated so that they approximate the distribution of the input state $x(t)$ weighted by the responsibility signal,

$$\dot{c}_i = \eta_c\, \lambda_i(t)\, (-c_i + x(t)), \tag{2.29}$$

$$\dot{M}_i = \eta_M\, \lambda_i(t)\, [-M_i + (x(t) - c_i)(x(t) - c_i)'], \tag{2.30}$$

where $\eta_c$ and $\eta_M$ are update rates.
3 Implementation of MMRL Architecture
For the RL controllers of MMRL, it is generally possible to use model-free RL algorithms, such as actor-critic and Q-learning. However, because the prediction models of the environmental dynamics are intrinsic components of the architecture, it is advantageous to use these prediction models not just for module selection but also for designing the RL controllers. In the following, we describe the use of model-based RL algorithms for the discrete-time and continuous-time cases. One special implementation for the continuous-time case is the use of multiple linear quadratic controllers derived from linear dynamic models and quadratic reward models.

3.1 Discrete-Time MMRL. Now we consider the implementation of the MMRL architecture for discrete-time, finite-state, and finite-action problems. The standard way of using a predictive model in RL is to use it for action selection by the one-step search,

$$u(t) = \arg\max_u E[\hat{r}(x(t), u) + \gamma V(\hat{x}(t+1))], \tag{3.1}$$

where $\hat{r}(x(t), u)$ is the predicted immediate reward and $\hat{x}(t+1)$ is the next state predicted from the current state $x(t)$ and a candidate action $u$. In order to implement this algorithm, we provide each module with a reward model $\hat{r}_i(x, u)$, a value function $V_i(x)$, and a dynamic model $F_i(\hat{x}, x, u)$. Each candidate action $u$ is then evaluated by

$$q(x(t), u) = E[\hat{r}(x(t), u) + \gamma V(\hat{x}(t+1)) \mid u] = \sum_{i=1}^{n} \lambda_i(t) \left[\hat{r}_i(x(t), u) + \gamma \sum_{\hat{x}=1}^{N} V_i(\hat{x})\, F_i(\hat{x}, x(t), u)\right]. \tag{3.2}$$

For the sake of exploration, we use a stochastic version of the greedy action selection, equation 3.1, where the action $u(t)$ is selected by a Gibbs distribution,

$$P(u \mid x(t)) = \frac{e^{\beta q(x(t), u)}}{\sum_{u'=1}^{M} e^{\beta q(x(t), u')}}, \tag{3.3}$$

where $\beta$ controls the stochasticity of action selection. The parameters are updated by the error signals weighted by the responsibility signal: $\lambda_i(t)(F_i(j, x(t-1), u(t-1)) - c(j, x(t)))$ for the dynamic model ($j = 1, \ldots, N$; $c(j, x) = 1$ if $j = x$ and zero otherwise), $\lambda_i(t)(\hat{r}_i(x(t), u(t)) - r(t))$ for the reward model, and $\lambda_i(t)\delta(t)$ for the value function model.

3.2 Continuous-Time MMRL. Next we consider a continuous-time MMRL architecture. A model-based RL algorithm for a continuous-time,
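A minimal sketch of the composite action value of equation 3.2 and the Gibbs action selection of equation 3.3 follows; the array layouts (module index first) are our own convention:

```python
import numpy as np

def q_values(lam, r_hat, V, F, x, gamma):
    """Composite action values of eq. 3.2.

    lam   : (n,)          responsibility signals
    r_hat : (n, N, M)     module reward models r_i(x, u)
    V     : (n, N)        module value functions V_i(x)
    F     : (n, N, N, M)  module dynamic models F_i(x', x, u)
    """
    n, N, M = r_hat.shape
    q = np.zeros(M)
    for u in range(M):
        for i in range(n):
            exp_V = V[i] @ F[i, :, x, u]  # sum_x' V_i(x') F_i(x', x, u)
            q[u] += lam[i] * (r_hat[i, x, u] + gamma * exp_V)
    return q

def gibbs_probs(q, beta):
    """Stochastic action selection of eq. 3.3; beta controls greediness."""
    z = beta * (q - q.max())              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Larger $\beta$ concentrates the probability mass on the highest-valued action; $\beta \to 0$ recovers uniform exploration.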
continuous-state system (see equation 2.2) is derived from the Hamilton-Jacobi-Bellman (HJB) equation,

$$\frac{1}{\tau} V(x(t)) = \max_u \left[ r(x(t), u) + \frac{\partial V(x(t))}{\partial x} f(x(t), u) \right], \tag{3.4}$$

where $\tau$ is the time constant of reward discounting (Doya, 2000). Under the assumptions that the system is linear with respect to the action and the action cost is convex, a greedy policy is given by

$$u = g\left( \frac{\partial f(x, u)}{\partial u}' \frac{\partial V(x)}{\partial x}' \right), \tag{3.5}$$

where $\frac{\partial V(x)}{\partial x}'$ is a vector representing the steepest ascent direction of the value function, $\frac{\partial f(x, u)}{\partial u}'$ is a matrix representing the input gain of the dynamics, and $g$ is a sigmoid function whose shape is determined by the control cost (Doya, 2000).

To implement the HJB-based algorithm, we provide each module with a dynamic model $f_i(x, u)$ and a value model $V_i(x)$. The outputs of the dynamic models, equation 2.12, are compared with the actually observed state dynamics $\dot{x}(t)$ to calculate the responsibility signal $\lambda_i(t)$ according to equation 2.13. The model outputs are linearly weighted by $\lambda_i(t)$ for state prediction,

$$\hat{\dot{x}}(t) = \sum_{i=1}^{n} \lambda_i(t)\, f_i(x(t), u(t)), \tag{3.6}$$

and value function estimation,

$$V(x) = \sum_{i=1}^{n} \lambda_i(t)\, V_i(x). \tag{3.7}$$

The derivatives of the dynamic models $\frac{\partial f_i(x, u)}{\partial u}$ and the value models $\frac{\partial V_i(x)}{\partial x}$ are used to calculate the action for each module:

$$u_i(t) = g\left( \frac{\partial f_i(x, u)}{\partial u}' \frac{\partial V_i(x)}{\partial x}' \right) \Bigg|_{x(t)}. \tag{3.8}$$

They are then weighted by $\lambda_i(t)$ according to equation 2.17 to make the actual action $u(t)$. Learning is based on the weighted prediction errors $\lambda_i(t)(\hat{\dot{x}}_i(t) - \dot{x}(t))$ for the dynamic models and $\lambda_i(t)\delta(t)$ for the value function models.
3.3 Multiple Linear Quadratic Controllers. In a modular architecture like MMRL, the use of universal nonlinear function approximators with large numbers of degrees of freedom can be problematic because it can lead to an undesired solution in which a single module tries to handle most of the task domain. The use of linear models for the prediction models and the controllers is a reasonable choice because local linear models have been shown to have good properties of quick learning and good generalization (Schaal & Atkeson, 1996). Furthermore, if the reward function is locally approximated by a quadratic function, then we can use a linear quadratic controller (see, e.g., Bertsekas, 1995) for the RL controller design. We use a local linear dynamic model,

$$\hat{\dot{x}}_i(t) = A_i (x(t) - x_i^d) + B_i u(t), \tag{3.9}$$

and a local quadratic reward model,

$$\hat{r}_i(x(t), u(t)) = r_i^0 - \frac{1}{2}(x(t) - x_i^r)' Q_i (x(t) - x_i^r) - \frac{1}{2} u'(t) R_i u(t), \tag{3.10}$$

for each module, where $x_i^d$ and $x_i^r$ are the centers of local prediction for the state and the reward, respectively, and $r_i^0$ is the bias of the quadratic reward model. The value function is given by the quadratic form,

$$V_i(x) = v_i^0 - \frac{1}{2}(x - x_i^v)' P_i (x - x_i^v). \tag{3.11}$$

The matrix $P_i$ is given by solving the Riccati equation,

$$0 = \frac{1}{\tau} P_i - P_i A_i - A_i' P_i + P_i B_i R_i^{-1} B_i' P_i - Q_i. \tag{3.12}$$

The center $x_i^v$ and the bias $v_i^0$ of the value function are given by

$$x_i^v = (Q_i + P_i A_i)^{-1} (Q_i x_i^r + P_i A_i x_i^d), \tag{3.13}$$

$$\frac{1}{\tau} v_i^0 = r_i^0 - \frac{1}{2}(x_i^v - x_i^r)' Q_i (x_i^v - x_i^r). \tag{3.14}$$

Then the optimal feedback control for each module is given by the linear feedback,

$$u_i(t) = -R_i^{-1} B_i' P_i (x(t) - x_i^v). \tag{3.15}$$

The action output is given by weighting these controller outputs by the responsibility signal $\lambda_i(t)$:

$$u(t) = \sum_{i=1}^{n} \lambda_i(t)\, u_i(t). \tag{3.16}$$
The parameters of the local linear models $A_i$, $B_i$, and $x_i^d$ and those of the quadratic reward models $r_i^0$, $Q_i$, and $R_i$ are updated by the weighted prediction errors $\lambda_i(t)(\hat{\dot{x}}_i(t) - \dot{x}(t))$ and $\lambda_i(t)(\hat{r}_i(x, u) - r(t))$, respectively. When we assume that the update of these models is slow enough, the Riccati equation, 3.12, may be recalculated only intermittently. We call this method multiple linear quadratic controllers (MLQC).
4 Simulation: Discrete Case
In order to test the effectiveness of the MMRL architecture, we first applied the discrete MMRL architecture to a nonstationary hunting task in a grid world. The hunter agent tries to catch a prey in a 7×7 torus grid world. There are 47 states representing the position of the prey relative to the hunter. The hunter chooses one of five possible actions: {north (N), east (E), south (S), west (W), stay}. A prey moves in a fixed direction during a trial. At the beginning of each trial, one of four movement directions {NE, NW, SE, SW} is randomly selected, and a prey is placed at a random position in the grid world. When the hunter catches the prey by stepping into the same grid as the prey, a reward r(t) = 10 is given. Each step of movement costs r(t) = −1. A trial is terminated when the hunter catches the prey or fails to catch it within 100 steps. In order to compare the performance of MMRL with conventional methods, we applied standard Q-learning and compositional Q-learning (CQ-L) (Singh, 1992) to the same task. A major difference between CQ-L and MMRL is the criterion for modular decomposition: CQ-L uses the consistency of the modular value functions, while MMRL uses the prediction errors of the dynamic models. In CQ-L, the gating network as well as the component Q-learning modules are trained so that the composite Q-value well approximates the action value function of the entire problem. In the original CQ-L (Singh, 1992), the output of the gating network was based on the "augmenting bit" that explicitly signaled the change in the context. Since our goal here is to let the agent learn an appropriate decomposition of the task without an explicit cue, we used a modified CQ-L (see the appendix for the details of the algorithm and the parameters).

4.1 Results. Figure 2 shows the performance difference of standard Q-learning, CQ-L, and MMRL in the hunting task. The modified CQ-L did not perform significantly better than standard, flat Q-learning.
Investigation of the modular Q functions of CQ-L revealed that in most simulation runs, the modules did not appropriately differentiate for the four different kinds of prey. On the other hand, the performance of MMRL approached the theoretical optimum. This was because the four modules successfully specialized, each in one of the four kinds of prey movement.
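The torus state representation used in this task can be sketched as follows; the coordinate and action encodings are our own illustrative choices, since the paper specifies only the action set and the torus geometry:

```python
# Hunting-task geometry: positions on a 7x7 torus grid.
SIZE = 7
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0), "stay": (0, 0)}

def relative_state(hunter, prey):
    """State: position of the prey relative to the hunter (wraps on the torus)."""
    return ((prey[0] - hunter[0]) % SIZE, (prey[1] - hunter[1]) % SIZE)

def step(pos, move):
    """Move one cell on the torus; the edges wrap around."""
    dx, dy = MOVES[move]
    return ((pos[0] + dx) % SIZE, (pos[1] + dy) % SIZE)
```

Because only the relative position matters, the state of the environment is the same whether the hunter moves toward the prey or the prey moves toward the hunter, which is what makes the prey-movement direction the natural axis of modular decomposition.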
Figure 2: Comparison of the performance of standard Q-learning (gray line), modified CQ-L (dashed line), and MMRL (thick line) in the hunting task. The average number of steps needed for catching a prey during 200-trial epochs in 10 simulation runs is plotted. The dash-dotted line shows the theoretically minimal average number of steps required for catching the prey.
Figure 3 shows examples of the value functions and the prediction models learned by MMRL. From the outputs of the prediction models $F_i$, it can be seen that modules 1, 2, 3, and 4 were specialized for the prey moving to the NE, NW, SW, and SE, respectively. The landscapes of the value functions $V_i(x)$ are in accordance with these movement directions of the prey. A possible reason for the difference in the performance of CQ-L and MMRL in this task is the difficulty of module selection. In CQ-L, when the prey is far from the hunter, the differences in the discounted Q values for the different kinds of prey are minor. Thus, it would be difficult to differentiate the modules based solely on the Q values. In MMRL, on the other hand, module selection based on the state change, in this case the prey movement, is relatively easy even when the prey is far from the hunter.

5 Simulation: Continuous Case
In order to test the effectiveness of the MMRL architecture for control, we applied the MLQC algorithm described in section 3.3 to the task of swinging up a pendulum with limited torque (see Figure 4) (Doya, 2000). The driving torque $T$ is limited to $[-T_{max}, T_{max}]$ with $T_{max} < mgl$. The pendulum has to be swung back and forth at the bottom to build up enough momentum for a successful swing-up.
Figure 3: Example of value functions and prediction models learned by MMRL after 10,000 trials. Each slot in the grid shows the position of the prey relative to the hunter, which was used as the state $x$. (a) The state value functions $V_i(x)$. (b) The prediction model outputs $F_i(\hat{x}, x, u)$, where the current state $x$ of the prey was fixed as (2, 1), shown by the circle, and the action $u$ was fixed as "stay."
The state space was two-dimensional: $x = (\theta, \dot{\theta})' \in [-\pi, \pi] \times \mathbb{R}$, where $\theta$ is the joint angle ($\theta = 0$ means the pendulum hanging down). The action was $u = T$. The reward was given by the height of the tip and the negative squared torque:

$$r(x, u) = -\cos\theta - \frac{1}{2} R T^2. \tag{5.1}$$

A trial was started from a random joint angle $\theta \in [-\pi/4, \pi/4]$ with no angular velocity. We devised the following automatic annealing process for the parameter $\sigma$ of the softmax function for the responsibility signal, equation 2.26,

$$\sigma_{k+1} = \eta a \bar{E}_k + (1 - \eta)\sigma_k, \tag{5.2}$$

where $k$ denotes the trial number and $\bar{E}_k$ is the average state prediction error during the $k$th trial. The parameters were $\eta = 0.25$, $a = 2$, and the initial value was set as $\sigma_0 = 4$.

5.1 Task Decomposition in Space: Nonlinear Control. We first used two modules, each of which had a linear dynamic model (see equation 3.9) and a quadratic reward model (see equation 3.10). The centers of the local linear dynamic models were initially placed randomly, with the angular component in $[-\pi, \pi]$.
Each trial was started from a random position of the pendulum and lasted for 30 seconds. Figure 4 shows an example of swing-up performance from the bottom position. Initially, the first prediction model predicts the pendulum motion better than the second one, so the responsibility signal $\lambda_1$ becomes close to 1. Thus, the output of the first RL controller, $u_1$, which destabilizes the bottom position, is used for control. As the pendulum is driven away from the bottom, the second prediction model predicts the movement better, so $\lambda_2$ becomes higher and the second RL controller takes over and stabilizes the upright position. Figure 5 shows the changes of the linear prediction models and quadratic reward models before and after learning. The two linear prediction models approximated the nonlinear gravity term. The first model predicted the negative feedback acceleration around the equilibrium state with the pendulum hanging down. The second model predicted the positive feedback acceleration around the unstable equilibrium with the pendulum raised up. The two reward models also approximated the cosine reward function using parabolic curves. Figure 6 shows the dynamic and reward models when there were eight modules. Two modules were specialized for the bottom position, three modules were specialized near the top position, and two other modules were centered somewhere in between. The result shows that proper modularization is possible even when there are redundant modules. Figure 7 compares the time course of learning by MLQC with two, four, and eight modules and by a nonmodular actor-critic (Doya, 2000). Learning was fastest with two modules. The addition of redundant modules resulted in more variability in the time course of learning. This is because there were multiple possible ways of modular decomposition, and due to the variability of the sample trajectories, it took longer for the modular decomposition to stabilize.
Nevertheless, learning by the eight-module MLQC was still much faster than by the nonmodular architecture. An interesting feature of the MLQC strategy is that qualitatively different controllers are derived from the solutions of the Riccati equations (3.12). The controller at the bottom is a positive feedback controller that destabilizes the equilibrium where the reward is minimal, while the controller at the top is a typical linear quadratic regulator that stabilizes the upright state. Another important feature of the MLQC is that the modules were flexibly switched simply based on the prediction errors. Successful swing-up was achieved without any top-down planning of the complex sequence.
5.2 Task Decomposition in Time: Nonstationary Pendulum.

We then tested the effectiveness of the MMRL architecture in a nonlinear and nonstationary control task in which the mass m and length l of the pendulum were changed every trial.
1362
Kenji Doya et al.
Figure 4: (a) Example of swing-up performance. The dynamics are given by ml²θ̈ = −mgl sin θ − μθ̇ + T. Physical parameters are m = l = 1, g = 9.8, μ = 0.1, and Tmax = 5.0. (b) Trajectory from the initial state (0 [rad], 0.1 [rad/s]). ○: start; +: goal. Solid line: module 1. Dashed line: module 2. (c) Time course of the state (top), the action (middle), and the responsibility signal (bottom).
Figure 5: Development of the state and reward prediction models. (a, b) Outputs of the state prediction models (a) before and (b) after learning. (c, d) Outputs of the reward prediction models (c) before and (d) after learning. Solid line: module 1. Dashed line: module 2. Dotted line: targets (ẍ and r). ○: centers of the spatial responsibility predictors ci.
Figure 6: Outputs of eight modules. (a) State prediction models. (b) Reward models.
Figure 7: Learning curves for the pendulum swing-up task. The cumulative reward ∫₀²⁰ r(t) dt during each trial is shown for five simulation runs. (a) Two modules. (b) Four modules. (c) Eight modules. (d) Nonmodular architecture.
We used four modules, each of which had a linear dynamic model (see equation 3.9) and a quadratic reward model (see equation 3.10). The centers xi of the local linear prediction models were initially set randomly. Each trial was started from a random position with θ ∈ [−π/4, π/4] and lasted for 40 seconds. We implemented responsibility prediction with tc = 50, tM = 200, and tp = 0.1. The parameters of annealing were γ = 0.1, α = 2, and an initial value of σ0 = 10. In the first 50 trials, the physical parameters were fixed at {m = 1.0, l = 1.0}. Figure 8a shows the change in the position gain A21 = ∂θ̈/∂θ of the four prediction models. The control performance is shown in Figure 8b. Figures 8c, 8d, and 8e show the outputs of the prediction models in the section {θ̇ = 0, T = 0}. Initial position gains were set randomly (see Figure 8c). After 50 trials, both modules 1 and 2 specialized in the bottom region (θ ≈ 0) and learned similar prediction models. Modules 3 and 4 also learned the same prediction model in the top region (θ ≈ π) (see Figure 8d). Accordingly, the RL controllers in modules 1 and 2 learned a reward model with a minimum near (0, 0)′, and
Figure 8: Time course of learning and changes of the prediction models. (a) Changes of the coefficient A21 = ∂θ̈/∂θ (coupling of angular acceleration with angle) of the four prediction models. (b) Change of the average reward during each trial. Thin lines: results of 10 simulation runs. Thick line: average over 10 simulation runs. Note that the average reward with the new, longer pendulum was lower even after successful learning because of its longer period of swinging. (c, d, e) Linear prediction models in the section {θ̇ = 0, T = 0} (c) before learning, (d) after 50 trials with fixed parameters, and (e) after 150 trials with changing parameters. Slopes of the linear models correspond to A21 shown in (a).
a destabilizing feedback policy was given by equations 2.15 through 2.17. Modules 3 and 4 also learned a reward model with a peak near (π, 0)′ and implemented a stabilizing feedback controller. In trials 50 to 200, the parameters of the pendulum were switched between {m = 1, l = 1.0} and {m = 0.2, l = 10.0} on each trial. At first, the degenerated modules tried to follow the alternating environment (see Figure 8a), and thus swing-up was not successful for the new, longer pendulum. The performance for the shorter pendulum was also disturbed (see Figure 8b). After about 80 trials, the prediction models gradually specialized in either the new or the previously learned dynamics (see Figure 8e), and successful swing-up was achieved for both the shorter and longer pendulums.
We found similar module specialization in 6 of 10 simulation runs. In the 4 other runs, due to the bias in initial module allocation, three modules were aggregated in one domain (top or bottom) and one model covered the other domain during the stationary condition. However, after 150 trials in the nonstationary condition, module specialization as shown in Figure 8e was achieved.

6 Discussion
We proposed an MMRL architecture that decomposes a nonlinear or nonstationary task in space and time based on the local predictability of the system dynamics. We tested the performance of the MMRL in both nonlinear and nonstationary control tasks. It was shown in simulations of the pendulum swing-up task that multiple prediction models were successfully trained and that corresponding model-based controllers were derived. The modules were specialized for different domains in the state space. It was also confirmed in a nonstationary pendulum swing-up task that the available modules are flexibly allocated to different domains in space and time based on the task demands.

The modular control architecture using multiple prediction models was proposed by Wolpert and Kawato as a computational model of the cerebellum (Wolpert et al., 1998; Wolpert & Kawato, 1998). Imamizu et al. (1997, 2000) showed in fMRI experiments on novel tool use that a large area of the cerebellum is activated initially, and then a smaller area remains active after long training. They proposed that such local activation spots are the neural correlates of internal models of tools (Imamizu et al., 2000). They also suggested that internal models of different tools are represented in separate areas of the cerebellum (Imamizu et al., 1997). Our simulation results in a nonstationary environment can provide a computational account of these fMRI data: when a new task is introduced, many modules initially compete to learn it, but after repetitive learning, only a subset of the modules is specialized and recruited for the new task.

One might ask whether MLQC is a reinforcement learning architecture, since it uses LQ controllers that were calculated off-line. However, when the linear dynamic models and quadratic reward models are learned on-line, as in our simulations, the entire system realizes reinforcement learning.
One limitation of the MLQC architecture is that the reward function should have helpful gradients in each modular domain. A method for backpropagating the value function of the successor module as the effective reward for the predecessor module is under development. In order to construct a hierarchical RL system, it appears necessary to combine top-down and bottom-up approaches to task decomposition. The MMRL architecture provides one solution for the bottom-up approach. Combination of this bottom-up mechanism with a top-down mechanism is the subject of our ongoing study.
Appendix: Modified Compositional Q-Learning
On each time step, the gating variable gi(t) is given by the prior probability of module selection, in this case from the assumption of temporal continuity (equation 2.22):

gi(t) = λi(t − 1)^α / Σj λj(t − 1)^α,  (A.1)

where the sum runs over j = 1, …, n. The composite Q-values for state x(t) are then computed by

Q̂(x(t), u) = Σi gi(t) Qi(x(t), u),  (A.2)
and an action u(t) is selected by

P(u | x(t)) = e^(β Q̂(x(t), u)) / Σv∈U e^(β Q̂(x(t), v)).  (A.3)
After the reward r(x(t), u(t)) is acquired and the state changes to x(t + 1), the TD error for module i is given by

ei(t) = r(x(t), u(t)) + γ maxu Q̂(x(t + 1), u) − Qi(x(t), u(t)).  (A.4)
From the gaussian assumption on the value prediction error, the likelihood of module i is given by

P(ei(t) | i) = e^(−ei(t)² / 2σ²),  (A.5)
and thus the responsibility signal, or the posterior probability of selecting module i, is given by

λi(t) = gi(t) e^(−ei(t)² / 2σ²) / Σj gj(t) e^(−ej(t)² / 2σ²).  (A.6)
The Q values of each module are updated with the responsibility-weighted TD error λi(t) ei(t) as the error signal. The discount factor was set as γ = 0.9 and the greediness parameter as β = 1 for both MMRL and CQ-L. The decay parameter of the temporal responsibility predictor was α = 0.8 for MMRL. We tried different values of α for CQ-L without success; the value used in Figures 2 and 3 was α = 0.99.
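The appendix update can be sketched in tabular form as below. This is a minimal sketch under my own assumptions (discrete states and actions, a single shared σ, and the same gating used for the current and next state); it is not the authors' code.

```python
import numpy as np

def cql_step(Q, lam_prev, x, x_next, r, alpha=0.8, beta=1.0,
             gamma=0.9, sigma=1.0, lr=0.1, rng=np.random.default_rng(0)):
    """One step of the modified compositional Q-learning (equations A.1-A.6).
    Q has shape (n_modules, n_states, n_actions); x, x_next are state indices."""
    g = lam_prev ** alpha
    g = g / g.sum()                                # (A.1) gating from continuity
    Qhat = g @ Q[:, x, :]                          # (A.2) composite Q-values
    p = np.exp(beta * Qhat)
    p = p / p.sum()                                # (A.3) softmax action selection
    u = rng.choice(len(p), p=p)
    Qhat_next = g @ Q[:, x_next, :]                # composite value of next state
    e = r + gamma * Qhat_next.max() - Q[:, x, u]   # (A.4) per-module TD error
    lik = np.exp(-e ** 2 / (2 * sigma ** 2))       # (A.5) gaussian likelihood
    lam = g * lik / (g * lik).sum()                # (A.6) responsibility signal
    Q[:, x, u] += lr * lam * e                     # responsibility-weighted update
    return u, lam
```

With equal prediction errors the modules receive equal responsibility, and the update reduces to an ordinary TD step shared across modules.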
Acknowledgments
We thank Masahiko Haruno, Daniel Wolpert, Chris Atkeson, Jun Tani, Hidenori Kimura, and Raju Bapi for helpful discussions.

References

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.

Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont, MA: Athena Scientific.

Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6. San Mateo, CA: Morgan Kaufmann.

Dayan, P., & Hinton, G. E. (1993). Feudal reinforcement learning. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 271–278). San Mateo, CA: Morgan Kaufmann.

Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12, 215–245.

Gomi, H., & Kawato, M. (1993). Recognition of manipulated objects by motor learning with modular architecture networks. Neural Networks, 6, 485–497.

Haruno, M., Wolpert, D. M., & Kawato, M. (1999). Multiple paired forward-inverse models for human motor learning and control. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 31–37). Cambridge, MA: MIT Press.

Haruno, M., Wolpert, D. M., & Kawato, M. (2001). MOSAIC model for sensorimotor learning and control. Neural Computation, 13, 2201–2220.

Imamizu, H., Miyauchi, S., Sasaki, Y., Takino, R., Pütz, B., & Kawato, M. (1997). Separated modules for visuomotor control and learning in the cerebellum: A functional MRI study. In A. W. Toga, R. S. J. Frackowiak, & J. C. Mazziotta (Eds.), NeuroImage: Third International Conference on Functional Mapping of the Human Brain (Vol. 5). Copenhagen, Denmark: Academic Press.
Imamizu, H., Miyauchi, S., Tamada, T., Sasaki, Y., Takino, R., Pütz, B., Yoshioka, T., & Kawato, M. (2000). Human cerebellar activity reflecting an acquired internal model of a new tool. Nature, 403, 192–195.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.

Littman, M., Cassandra, A., & Kaelbling, L. (1995). Learning policies for partially observable environments: Scaling up. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the 12th International Conference (pp. 362–370). San Mateo, CA: Morgan Kaufmann.

Morimoto, J., & Doya, K. (2001). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36, 37–51.
Narendra, K. S., Balakrishnan, J., & Ciliz, M. K. (1995, June). Adaptation and learning using multiple models, switching and tuning. IEEE Control Systems Magazine, 37–51.

Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 1043–1049). Cambridge, MA: MIT Press.

Pawelzik, K., Kohlmorgen, J., & Müller, K.-R. (1996). Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation, 8, 340–356.

Schaal, S., & Atkeson, C. G. (1996). From isolation to cooperation: An alternative view of a system of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 605–611). Cambridge, MA: MIT Press.

Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8, 323–340.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.

Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.

Tani, J., & Nolfi, S. (1999). Learning to perceive the world as articulated: An approach for hierarchical learning in sensory-motor systems. Neural Networks, 12, 1131–1141.

Wiering, M., & Schmidhuber, J. (1998). HQ-learning. Adaptive Behavior, 6, 219–246.

Wolpert, D. M., & Ghahramani, Z. (2000). Computational principles of movement neuroscience. Nature Neuroscience, 3, 1212–1217.

Wolpert, D. M., & Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11, 1317–1329.

Wolpert, D. M., Miall, R. C., & Kawato, M. (1998). Internal models in the cerebellum. Trends in Cognitive Sciences, 2, 338–347.
Received June 12, 2000; accepted November 8, 2001.
LETTER
Communicated by Ning Qian
A Bayesian Approach to the Stereo Correspondence Problem Jenny C. A. Read
[email protected]
University Laboratory of Physiology, Oxford, OX1 3PT, U.K.

I present a probabilistic approach to the stereo correspondence problem. Rather than trying to find a single solution in which each point in the left retina is assigned a partner in the right retina, all possible matches are considered simultaneously and assigned a probability of being correct. This approach is particularly suitable for stimuli where it is inappropriate to seek a unique partner for each retinal position—for instance, where objects occlude each other, as in Panum's limiting case. The probability assigned to each match is based on a Bayesian analysis previously developed to explain psychophysical data (Read, 2002). This provides a convenient way to incorporate constraints that enable the ill-posed correspondence problem to be solved. The resulting model behaves plausibly for a variety of different stimuli.

1 Introduction
A fundamental problem facing the visual system is how to extract information about a three-dimensional world from a two-dimensional retinal image. One clue in this task is provided by retinal disparity: the difference in the position of an object's images in the left and right eyes, arising from the horizontal displacement of the eyes in space. Evidently, this calculation depends on correctly matching each feature in the left retinal image with its counterpart in the right retina. This task has become known as the correspondence problem. In a simple visual scene such as that illustrated in Figure 1, this presents little difficulty. However, in a more complicated visual scene, the correspondence problem may be highly complex and indeed is ill posed. In general, there are many possible solutions of the correspondence problem, each implying a different arrangement of physical objects. Despite this, the visual system is capable of arriving, almost instantaneously, at a judgment of disparity across a scene. This implies that the visual system must be using additional constraints to select a solution. Bayesian probability theory (Knill & Richards, 1996) provides a natural way of framing these constraints. Bayesian models of perception typically envisage an observer attempting to deduce information about the visual scene, S, given an image I. In the context of stereopsis, S represents the location of objects in space, and I represents the pair of retinal images.

© 2002 Massachusetts Institute of Technology. Neural Computation, 14, 1371–1392 (2002).
Figure 1: How probability relates to perception in physical space. The eyes (circles) are fixating on object F, whose image thus falls at the fovea in both retinas. Object P is in front of the fixation point and thus has positive (crossed) disparity d = xL − xR, where xL, xR are the horizontal positions of the image of P in the left and right retinas. The distance δ describes how far the object at P is in front of the fixation point F, while the angle xc describes how far it is to the right of F. 2I is the interocular distance, and a is the vergence angle. Under the approximation that the fixation point is sufficiently distant, and all objects viewed are sufficiently close to it, that the angles a, xL, xR, and xc are all small, it can be shown that xc ≈ (xL + xR)/2 and δ ≈ (xL − xR)I/2a². Each potential match between a point (xL, y) in the left retina and a point (xR, y) in the right retina implies a percept of an object at the corresponding location in space P, with luminance depending on the mean of the light intensities recorded in the two retinas. The strength of the perception is presumed to increase monotonically with the probability P{(xL, y) ↔ (xR, y)} assigned to the match (zero probability = no perception).
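The small-angle relations in the caption can be written directly as code. A trivial sketch; the function name and units are my own choices, and `delta` denotes the distance in front of fixation:

```python
def match_to_location(xL, xR, I, a):
    """Map a retinal match (xL, xR) to a perceived location in space,
    using the approximations of Figure 1: xc = (xL + xR)/2 is the angle
    to the right of fixation, and delta = (xL - xR) * I / (2 * a**2) is
    the distance in front of the fixation point (all angles small)."""
    xc = (xL + xR) / 2
    delta = (xL - xR) * I / (2 * a ** 2)
    return xc, delta

# Illustrative values (radians and meters are assumptions for the example):
xc, delta = match_to_location(xL=0.02, xR=0.01, I=0.032, a=0.05)
```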
Due to noise within the system, a given configuration of objects does not necessarily produce the same image on successive presentations. However, we assume that the imaging system and its limitations are well characterized, so that the brain knows the likelihood P(I | S) of obtaining an image I given a particular scene S. Furthermore, we assume that the brain has its own a priori estimate of the probability P(S) that a particular scene occurs. Then Bayes' theorem allows us to deduce the posterior probability of a particular
scene S, given that we receive the image I: P(S | I) = P(I | S) P(S) / P(I). For instance, in the double-nail illusion, human observers presented with two nails in the midsagittal plane report a clear perception of two nails with zero disparity in the frontoparallel plane (Krol & van de Grind, 1980; Mallot & Bideau, 1990). This solution is preferred over the physically correct solution, even though other cues, such as size and shading, mean that the latter presumably has the higher likelihood P(I | S). One way of explaining this within a Bayesian framework is to postulate a prior preference for small disparities. Then the solution in which both nails have zero disparity is assigned a higher posterior probability than the other solution, in which one nail has crossed disparity and the other uncrossed. The zero-disparity solution is thus the one perceived.

A second advantage of a probabilistic approach is its ability to handle ambiguity. Several existing models of the correspondence problem (Dev, 1975; Nelson, 1975; Marr & Poggio, 1976, 1979; Grimson, 1981; Pollard, Mayhew, & Frisby, 1985; Sanger, 1988) employ a uniqueness constraint. That is, they seek a unique match in the right eye for every point in the left, and vice versa. This could be implemented within a Bayesian scheme by taking the correct match at every point to be that which has the highest posterior probability. However, although computationally convenient in avoiding false matches, a uniqueness constraint is clearly not satisfied in practice. Parts of the visual scene are often occluded from one or the other eye; for instance, a stereogram consisting of a disparate target superimposed on a zero-disparity background may contain regions that have no match in the other eye. Conversely, occluding stimuli may require one point in the left image to be matched with two points in the right image.
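The double-nail argument can be made concrete with toy numbers (the likelihood and prior values below are illustrative assumptions, not fitted quantities):

```python
# Two candidate interpretations of the double-nail stimulus:
#   S1: both nails at zero disparity (the percept observers report)
#   S2: one nail crossed, one uncrossed (the physically correct scene)
likelihood = {'S1': 0.4, 'S2': 0.6}   # monocular cues mildly favor S2
prior      = {'S1': 0.8, 'S2': 0.2}   # prior preference for small disparities

# Bayes' theorem: posterior proportional to likelihood * prior
unnorm = {s: likelihood[s] * prior[s] for s in likelihood}
Z = sum(unnorm.values())
posterior = {s: v / Z for s, v in unnorm.items()}
# The prior overturns the likelihood: S1 receives the higher posterior,
# matching the zero-disparity percept reported by observers.
```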
Experimental evidence suggests that the human visual system does indeed produce double matches in this situation (McKee, Bravo, Smallman, & Legge, 1995), and an algorithm avoiding the uniqueness constraint is capable of deriving the correct solution (McLoughlin & Grossberg, 1998). Finally, although algorithms incorporating the uniqueness constraint have been able to solve stereograms incorporating transparency (Qian & Sejnowski, 1989; Pollard & Frisby, 1990), the uniqueness constraint certainly appears less than ideal for such stimuli (Westheimer, 1986; Weinshall, 1989). A probabilistic theory is well suited to such situations. Let P{(xL, y) ↔ (xR, y)} denote the probability that the point (xL, y) in the left retina corresponds to the point (xR, y) in the right retina, that is, that both points are images of the same object. High probability can be assigned to several matches for a given object, without having to implement a uniqueness constraint by deciding on only one match as being "correct." I adopt the following working hypothesis of how the match probability relates to perception, illustrated in Figure 1. If the point (xL, y) in the left retina corresponds to the point (xR, y) in the right, then both must be viewing
an object located at an angle xc = (xL + xR)/2 to the straight-ahead direction, and at a distance δ = (xL − xR)I/2a² in front of the fixation point (see Figure 1). I propose that a nonzero match probability P{(xL, y) ↔ (xR, y)} implies a perceptual experience of an object at the corresponding point in space, with luminance equal to the mean luminance of the retinal images, [IL(xL, y) + IR(xR, y)]/2. I propose that the strength of the perceptual experience depends on the probability assigned to the match, P{(xL, y) ↔ (xR, y)}, with higher probabilities creating a clearer perception.

In previous work (Read, 2002), I developed a computational model based on these Bayesian ideas. This model was designed to explain the results of psychophysical experiments involving two-interval forced-choice discrimination of the sign of stereoscopic disparity (crossed versus uncrossed) (Read & Eagle, 2000). It was tested only with stimuli whose disparity was constant across the entire image, and the model reported the most likely value of this global disparity. This procedure was appropriate for modeling the results of our two-interval forced-choice discrimination experiments, and the simulations captured the key features of the psychophysical data across a variety of spatial frequency and orientation bandwidths. However, the model actually calculated internally a detailed probabilistic disparity map of the stimulus, assigning a probability to every potential match. For the previous article, this information was then compressed into a single estimate of stimulus disparity. Now I wish to probe the probabilistic disparity map in more detail, testing the model on stimuli whose disparity varies across the image, to see whether the model can reconstruct disparity in a way that qualitatively captures the behavior of human subjects.

2 Methods

2.1 Structure of the Model.
The model is described in full mathematical detail in Read (2002), and only an outline is given here. Following earlier work by Qian (1994) and Prince and Eagle (2000a), the model was designed to incorporate the known physiology presumed to underlie binocular vision. Thus, rather than applying a Bayesian analysis to the original retinal images, the retinal images are first processed by model simple cells, which are assumed to be linear with an output nonlinearity of half-wave rectification (Movshon, Thompson, & Tolhurst, 1978; Anzai, Ohzawa, & Freeman, 1999). The receptive fields of these cells are Gabor functions (the product of a gaussian and a cosine), described by the spatial frequency and orientation to which they are tuned, and by the bandwidth of this tuning. For simplicity, all simple cells in my model have a spatial frequency bandwidth of 1.5 octaves and an orientation bandwidth of 30 degrees, defined as the full width at half height of the tuning curve. This is in accordance with psychophysical and physiological evidence for the bandwidths of channels in the visual system (Mansfield & Parker, 1993; de Valois, Albrecht, & Thorell, 1982; de Valois, Yund, & Hepler, 1982; de Valois & de Valois, 1988). These model simple
cells then feed into disparity-tuned model complex cells, whose response is simulated using the energy model developed by Adelson and Bergen (1985), Fleet, Wagner, and Heeger (1996), and Ohzawa, DeAngelis, and Freeman (1990, 1997), and analyzed by Qian (1994). All simple cells feeding into a single complex cell are assumed to be tuned to the same spatial frequency and orientation; they differ only in the phase of their Gabor receptive field and in their position on the retina. This model uses quadrature pairs of simple cells—simple cells whose receptive fields differ in phase by π/2. The difference between the positions of the receptive fields in the left and right retinas defines the disparity to which the complex cell is tuned. My model employs only tuned-excitatory complex cells, which respond maximally to stimuli at their preferred disparity.
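A one-dimensional sketch of such a unit is given below. It assumes position-shifted right-eye receptive fields; the Gabor parameters are illustrative, and the single gaussian envelope glosses over the bandwidth details discussed above.

```python
import numpy as np

def gabor1d(n, period, sigma, phase):
    """1D Gabor receptive field: gaussian envelope times a cosine carrier."""
    x = np.arange(n) - n // 2
    return np.exp(-x ** 2 / (2 * sigma ** 2)) * np.cos(2 * np.pi * x / period + phase)

def complex_response(imgL, imgR, shift, period=16, sigma=8):
    """Binocular energy-model response of a tuned-excitatory complex cell.
    A quadrature pair of simple cells (even/odd phase); shifting the
    right-eye RFs by `shift` pixels sets the preferred disparity."""
    n = len(imgL)
    even = gabor1d(n, period, sigma, 0.0)
    odd = gabor1d(n, period, sigma, np.pi / 2)
    s_even = imgL @ even + imgR @ np.roll(even, shift)
    s_odd = imgL @ odd + imgR @ np.roll(odd, shift)
    return s_even ** 2 + s_odd ** 2   # energy: sum of squared pair responses

# A stimulus with a true disparity of 4 pixels drives the matched cell hardest:
imgL = np.zeros(64); imgL[32] = 1.0
imgR = np.roll(imgL, 4)
```

A cell whose shift matches the stimulus disparity sees coherent left- and right-eye inputs and responds more strongly than a mismatched cell.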
2.2 The Bayesian Analysis.

A complex cell of this form can be used to test the hypothesis that the region of the stimulus within the complex cell's receptive field has the disparity that the complex cell is tuned to. If this were the case, then the part of the image falling within the complex cell's left retinal receptive field would be expected to be identical to the part of the image viewed by the right retinal receptive field. Thus, the response of the binocular complex cell could actually be predicted from a knowledge of the firing rates of simple cells with receptive fields in one eye only. This observation forms the basis of the Bayesian analysis carried out by the present computational model.

The model assumes that the left and right retinal images are independently degraded by noise. Any other sources of noise within the visual system are neglected. The retinal noise means that even if the stimulus does have exactly the disparity that the complex cell is tuned to, there will nevertheless usually be some slight discrepancy between the actual firing rate of the complex cell and that predicted by considering the response of simple cells with receptive fields in just one eye. However, the distribution of this discrepancy can be calculated analytically, given the amplitude of the retinal noise. Hence, one can deduce the probability of obtaining the observed binocular complex cell firing rate, given the firing rate of the monocular simple cells from one eye, on the assumption that the disparity the complex cell is tuned to is that actually present in the stimulus. According to Bayes' theorem, this calculation can then be inverted to arrive at the probability that the stimulus really does have the disparity that the complex cell is tuned to, given the observed firing rates of the binocular complex cell itself and the monocular simple cells that feed into it. This calculation applies only to the patch of the stimulus falling within the complex cell's receptive field.
In addition, since the complex cell is tuned to a particular spatial period λ and orientation θ, it applies only to that part of the Fourier spectrum of the stimulus that falls within the complex cell's bandpass region. I thus refer to this probability as the local single-channel
match probability (Read, 2002): Pλθ{(xL, y) ↔ (xR, y)}. This is the probability that the region of the left retinal image centered on (xL, y) corresponds to the region of the right image centered on (xR, y). Note that both RFs are assumed to have the same vertical position y: I include model complex cells tuned to horizontal disparities only. λ and θ are the spatial period and orientation to which the complex cell is tuned. I include several different orientation tunings (0°, 30°, 60°, 90°, 120°, 150°), ranging from horizontal to vertical, and several different spatial frequencies (1, 2, 4, 8, 16 cycles per image), designed to cover the full range of frequencies visible to humans (de Valois & de Valois, 1988).

Precisely what is meant by "local" in this context depends on the channel under consideration. The probability analysis within each channel implements a local smoothness constraint—that is, it assumes the stimulus disparity is constant across the complex cell's receptive field. This is a region in each retina whose extent scales with the spatial period λ to which the complex cell is tuned (for the bandwidths employed here, it is approximately 0.277λ × 0.506λ). I now average the local single-channel match probability over all spatial frequency and orientation channels to arrive at a local match "probability" to which all channels contribute: P{(xL, y) ↔ (xR, y)} = Σλθ Pλθ{(xL, y) ↔ (xR, y)}. This averaging process is purely heuristic. A full mathematically valid treatment would require the joint probability of obtaining a particular set of firing rates from the entire population of complex cells. The motivation for this approach is discussed in Read (2002). Thus, P{(xL, y) ↔ (xR, y)} is strictly not a probability, although I shall refer to it as such.
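The heuristic combination across channels might be sketched as a simple mean over the (λ, θ) grid. This is my own minimal illustration, with random arrays standing in for the per-channel match-probability maps:

```python
import numpy as np

rng = np.random.default_rng(1)
# One local match-probability map per (spatial frequency, orientation)
# channel, indexed over candidate (xL, xR) pairs; random values stand in
# for the real per-channel maps computed by the Bayesian analysis.
channels = {(f, th): rng.random((16, 16))
            for f in (1, 2, 4, 8, 16)             # cycles per image
            for th in (0, 30, 60, 90, 120, 150)}  # orientation in degrees

# Heuristic combination across channels: the local match "probability"
P = sum(channels.values()) / len(channels)
```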
It is regarded as an estimate of the probability that the position (xL, y) in the left retina corresponds to the position (xR, y) in the right retina, that is, that the stimulus disparity in the vicinity of this region is d = xL − xR. The "vicinity" represents an average over the receptive fields of the different channels, whose area and orientation are different for each channel. This local match probability P{(xL, y) ↔ (xR, y)} forms the probabilistic disparity map, describing how likely each potential match (xL, y) ↔ (xR, y) is.

2.3 Details of the Model Parameters.

The Bayesian prior, describing the a priori probability P{d} that a stimulus has the disparity d, is taken to have the form
P{d} = [Δ² + (d − Δ/2)²]^(−3/2) + [Δ² + (d + Δ/2)²]^(−3/2).

This closely resembles a gaussian function but is less sharply peaked at the origin and decays less steeply (Read, 2002). Δ is a scale parameter, effectively describing which disparities are counted as "small." In the original
article (Read, 2002), Δ was one of two free parameters that were systematically adjusted in order to produce a good fit to experimental results, the other being the level of retinal noise. I found it necessary to postulate a very low noise level, just 0.075% of the contrast of the binary random dot patterns used here and in the previous article. A very tight prior of 2.4 arcmin was then necessary in order to reproduce the observed decline in performance as stimulus disparity increased. Although the model was constructed on the presumption that the brain introduces rather little noise and that the major noise affecting the calculation arises at the inputs (for instance, spontaneous photoisomerizations, photon shot noise, and the limitations of the eye's optics), the fitted noise level is extremely low and may not be realistic. The low noise was found to be necessary in order to prevent the model from finding incorrect "reversed-phi" (Anstis & Rogers, 1975) matches in dense anticorrelated stereo stimuli, which would disagree with human psychophysics (Julesz, 1971; Cogan, Lomakin, & Rossi, 1993; Cumming, Shapiro, & Parker, 1998). With correlated stimuli, good matches to human psychophysics could be obtained with much higher noise levels. The very low noise level may thus be an artifact of other inadequacies of the model—for instance, its failure to incorporate any non-Fourier mechanisms that might suppress false matches in anticorrelated stimuli. Alternatively, it is possible that such low noise levels are achieved by averaging uncorrelated noise over a population. In this view, a single unit in the model would represent a small local population of identical physiological units, in which noise tends to be averaged away. All simulations presented here use the values of the noise and prior scale length Δ fitted to the psychophysical data (Read, 2002).
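The prior can be sketched numerically. The function below implements the (unnormalized) formula given above, evaluated with the fitted scale of 2.4 arcmin; the disparity grid is an illustrative choice:

```python
import numpy as np

def disparity_prior(d, Delta):
    """Unnormalized prior P{d} over disparity d with scale parameter Delta:
    flatter-peaked at the origin and heavier-tailed than a gaussian."""
    return ((Delta ** 2 + (d - Delta / 2) ** 2) ** -1.5
            + (Delta ** 2 + (d + Delta / 2) ** 2) ** -1.5)

d = np.linspace(-20, 20, 401)       # candidate disparities in arcmin
p = disparity_prior(d, Delta=2.4)   # tight prior favoring small disparities
```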
The model was originally developed to account for the results of psychophysical experiments (Read & Eagle, 2000) in which the stimuli were 128 × 128 pixels at a distance of 127 centimeters, subtending an angle 1.7° × 1.7°. Accordingly, the model retina was originally constructed to be 128 × 128 pixels, where each pixel represents an angle of 0.8 arcmin on the retina. In this article, I also use a more detailed model retina of 256 × 256 pixels. This is constructed to represent the same visual angle of 1.7° × 1.7°, meaning that each pixel now represents 0.4 arcmin. The pixel values of the spatial periods of the model’s channels and the prior scale length D were accordingly doubled. The computer memory and runtime required for simulations depend on the number of different simple and complex cells included. The simulations presented here use 128 different horizontal positions of receptive field (RF) centers. This dense sampling makes the model highly sensitive to variations in disparity across the stimulus. The simulations use simple cell RFs at just one vertical position in the model retina. Note that this does not mean that the model uses information only from a single horizontal strip across the image, since the RFs themselves extend over a wide region of the image.
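The pixel-to-angle bookkeeping here is simple but worth making explicit; a small helper (the name is mine) for the two grid sizes used:

```python
ARCMIN_PER_DEG = 60.0

def arcmin_per_pixel(n_pixels, field_deg=1.7):
    # Both model retinas span the same 1.7 deg of visual angle, so halving
    # the pixel size doubles all pixel-denominated lengths (e.g., the
    # channels' spatial periods and the prior scale length D).
    return field_deg * ARCMIN_PER_DEG / n_pixels

print(arcmin_per_pixel(128))  # ~0.8 arcmin per pixel
print(arcmin_per_pixel(256))  # ~0.4 arcmin per pixel
```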
Jenny C. A. Read
2.4 Perception. My hypothesis is that the local match probability P{(xL, y) ↔ (xR, y)} determines which matches are consciously perceived. Several different matches may be perceived for the same feature. In section 3, I show plots of P{(xL, y) ↔ (xR, y)}, plotted against xL and xR for a particular horizontal section through the retina. These plots imply a distribution of probability in space, for a particular horizontal plane in front of the observer. Lines of constant disparity d = xL − xR run diagonally up across the (xL, xR) plot; these correspond to frontoparallel lines in space. Lines of constant xc = (xL + xR)/2 run diagonally down across the (xL, xR) plot; these correspond to radial lines from the observer, at an angle xc to straight ahead. Incorporating a prior preference for small disparities, as implied by psychophysical data, inevitably means that lower posterior probability will be assigned to matches with nonzero disparities than to those with zero disparity, even if the matches are equally valid and thus have the same likelihood. But in our perceptual experience, disparate regions within Panum’s fusional limit are perceived as clearly as regions with zero disparity. This may imply that the relationship between probability and perception saturates, so that all probabilities above a certain threshold cause the same clarity of perception. This article postulates only that clarity of perception increases (not necessarily strictly) monotonically with probability. 2.5 Maximal Complex Cell Response. For comparison, I also consider an extension of the model of Qian (1994) to multiple spatial frequency and orientation channels. Qian’s model extracts, for each line of constant xc = (xL + xR)/2, the disparity d = xL − xR for which the complex cell firing rate is maximal. This yields a disparity map giving the best disparity as a function of position xc across the image: d_lh^best(xc) = argmax_d Clh(xc, d)
One simple way of extending this to multiple spatial frequency and orientation channels would be to sum the complex cell responses across all channels and extract the disparity, for each xc, at which this summed response is maximal: d^best(xc) = argmax_d Σ_{l,h} Clh(xc, d)
However, I found that this method gave noisy results; it was poor at extracting the disparity of the target region in a random dot stereogram. Better results were obtained by extracting the maximum within each channel independently and combining these by constructing a “maxima field” M(xc, d), where the value of M(xc, d) is the number of channels that had d_lh^best(xc) = d. This “maxima field” shares several properties with the Bayesian approach already described. It is independent of how we choose to relate firing rates
across different channels (whether by making all simple cells respond with the same firing rate to their optimal sinusoidal grating, or some other method). In the Bayesian approach, this was achieved by converting all firing rates into the common language of probability—here, by allowing each channel to signal only the position where its response is maximum rather than the value of that maximum. Similarly, Qian’s approach is capable of matching a single point xL in the left retina with several different points xR in the right retina. The form of uniqueness constraint it imposes is that along each line of sight xc, there should be only one disparity d. In the present version, this constraint is imposed on each channel separately. A conventional disparity map, such as those plotted by Qian (1994), could then be extracted from M(xc, d) by defining the disparity perceived at each xc to be that where M(xc, d) is maximal. To achieve good matches to human psychophysics, some weighting function would have to be imposed that favored small disparities (Cleary & Braddick, 1990; Prince & Eagle, 2000a, 2000b; Read & Eagle, 2000); this complication is neglected here. 3 Results 3.1 Random Dot Patterns. I begin by investigating the model’s response to binary random dot patterns, where the false-matching problem is at its most acute. I have previously (Read, 2002) demonstrated that the model performs close to 100% correct when tested with binary random dot stereograms in a front-back discrimination task. Now, I investigate the probability field in more detail and examine whether the model is capable of extracting the details of how disparity varies across the image. First I consider a central disparate region superimposed on a zero-disparity background. Figure 2 shows the images used in the simulation (A–C), and the results obtained (D–F), considering just the y = 0 horizontal cross-section through the image.
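The two channel-combination rules contrasted in section 2.5, summing responses across channels before taking the argmax versus counting per-channel maxima in a "maxima field", can be sketched with synthetic responses; the array layout and names are illustrative, not from the paper's code:

```python
import numpy as np

# Synthetic complex cell responses C[channel, xc, d]: 6 channels,
# 4 positions xc, 9 candidate disparities, with a peak at index 5.
rng = np.random.default_rng(0)
C = rng.random((6, 4, 9))
C[:, :, 5] += 1.0

# Rule 1: sum over channels, then argmax over d for each xc.
d_best_summed = C.sum(axis=0).argmax(axis=1)

# Rule 2: maxima field M(xc, d) = number of channels whose own
# argmax at that xc falls at disparity d.
M = np.zeros(C.shape[1:], dtype=int)
for ch in range(C.shape[0]):
    for xc, d in enumerate(C[ch].argmax(axis=1)):
        M[xc, d] += 1

print(d_best_summed)     # all 4 positions report disparity index 5
print(M.argmax(axis=1))  # the maxima field agrees here
```

On this clean synthetic input the two rules agree; the text's point is that on noisy stimuli the per-channel maxima are more robust than the summed response.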
Figure 2C shows the total luminance of the left and right images, at y = 0 (horizontal line in Figures 2A and 2B) and different x positions. Figure 2D shows the total response of complex cells, summed over all channels. Figure 2E shows the maxima field M(xc, d), obtained by extending the model of Qian (1994) to multiple spatial-frequency and orientation channels. Figure 2F shows the probability field obtained with the Bayesian approach. Both models succeed in extracting the disparate region. The lines mark regions of each monocular image that have no match in the other eye due to occlusion. Naturally enough, the models cannot assign these a clear disparity, although the imposed smoothness leads to a slight tendency to continue the disparity of adjacent binocularly viewed regions into the occluded region. 3.2 The Double-Nail Illusion. Here, I consider the model’s response to the double-nail illusion. The model was presented with the images in Figures 3A and 3B. The images in the left and right eyes are identical (apart
Figure 2: (A, B) The random dot stereogram used in obtaining the model results shown in the lower two plots. The stereogram is 256 × 256 pixels, representing 1.7° × 1.7° of visual angle. It contains a target region of 128 × 128 pixels, with disparity 18 pixels, centered on a background with zero disparity. The dots’ luminance relative to background is ±1300 times the standard deviation of the model’s retinal noise. (C) The sum of the left and right images, IL(xL, y) + IR(xR, y), at the vertical position y marked with the line in A and B. (D) The response of complex cells, summed over all spatial frequency and orientation channels. The grayscale at (xL, xR) represents the total response of the population of complex cells with left- and right-eye RFs centered on (xL, y) and (xR, y), respectively. (E) The maxima field M(xc, d) obtained by generalizing the model of Qian (1994). Within each channel, for each xc, the disparity where the complex cell response was maximal contributes 1 to M(xc, d). (F) The probabilistic disparity map, in which the grayscale at (xL, xR) represents the probability P{(xL, y) ↔ (xR, y)} that (xL, y) matches (xR, y).
Figure 3: (A, B) Stimuli for the double-nail illusion. Each image contains two dots, 3 pixels (2.4 arcmin) square, with luminance 1300 times the noise. Only the center 40 pixels are shown. The remaining panels concern the horizontal plane containing the two objects (horizontal line across image plots). Details as for Figure 2.
from the noise), each containing two objects positioned at xL = −3 pixels, xR = +3 pixels. Depending on the correspondence made, this can be interpreted as two bars with disparity 0, positioned at xc = ±3 pixels (2.4 arcmin) in the frontoparallel plane, or as two bars with xc = 0 and disparity ±6 pixels (4.8 arcmin). These four matches are apparent in the plots of total luminance (see Figure 3C) and in the complex cell response (see Figure 3D).
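The four candidate pairings can be enumerated directly, converting each (xL, xR) pair into cyclopean position xc = (xL + xR)/2 and disparity d = xL − xR (the function name is mine, for illustration):

```python
from itertools import product

def candidate_matches(left_dots, right_dots):
    # Every left-eye dot may in principle match every right-eye dot.
    return [((xL + xR) / 2.0, xL - xR)        # (xc, d)
            for xL, xR in product(left_dots, right_dots)]

# Dots at -3 and +3 pixels in each eye, as in Figure 3.
for xc, d in candidate_matches([-3, 3], [-3, 3]):
    print(f"xc = {xc:+.0f} px, d = {d:+d} px")
# Two zero-disparity matches at xc = -3 and +3 (the percept humans
# report), and two matches at xc = 0 with d = -6 and +6.
```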
Because neither the Bayesian nor the maxima model imposes a uniqueness constraint demanding a one-to-one match between xL and xR, they could potentially find all four matches. However, in fact the zero-disparity match is favored throughout the image, as is apparent from the dark stripe along the line of zero disparity, xL = xR, in Figures 3E and 3F. Since the background is devoid of features, any region of the background matches equally well with any other region. The smoothing implied by the receptive fields and (in the Bayesian case) the prior preference for small disparities ensure that the background is assigned zero disparity. So far there is little reason to prefer the more elaborate Bayesian analysis to the simpler maxima analysis based on the model of Qian (1994) and Qian and Zhu (1997). I therefore turn now to a stimulus where the Bayesian model performs better. 3.3 Occluding Bars/Panum’s Limiting Case. I consider a famous stimulus that provides a challenge to many existing models of stereopsis (Howard & Rogers, 1995; Qian, 1997) because it violates the uniqueness constraint. The visual scene is supposed to consist of two vertical bars, one centered on xc = 4 and disparity +6.4 arcmin (here, 8 pixels) and the other centered on xc = −4 and disparity −6.4 arcmin. Both bars fall at the same position in the left retina, xL = 0, while falling at different positions in the right retina, xR = ±8 pixels. Physically, in the left eye, the nearer bar occludes the farther bar, whereas in the right eye, both are visible. Thus, the correct match requires the position xL = 0 in the left retina to be matched both with position xR = −8 and with position xR = 8 in the right retina. The images and the results of the simulation are shown in Figure 4. The probability field in Figure 4F shows high probability at disparities d = ±8 pixels, as indicated by the dark stripes along the lines drawn at these disparities. Thus, the model successfully finds both correct matches.
Because it does not enforce a uniqueness constraint, it is able to match the single bar in the left eye simultaneously with both bars in the right eye. Although the bars themselves are only 3 pixels wide, the probability is nonzero for several pixels along the lines d = ±8, even though the prior preference for small disparities means that blank regions of the image are normally assigned zero disparity. These constant-disparity stripes reflect the smoothness constraint built into the model. The lowest-frequency simple cells have very large receptive fields, so the match implied by the bars can potentially influence the probability assigned to very distant matches. The prior preference for zero disparity is thus opposed by the assumption that adjacent points have the same disparity. Under the hypothesis that the probability plotted here underlies perception, the interpretation is that an isolated disparate object tends to produce the perception of being embedded in a disparate frontoparallel plane. This is consonant with perceptual experience and could potentially help the brain reconstruct smooth surfaces from discrete disparate stimuli (Grimson, 1982).
Figure 4: (A, B) Stimuli for Panum’s limiting case. The right image contains two bars and the left image one. Each bar is 3 pixels (2.4 arcmin) wide, with luminance 13 in arbitrary units. Only the center 40 pixels are shown. Remaining panels as in Figure 2.
In contrast, the maxima field in Figure 4E still shows clear remnants of the cruciform structure of the raw complex cell response. This leads to a predicted perception of a “ghost” object in between the two objects actually present. (In the stimuli used, this is at the fixation point; in general, wherever the bars are with respect to fixation, the ghost will lie exactly between them.) The origin of this ghost is clear if one traces along the line xc = 0, marked
Figure 5: Stimuli as for Figure 4 except that the bars have disparity ±2 pixels. Only the center 20 pixels are shown. In all panels, the diagonal lines indicate disparity ±2 pixels. (A) The sum of the left and right images, IL(xL, y) + IR(xR, y), at y = 0. (B) The response of complex cells, summed over all spatial frequency and orientation channels. (C) The probabilistic disparity map P{(xL, y) ↔ (xR, y)}. (D) Probability weighted by image intensity: P{(xL, y) ↔ (xR, y)} × [IL(xL, y) + IR(xR, y)]. This is intended to approximate the perceptual experience. Thus, the model predicts a perception of two bars, with disparity slightly greater than 2 pixels.
with a line in the plot of summed complex cell responses (see Figure 4D). Within each channel, the maximum complex cell response along this line occurs at zero disparity. If the bars’ disparity is reduced, the Bayesian model experiences a repulsion illusion. Figure 5 shows the results obtained with bars of disparity ±2 pixels (1.6 arcmin). Now, the model perceives the bars at slightly different positions in the left eye and at slightly larger disparities than veridical. This is especially apparent when we plot the probability field weighted by the combined images, [IL(xL, y) + IR(xR, y)], Figure 5D, in an attempt to show the spatial location of the visual objects perceived by the model. This illusion occurs when the separation between the images in the left
Figure 6: Why the bars in Panum’s limiting case tend to repel at small separations. The five rows show complex cells tuned to five different values of disparity d and xc. (A, B) The positions of the complex cell receptive fields in left and right retinas. (C) The corresponding position on the disparity map (circle = RF center). The shading in the disparity map indicates the probability assigned to the match. (D) The effect of all this on the probability map assigned to the stimulus. The bars tend to “repel” each other, being shifted away from each other along both d and xc.
and right eyes becomes smaller than the RF size in the majority of complex cells. Figure 6 explains why. Figures 6A and 6B represent the left and right retinas, with the positions of the bar stimuli in each eye indicated. The left retina contains a single bar at horizontal position xL; the right retina, two bars at positions xR1 and xR2. In addition, each plot contains an example complex cell RF. The circle represents the center of the RF in each retina, and the oval indicates the extent of the RF. The five rows of plots differ in the positions of the complex cell RFs. Figure 6C shows where the RF centers are located in the (xL, xR) disparity space familiar from the previous figures. The top row shows a complex cell tuned to a correct match, in which the bar in the left eye is correctly identified with a bar in the right eye (clearly
there is another correct match, given by the pairing with the other bar in the right eye, which is not shown here). However, although this match is in fact correct, the complex cell shown will accord it low probability. This is because the large RF of this cell extends over both bars in the right retina. The RFs in either eye are thus not stimulated equally, meaning that the calculated probability is low. The total match probability summed over all channels will be relatively low, as represented by the pale shading assigned to this correspondence in the disparity map in Figure 6C. The same effect occurs for the cells shown in the second row, which are tuned to the same disparity but a mean retinal position xc slightly to the right, and for those in the third row, which are tuned to the same mean retinal position xc as those in the first row, but a smaller disparity. Again, the complex cell reports low probability. The fourth row shows a cell that has the same xc as the first row but larger disparity; the fifth row shows a cell that has the same disparity but smaller xc. Here, exactly one bar falls in each eye’s RF. Since the RFs are symmetric, the complex cell receives equal stimulation in both eyes, and so signals high probability. This is represented by the dark shading in the disparity map. Figure 6D summarizes the results from the five rows. As in Figures 2 through 5, the grayscale at the point (xL, xR) represents the probability that the point xL is the correct match for xR. The correct match is not assigned the highest probability; matches with larger disparity or lower xc are considered more likely than the correct match. Thus, the bars are perceived shifted away from each other both in the front-back (d) and the left-right (xc) direction. This effect is due to channels whose RFs are longer horizontally than the separation between the bars’ images in the right eye.
The repulsion is therefore due predominantly to channels tuned to low spatial frequencies or to orientations close to horizontal. Similar repulsion illusions have been reported with human subjects (Ruda, 1998; Badcock & Westheimer, 1985) and have been reproduced with a non-Bayesian model also based on energy-model complex cells (Qian, 1997; Mikaelian & Qian, 2000). At smaller bar separations still, human observers display an apparent “pooling” effect: the bars appear to attract rather than repel each other, and the actual disparity perceived is an average of the disparities of each bar. Qian’s model reproduces this effect. This model, however, cannot display such pooling. This is because when two bars fall in one eye’s RF and only one in the other, the two eyes’ RFs are very unequally stimulated; the model naturally assigns a very low probability to such a match. This deficiency might be addressed by including some form of local contrast normalization between the left and right images.
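One generic form the suggested local contrast normalization could take is to divide each pixel by the RMS contrast of a surrounding window. This sketch is my own construction, offered only as a plausible implementation of the idea, not as part of the model:

```python
import numpy as np

def local_contrast_normalize(img, window=9, eps=1e-6):
    # Divide each pixel by the standard deviation of a surrounding
    # window (a generic scheme; window size and eps are arbitrary).
    img = np.asarray(img, dtype=float)
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = padded[i:i + window, j:j + window]
            out[i, j] = img[i, j] / (patch.std() + eps)
    return out

# A high-contrast and a low-contrast copy of the same pattern become
# nearly identical after normalization, so interocular contrast
# differences would no longer penalize correct matches.
rng = np.random.default_rng(1)
pattern = rng.standard_normal((16, 16))
hi = local_contrast_normalize(5.0 * pattern)
lo = local_contrast_normalize(0.5 * pattern)
print(np.allclose(hi, lo, atol=1e-3))  # -> True
```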
4 Discussion
This article discusses one way of applying a probabilistic approach to the solution of the correspondence problem. Rather than seeking a unique match for every point in each retina, I propose assigning a probability to each potential match between left and right retinal positions. This is interpreted as the probability that an object exists in front of the viewer at the spatial location implied by the match. This probability is assumed to underlie perception: zeros of the probability field imply no perception of an object at that location, whereas nonzero values of probability imply a perception, whose clarity is presumed to increase with increasing probability, of an object with luminance given by the mean of the values in the left and right retinas (see Figure 1). There are many ways in which the brain might attempt to assign a probability to each potential match. The approach adopted here was motivated by a desire to incorporate as much as possible of the known physiology of binocular vision. The available physiological and psychophysical evidence suggests that initial processing takes place within channels tuned to a particular spatial frequency and orientation. Accordingly, the match probability is initially calculated within a single channel, using a Bayesian analysis based on the outputs of disparity-tuned complex cells and the simple cells that feed into them. This analysis introduces the two key constraints employed by the model in order to overcome the ill-posed nature of the correspondence problem. First, the disparity is smoothed over the RF dimensions in each channel. Second, the Bayesian prior is used to enforce a preference for small disparities. Information from different channels is then combined by averaging the probability reported from each individual channel. This means that information from all spatial scales is handled simultaneously; the model employs no coarse-to-fine hierarchy (Mallot, Gillner, & Arndt, 1996).
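The channel-combination step described above (averaging the probability reported by each channel, with no coarse-to-fine ordering) reduces to a one-liner; the synthetic per-channel probabilities below are illustrative:

```python
import numpy as np

# Synthetic per-channel match probabilities: 8 channels, each assigning a
# normalized probability to 5 candidate matches (shapes are illustrative).
rng = np.random.default_rng(2)
P_channel = rng.random((8, 5))
P_channel /= P_channel.sum(axis=1, keepdims=True)

# Combine channels by simple averaging; every spatial scale
# contributes at once, with no hierarchy among channels.
P_combined = P_channel.mean(axis=0)
print(P_combined.sum())  # still sums to 1 (up to rounding)
```

Averaging normalized distributions yields another normalized distribution, which is what lets the combined field be read directly as a match probability.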
Bayesian stereo algorithms have a long history in the computer vision literature (Szeliski, 1990; Geiger, Ladendorf, & Yuille, 1992; Chang & Chatterjee, 1992; Scharstein & Szeliski, 1998). This article differs from these in two major respects. First, it attempts to incorporate the known physiology of disparity-tuned cells in primary visual cortex. Thus, it applies Bayesian ideas to the results of previous workers who used linear filters as a matching primitive (Lehky & Sejnowski, 1990; Jones & Malik, 1992; Sanger, 1988; Qian, 1994). The probability field is then derived (albeit with a number of simplifications and approximations) from the statistics of retinal images degraded by gaussian noise and processed by simple and complex cells. In contrast, previous Bayesian models have usually been postulated to take a Gibbs form: exp(−potential/temperature), where the potential is some cost function incorporating punishment for disparity discontinuities, poor luminance matches, and so on. Second, this article postulates that the probability field directly underlies perception. Previous studies incorporating Bayesian ideas have generally
used it as a tool to arrive at a single match between points in the left and right eyes, explicitly including a form of uniqueness constraint (often allowing for occlusion—each point in the left image must match at most one point in the right image). The approach here was designed to allow for the possibility of multiple matches. The model is closest to that developed by Qian and coworkers (Qian, 1994; Zhu & Qian, 1996; Qian & Zhu, 1997; Qian & Andersen, 1997) and Prince and Eagle (2000a). It differs from their work in incorporating multiple spatial frequency and orientation channels and in processing the complex cell output to assign a probability to each match. The model presented here has a number of serious limitations, which means that it can be regarded only as a preliminary model of stereo correspondence. It was initially developed to model psychophysical data obtained in a front-back depth discrimination task in which images were presented for 130 ms. Although it succeeds admirably in this task, the percept implied by the model lacks the clarity and sharpness of human perceptions. For instance, the ragged outline of the target region perceived in the random dot stereogram of Figure 2 does not accord with the sharp, square outline that humans viewing this figure perceive. This is perhaps not surprising for a model based on the known physiology of primary visual cortex, which incorporates no hierarchy of interactions or iterative processes that sharpen the performance of many other stereo models (cf. Marr & Poggio, 1976; Geiger et al., 1992; Scharstein & Szeliski, 1998). Thus, this model may represent an initial step in stereoscopic perception, providing a sketch of the 3D visual scene, which is then processed by still higher visual areas. The model makes use of a prior probability assigned to each disparity but does not address how the brain could arrive at such a judgment.
It does not include any mechanism for updating the prior on the basis of visual experience or information from nonvisual sources. A full Bayesian model would require such a mechanism, based on experimental evidence of how performance develops with training. The model fails to show the correct response with sparse anticorrelated stereograms. Anticorrelated stereograms are those in which the polarity of one eye’s image has been inverted, with black pixels becoming white and vice versa. Dense examples of such stimuli, such as binary random dot patterns, produce little or no impression of depth in most human subjects, although some display a slight tendency to see depth in the opposite direction to that implied by the disparity of the stereogram. The model reproduces this behavior (Read, 2002). However, sparse anticorrelated stimuli give human observers the impression of depth in the veridical direction (von Helmholtz, 1909; Cogan, Kontsevich, Lomakin, Halpern, & Blake, 1995), presumably because boundaries are extracted and matched even though the polarity of the boundaries is opposite. The model presented here assigns low probability to all anticorrelated potential matches and thus has no mechanism for reproducing this behavior.
Further, the model currently includes no form of contrast normalization between left and right eyes. It seems likely that some such normalization exists, because human observers can fuse stereograms even where considerable contrast differences exist between the monocular images. The model cannot, because any contrast differences between left and right eyes mean that even the correct matches are considered improbable. In addition, the model fails to reproduce the “disparity pooling” observed with disparate bars at very small separations. A suitable scheme of local contrast normalization might be able to correct both of these discrepancies, although considerable work might be required to find a suitable implementation. Despite these limitations, the model is able to produce plausible solutions of the correspondence problem for a range of stimuli. It extracts the appropriate disparity map in dense random dot stereograms containing several target regions with different disparity. In the double-nail illusion, it perceives only two of four possible matches, in agreement with human observers. Models that impose a strict uniqueness constraint cannot handle Panum’s limiting case, whereas the model presented here performs correctly with this stimulus. This is an advantage of the Bayesian approach over models with a similar physiological basis but a different method of extracting disparity, such as that of Qian (1994) and Qian and Zhu (1997). Finally, it has already been shown (Read, 2002) that the model accurately reproduces psychophysical functions in a front-back discrimination task, for both correlated and anticorrelated random noise stereograms with a range of spatial frequency and orientation bandwidths. Thus, the model combines physiological plausibility with wide explanatory power. Acknowledgments
I am supported by a Mathematical Biology Training Fellowship from the Wellcome Trust. The simulations presented in this article were run with support from the Oxford Supercomputing Centre. I thank Bruce Cumming for helpful discussions. References Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2), 284–299. Anstis, S. M., & Rogers, B. J. (1975). Illusory reversal of visual depth and movement during changes of contrast. Vision Research, 15, 957–961. Anzai, A., Ohzawa, I., & Freeman, R. (1999). Neural mechanisms for processing binocular information I. Simple cells. Journal of Neurophysiology, 82(2), 891– 908. Badcock, D. R., & Westheimer, G. (1985). Spatial location and hyperacuity: Flank position within the centre and surround zones. Spatial Vision, 1(1), 3–11.
Chang, C., & Chatterjee, S. (1992). A deterministic approach for stereo disparity calculation. In G. Sandini (Ed.), Computer vision: ECCV’92 (pp. 425–433). New York: Springer-Verlag. Cleary, R., & Braddick, O. (1990). Direction discrimination for band-pass filtered random dot kinematograms. Vision Research, 30(2), 303–316. Cogan, A. I., Kontsevich, L. L., Lomakin, A. J., Halpern, D. L., & Blake, R. (1995). Binocular disparity processing with opposite-contrast stimuli. Perception, 24(1), 33–47. Cogan, A. I., Lomakin, A. J., & Rossi, A. F. (1993). Depth in anticorrelated stereograms: Effects of spatial density and interocular delay. Vision Research, 33(14), 1959–1975. Cumming, B., Shapiro, S. E., & Parker, A. J. (1998). Disparity detection in anticorrelated stereograms. Perception, 27(11), 1367–1377. de Valois, R. L., Albrecht, D. G., & Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22(5), 545–559. de Valois, R. L., & de Valois, K. K. (1988). Spatial vision. Oxford: Oxford University Press. de Valois, R., Yund, E. W., & Hepler, N. (1982). The orientation and direction selectivity of cells in macaque visual cortex. Vision Research, 22(5), 531–544. Dev, P. (1975). Perception of depth surfaces in random-dot stereograms: A neural model. International Journal of Man-Machine Studies, 7, 511–528. Fleet, D., Wagner, H., & Heeger, D. (1996). Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Research, 36(12), 1839–1857. Geiger, D., Ladendorf, B., & Yuille, A. (1992). Occlusions and binocular stereo. In G. Sandini (Ed.), Computer vision: ECCV’92 (pp. 425–433). New York: Springer-Verlag. Grimson, W. E. (1981). A computer implementation of a theory of human stereo vision. Philosophical Transactions of the Royal Society, London, B: Biological Sciences, 292(1058), 217–253. Grimson, W. E. (1982). A computational theory of visual surface interpolation.
Philosophical Transactions of the Royal Society, London, B: Biological Sciences, 298(1092), 395–427. Howard, I. P., & Rogers, B. J. (1995). Binocular vision and stereopsis. Oxford: Oxford University Press. Jones, D. G., & Malik, J. (1992). A computational framework for determining stereo correspondence from a set of linear spatial filters. In G. Sandini (Ed.), Computer vision: ECCV’92 (pp. 395–410). New York: Springer-Verlag. Julesz, B. (1971). Foundations of cyclopean perception. Chicago: University of Chicago Press. Knill, D., & Richards, W. (1996). Perception as Bayesian inference. Cambridge: Cambridge University Press. Krol, J. D., & van de Grind, W. A. (1980). The double-nail illusion: Experiments on binocular vision with nails, needles, and pins. Perception, 9(6), 651–669. Lehky, S. R., & Sejnowski, T. J. (1990). Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10(7), 2281–2299.
Mallot, H., & Bideau, H. (1990). Binocular vergence influences the assignment of stereo correspondences. Vision Research, 30(10), 1521–1523. Mallot, H. A., Gillner, S., & Arndt, P. A. (1996). Is correspondence search in human stereo vision a coarse-to-fine process? Biological Cybernetics, 74(2), 95–106. Mansfield, J. S., & Parker, A. (1993). An orientation-tuned component in the contrast masking of stereopsis. Vision Research, 33(11), 1535–1544. Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194(4262), 283–287. Marr, D., & Poggio, T. (1979). A computational theory of human stereo vision. Proceedings of the Royal Society, London, B: Biological Sciences, 204(1156), 301–328. McKee, S. P., Bravo, M. J., Smallman, H. S., & Legge, G. E. (1995). The “uniqueness constraint” and binocular masking. Perception, 24(1), 49–65. McLoughlin, N. P., & Grossberg, S. (1998). Cortical computation of stereo disparity. Vision Research, 38(1), 91–99. Mikaelian, S., & Qian, N. (2000). A physiologically-based explanation of disparity attraction and repulsion. Vision Research, 40(21), 2999–3016. Movshon, J., Thompson, I., & Tolhurst, D. J. (1978). Spatial summation in the receptive fields of simple cells in the cat’s striate cortex. Journal of Physiology, 283, 53–77. Nelson, J. I. (1975). Globality and stereoscopic fusion in binocular vision. Journal of Theoretical Biology, 49(1), 1–88. Ohzawa, I., DeAngelis, G., & Freeman, R. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249(4972), 1037–1041. Ohzawa, I., DeAngelis, G., & Freeman, R. (1997). Encoding of binocular disparity by complex cells in the cat’s visual cortex. Journal of Neurophysiology, 77(6), 2879–2909. Pollard, S. B., & Frisby, J. P. (1990). Transparency and the uniqueness constraint in human and computer stereo vision. Nature, 347(6293), 553–556. Pollard, S. B., Mayhew, J. E., & Frisby, J. P. (1985).
PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception, 14(4), 449–470. Prince, S. J. D., & Eagle, R. A. (2000a) Weighted directional energy model of human stereo correspondence. Vision Research, 40(9), 1143–1155. Prince, S. J. D., & Eagle, R. A. (2000b) Stereo correspondence in one-dimensional Gabor stimuli. Vision Research, 40(9), 913–924. Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6, 390–404. Qian, N. (1997). Binocular disparity and the perception of depth. Neuron, 18(3), 359–368. Qian, N., & Andersen, R. A. (1997). A physiological model for motion-stereo integration and a unied explanation of Pulfrich-like phenomena. Vision Research, 37(12), 1683–1698. Qian, N., & Sejnowski, T. (1989). Learning to solve random-dot stereograms of dense and transparent surfaces with recurrent backpropagation. In Pro-
1392
Jenny C. A. Read
ceedings of the 1988 Connectionist models summer school (pp. 435–443). London: Harcourt. Qian, N., & Zhu, Y. (1997). Physiological computation of binocular disparity. Vision Research, 37(13), 1811–1827. Read, J. C. A. (2002). A Bayesian model of stereo depth/motion direction discrimination. Biological Cybernetics, 86(2), 117–136. Read, J. C. A., & Eagle, R. A. (2000). Reversed stereo depth and motion direction with anti-correlated stimuli. Vision Research, 40(24), 3345–3358. Ruda, H. (1998). The warped geometry of visual space near a line assessed using a hyperacuity displacement task. Spatial Vision, 11(4), 401–419. Sanger, T. (1988). Stereo disparity computation using Gabor lters. Biological Cybernetics, 59, 405–418. Scharstein, D., & Szeliski, R. (1998). Stereo matching with nonlinear diffusion. International Journal of Computer Vision, 28, 155–174. Szeliski, R. (1990). Bayesian modelling of uncertainty in low-level vision. International Journal of Computer Vision, 5, 271–301. von Helmholtz, H. (1909). Handbuch der physiologischen Optik. Hamburg, Germany: Voss. Weinshall, D. (1989). Perception of multiple transparent planes in stereo vision. Nature, 341(6244), 737–739. Westheimer, G. (1986). Panum’s phenomenon and the conuence of signals from the two eyes in stereoscopy. Philosophical Transactions of the Royal Society, London, B: Biological Sciences, 228(1252), 289–305. Zhu, Y. D., & Qian, N. (1996). Binocular receptive eld models, disparity tuning, and characteristic disparity. Neural Computation, 8(8) , 1611–1641. Received April 3, 2001; accepted October 15, 2001.
LETTER
Communicated by Chris Williams
Learning Curves for Gaussian Process Regression: Approximations and Bounds

Peter Sollich
[email protected]

Anason Halees
[email protected]

Department of Mathematics, King's College London, London WC2R 2LS, U.K.
We consider the problem of calculating learning curves (i.e., average generalization performance) of gaussian processes used for regression. On the basis of a simple expression for the generalization error, in terms of the eigenvalue decomposition of the covariance function, we derive a number of approximation schemes. We identify where these become exact and compare with existing bounds on learning curves; the new approximations, which can be used for any input space dimension, generally get substantially closer to the truth. We also study possible improvements to our approximations. Finally, we use a simple exactly solvable learning scenario to show that there are limits of principle on the quality of approximations and bounds expressible solely in terms of the eigenvalue spectrum of the covariance function.

1 Introduction
Within the neural networks community, there has in the past few years been a good deal of excitement about the use of gaussian processes (GPs) as an alternative to feedforward networks (see, e.g., Williams & Rasmussen, 1996; Williams, 1997; Barber & Williams, 1997; Goldberg, Williams, & Bishop, 1998; Sollich, 1999a; Malzahn & Opper, 2001). The advantages of GPs are that prior assumptions about the problem to be learned are encoded in a very transparent way and that inference—at least in the case of regression that we will consider—is relatively straightforward. GPs are also "nonparametric" in the sense that their effective number of parameters (degrees of freedom) can grow arbitrarily large as more and more training data are collected. Finally, interest in GPs has also been stimulated by the fact that they are at the heart of the large family of kernel machine learning methods (see, e.g., www.kernel-machines.org).

Neural Computation 14, 1393–1428 (2002). © 2002 Massachusetts Institute of Technology

One crucial question for applications is then how fast GPs learn—that is, how many training examples are needed to achieve a certain level of generalization performance. The typical (as opposed to worst-case) behavior is
captured in the learning curve, which gives the average generalization error ε as a function of the number of training examples n. Several workers have derived bounds on ε(n) (Michelli & Wahba, 1981; Plaskota, 1990; Opper, 1997; Trecate, Williams, & Opper, 1999; Opper & Vivarelli, 1999; Williams & Vivarelli, 2000) or studied its large n asymptotics (Silverman, 1985; Ritter, 1996). As we will illustrate below, however, the existing bounds are often far from tight, and asymptotic results will not necessarily apply for realistic sample sizes n. Our main aim in this article is therefore to derive approximations to ε(n) that get closer to the true learning curves than existing bounds, and apply for both small and large n. We compare these approximations with existing bounds and the results of numerical simulations; possible improvements to the approximations are also discussed. Finally, we study an analytically solvable example scenario that sheds light on how tight bounds on learning curves can be made in principle. Summaries of the early stages of this work have appeared in conference proceedings (Sollich, 1999a, 1999b).

In its simplest form, the regression problem that we are considering is this: We are trying to learn a function θ* that maps inputs x (real-valued vectors) to (real-valued scalar) outputs θ*(x). We are given a set of training data D, consisting of n input-output pairs (x_l, y_l); the training outputs y_l may differ from the clean target outputs θ*(x_l) due to corruption by noise. Given a test input x, we are then asked to come up with a prediction θ(x) for the corresponding output, expressed either in the simple form of a mean prediction θ̂(x) plus error bars, or more comprehensively in terms of a predictive distribution P(θ(x)|x, D). In a Bayesian setting, we do this by specifying a prior P(θ) over our hypothesis functions and a likelihood P(D|θ) with which each θ could have generated the training data; from this, we deduce the posterior distribution P(θ|D) ∝ P(D|θ) P(θ). If we wanted to use a feedforward network for this task, we could proceed as follows: Specify candidate networks by a set of weights w, with prior probability P(w).
Each network defines a (stochastic) input-output relation described by the distribution of output y given input x (and weights w), P(y|x, w). Multiplying over the whole data set, we get the probability of the observed data having been produced by the network with weights w: P(D|w) = ∏_{l=1}^{n} P(y_l|x_l, w). Bayes' theorem then gives us the posterior—the probability of network w given the data—as P(w|D) ∝ P(D|w) P(w), up to an overall normalization factor. From this, finally, we get the predictive distribution P(y|x, D) = ∫ dw P(y|x, w) P(w|D). This solves the regression problem in principle, but leaves us with a nasty integral over all possible network weights: the posterior P(w|D) generally has a highly nontrivial structure, with many local peaks (corresponding to local minima in the training error). One therefore has to use sophisticated Monte Carlo integration techniques (Neal, 1993) or local approximations to P(w|D) around its maxima (MacKay, 1992) to tackle this problem. Even once this has been done, one is still left with the question of how to interpret the results. We may, for example, want to select priors on the basis of the
data, by making the prior P(w|θ) dependent on a set of hyperparameters θ and choosing θ such as to maximize the probability P(D|θ) of the data. Once we have found the optimal prior, we would then hope that it tells us something about the regression problem at hand (whether certain input components are irrelevant, for example). This would be easy if the prior revealed directly how likely certain input-output functions are; instead, we have to extract this information from the prior over weights, often a complicated process. By contrast, for a GP it is an almost trivial task to obtain the posterior and the predictive distribution (see below). One reason for this is that the prior P(θ) is defined directly over input-output functions θ. How is this done? Any θ is uniquely determined by its output values θ(x) for all x from the input domain, and for a GP, these are simply assumed to have a joint gaussian distribution (hence, the name). This distribution can be specified by the mean values ⟨θ(x)⟩_θ (which we take to be zero in the following, as is commonly done), and the covariances ⟨θ(x) θ(x′)⟩_θ = C(x, x′); C(x, x′) is called the covariance function of the GP. It encodes in an easily interpretable way prior assumptions about the function to be learned. Smoothness, for example, is controlled by the behavior of C(x, x′) for x′ → x. The Ornstein-Uhlenbeck (OU) covariance function C(x, x′) ∝ exp(−‖x − x′‖/l) produces very rough (nondifferentiable) functions, while functions sampled from the radial basis function (RBF) prior with C(x, x′) ∝ exp[−‖x − x′‖²/(2l²)] are—in the mean-square sense—infinitely differentiable. (Intermediate priors yielding r times mean-square differentiable functions can also be defined by using modified Bessel functions as covariance functions; see Stein, 1989.) Figure 1 illustrates these characteristics with two samples from the OU and RBF priors, respectively, over a two-dimensional input domain.
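Samples like those in Figure 1 are easy to generate numerically. The sketch below is illustrative only (the function and variable names are ours, not the paper's, and we use a 1-D grid rather than the figure's 2-D domain): build the covariance matrix on a grid and multiply a standard gaussian vector by its Cholesky factor.

```python
import numpy as np

def sample_gp_prior(cov, grid, jitter=1e-8, seed=0):
    """Draw one sample function from a zero-mean GP prior.

    cov  : callable giving the covariance C(x, x') for scalar inputs
    grid : 1-D array of input points
    """
    # Covariance matrix of the function values on the grid
    C = np.array([[cov(x, xp) for xp in grid] for x in grid])
    # Cholesky factor; the small jitter keeps C numerically positive definite
    L = np.linalg.cholesky(C + jitter * np.eye(len(grid)))
    rng = np.random.default_rng(seed)
    return L @ rng.standard_normal(len(grid))

l = 0.1
ou = lambda x, xp: np.exp(-abs(x - xp) / l)                # rough samples
rbf = lambda x, xp: np.exp(-(x - xp) ** 2 / (2 * l ** 2))  # smooth samples

grid = np.linspace(0, 1, 200)
theta_ou = sample_gp_prior(ou, grid)
theta_rbf = sample_gp_prior(rbf, grid)
```

The difference in roughness shows up directly in the samples: the squared increments of the OU draw are much larger than those of the RBF draw at the same length scale.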
The length scale parameter l in the covariance functions also has an intuitive meaning. It corresponds directly to the distance in input space over which we expect our function to vary significantly. More complex properties can also be encoded; by replacing l with different length scales for each input component, for example, relevant (small l) and irrelevant (large l) inputs can be distinguished. How does inference with GPs work? We give only a brief summary here and refer to existing reviews on the subject (see, e.g., Williams, 1998) for details. It is simplest to assume that outputs y are generated from the clean values of a hypothesis function θ(x) by adding gaussian noise of x-independent variance σ². The joint distribution of a set of n training outputs {y_l} and the function values θ(x) are then also gaussian, with covariances given by

⟨y_l y_m⟩ = C(x_l, x_m) + σ² δ_{lm} = (K)_{lm}
⟨y_l θ(x)⟩ = C(x_l, x) = (k(x))_l,

where we have defined an n × n matrix K and an x-dependent n-component vector k(x). The posterior distribution P(θ|D) is then obtained by simply
Figure 1: Samples θ(x) drawn from GP priors over functions on [0, 1]². (Top) OU covariance function, C(x, x′) = exp(−‖x − x′‖/l). (Bottom) RBF covariance function, C(x, x′) = exp[−‖x − x′‖²/(2l²)]. The length scale l = 0.1 determines in both cases over what distance the functions vary significantly. Note the difference in roughness of the two functions; this is related to the behavior of the covariance functions for x′ → x.
conditioning on the {y_l}. It is again gaussian and has mean

θ̂(x, D) ≡ ⟨θ(x)⟩_{θ|D} = k(x)^T K^{−1} y    (1.1)

and variance

ε(x, D) ≡ ⟨(θ(x) − θ̂(x))²⟩_{θ|D} = C(x, x) − k(x)^T K^{−1} k(x).    (1.2)
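Equations 1.1 and 1.2 translate almost line for line into code. The following is a minimal sketch (our own helper names; a production implementation would factor K once and reuse it):

```python
import numpy as np

def gp_posterior(X, y, x_star, cov, sigma2):
    """Posterior mean and variance (eqs. 1.1 and 1.2) at one test input."""
    n = len(X)
    # K = C(x_l, x_m) + sigma^2 delta_lm ; k(x)_l = C(x_l, x)
    K = np.array([[cov(a, b) for b in X] for a in X]) + sigma2 * np.eye(n)
    k = np.array([cov(a, x_star) for a in X])
    Kinv_k = np.linalg.solve(K, k)
    mean = Kinv_k @ y                        # eq. 1.1
    var = cov(x_star, x_star) - k @ Kinv_k   # eq. 1.2
    return mean, var

rbf = lambda a, b: np.exp(-(a - b) ** 2 / (2 * 0.1 ** 2))
X = np.array([0.2, 0.5, 0.8])
y = np.array([1.0, -0.5, 0.3])
m, v = gp_posterior(X, y, 0.5, rbf, sigma2=0.05)
```

At a training input that is well separated from the others, the posterior variance drops from the prior value 1 to roughly σ²/(1 + σ²), in line with the discussion of Figure 2 below.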
Equations 1.1 and 1.2 solve the inference problem for a GP. They provide us directly with the predictive distribution P(θ(x)|x, D). The posterior variance, equation 1.2, in fact also gives us the expected generalization error (or Bayes error) at x. Why? If the true target function is θ*, the squared deviation between our mean prediction and the true output¹ is (θ̂(x) − θ*(x))²; averaging this over the posterior distribution of target functions P(θ*|D) gives equation 1.2. The underlying assumption is that our assumed GP prior is the true one from which the target functions are actually generated (and that we are using the correct noise model). Otherwise, the expected generalization error is larger and given by a more complicated expression (Williams & Vivarelli, 2000). In line with most other work on the subject, we consider only the "correct prior" case in the following. Averaging the generalization error at x over the distribution of inputs gives

ε(D) = ⟨ε(x, D)⟩_x = ⟨C(x, x) − k(x)^T K^{−1} k(x)⟩_x.    (1.3)

This form of the generalization error, which is well known (Michelli & Wahba, 1981; Opper, 1997; Williams, 1998; Williams & Vivarelli, 2000), still depends on the training inputs; the fact that the training outputs have dropped out already is a signature of the fact that GPs are linear predictors (compare equation 1.1). Averaging over data sets yields the quantity we are after:

ε = ⟨ε(D)⟩_D.    (1.4)
This average expected generalization error (we will drop the "average expected" in the following) depends only on the number of training examples n; the function ε(n) is called the learning curve. Its exact calculation is difficult because of the joint average in equations 1.3 and 1.4 over the training inputs x_l and the test input x. Before proceeding with our calculation of the learning curve ε(n), let us try to gain some intuitive insight into its dependence on n. Consider a simple

¹ One can also measure the generalization error by the squared deviation between the prediction θ̂(x) and the noisy version of the true output; this simply adds a term σ² to equation 1.3.
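The averages in equations 1.3 and 1.4 are also easy to estimate by Monte Carlo, which is how the "true" learning curves referred to later can be obtained. A sketch (our own function names; uniform inputs on [0, 1] and C(x, x) = 1 are assumed, as in the scenario of Figure 2):

```python
import numpy as np

def mc_learning_curve(n_values, cov, sigma2, n_sets=20, n_test=200, seed=0):
    """Monte Carlo estimate of eps(n) = <eps(D)>_D, eqs. 1.3 and 1.4."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0, 1, n_test)
    curve = []
    for n in n_values:
        total = 0.0
        for _ in range(n_sets):
            X = rng.uniform(0, 1, n)                      # one training set D
            K = cov(X[:, None], X[None, :]) + sigma2 * np.eye(n)
            k = cov(X[:, None], x_test[None, :])          # n x n_test
            # eq. 1.3 with C(x, x) = 1, averaged over the test grid
            total += np.mean(1.0 - np.sum(k * np.linalg.solve(K, k), axis=0))
        curve.append(total / n_sets)
    return np.array(curve)

rbf = lambda a, b: np.exp(-(a - b) ** 2 / (2 * 0.1 ** 2))
eps = mc_learning_curve([2, 100], rbf, sigma2=0.05)
```

With n = 2 the estimate stays of order the prior variance, while with n = 100 it has dropped to order σ², matching the two panels of Figure 2.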
Figure 2: Generalization error ε(x, D) as a function of input position x ∈ [0, 1], for noise level σ² = 0.05, RBF covariance function C(x, x′) = exp[−|x − x′|²/(2l²)] with l = 0.1, for randomly drawn training sets D of size n = 2 (left) and n = 100 (right). To emphasize the difference in scale, the plot on the left also includes the results for n = 100 shown on the right, just visible below the dashed line at ε(x, D) = σ².
example scenario, where inputs x are one-dimensional and drawn randomly from the unit interval [0, 1], with uniform probability. For the covariance function, we choose an RBF form, C(x, x′) = exp[−|x − x′|²/(2l²)] with l = 0.1. Here, we have taken the prior variance C(x, x) as unity; as seems realistic for most applications, we assume the noise level to be much smaller than this, σ² = 0.05. Figure 2 illustrates the x-dependence of the generalization error ε(x, D) for a small training set (n = 2). Each of the examples has made a "dent" in ε(x, D), with a shape that is similar to that of the covariance function.² Outside the dents, ε(x, D) still has essentially its prior value, ε(x, D) = 1; at the center of each dent, it is reduced to a much smaller value, ε(x, D) ≈ σ²/(1 + σ²) (this approximation holds as long as the different training inputs are sufficiently far away from each other). The generalization error ε(D) is therefore dominated by regions where no training examples have been seen; one has ε(D) ≫ σ², and the precise value of ε(D) depends only very little on σ² (assuming always that σ² ≪ 1). Gradually, as n is increased, the covariance dents will cover the input space, so that ε(x, D) and ε(D) become of order σ²; this situation is shown on the right of Figure 2. From this point on, further training examples essentially have the effect of averaging out noise, eventually making ε(D) ≪ σ² for large enough n.

² More precisely, the dents have the shape of the square of the covariance function. If the training inputs x_i are sufficiently far apart, then around each x_i we can neglect the influence of the other data points and apply equation 1.2 with n = 1, giving ε(x, D) ≈ C(x, x) − C²(x, x_i)/[C(x_i, x_i) + σ²].

In
summary, we expect the learning curve ε(n) to have two regimes. In the initial (small n) regime, where ε(n) ≫ σ², ε(n) is essentially independent of σ² and reflects mainly the geometrical distribution of covariance dents across the input space. In the asymptotic regime (n large enough such that ε(n) ≪ σ²), the noise level σ² is important in controlling the size of ε(n) because learning arises mainly from the averaging out of noise in the training data.

The remainder of this article is structured as follows. In the next section, we derive several approximations to GP learning curves, starting from an exact representation in terms of the eigenvalues and eigenfunctions of the covariance function. In section 3, we compare these approximations to previous bounds and find, across a range of learning scenarios, that they generally get rather closer to the true learning curves. This section also includes extensions of two existing bounds to a larger domain of applicability. In section 4, we investigate the potential for improving the accuracy of our approximations further. Finally, section 5 deals with the question of whether there are limits of principle on the quality of learning curve approximations and bounds expressible solely in terms of the eigenvalues of the covariance function; we find that such limits do indeed exist. In the final section, we summarize our results and give an overview of the challenges that remain.

2 Approximate Learning Curves
Calculating learning curves for GPs exactly is a difficult problem because of the joint average in equations 1.3 and 1.4 over the training inputs x_l and the test input x. Several workers have therefore derived upper and lower bounds on ε (Michelli & Wahba, 1981; Plaskota, 1990; Opper, 1997; Williams & Vivarelli, 2000) or studied the large n asymptotics of ε(n) (Silverman, 1985; Ritter, 1996). As we will illustrate below, however, the existing bounds are often far from tight; similarly, asymptotic results can only capture the large n regime defined above and will not necessarily apply for sample sizes n occurring in practice. We therefore now attempt to derive approximations to ε(n) that get closer to the true learning curves than existing bounds, and are applicable for both small and large n. As a starting point for an approximate calculation of ε(n), we first derive a representation of the generalization error in terms of the eigenvalue decomposition of the covariance function. Mercer's theorem (see, e.g., Wong, 1971) tells us that the covariance function can be decomposed into its eigenvalues λ_i and eigenfunctions φ_i(x):

C(x, x′) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(x′).    (2.1)
This is simply the analog of the eigenvalue decomposition of a finite symmetric matrix. We assume here that eigenvalues and eigenfunctions are
defined relative to the distribution over inputs x, i.e.,

⟨C(x, x′) φ_i(x′)⟩_{x′} = λ_i φ_i(x).    (2.2)
The eigenfunctions are then orthogonal with respect to the same distribution, ⟨φ_i(x) φ_j(x)⟩_x = δ_{ij} (see, e.g., Williams & Seeger, 2000). Now write the data-dependent generalization error, equation 1.3, as

ε(D) = ⟨C(x, x)⟩_x − tr K^{−1} ⟨k(x) k(x)^T⟩_x

and perform the x-average:

(⟨k(x) k(x)^T⟩_x)_{lm} = Σ_{ij} λ_i λ_j φ_i(x_l) ⟨φ_i(x) φ_j(x)⟩_x φ_j(x_m) = Σ_i λ_i² φ_i(x_l) φ_i(x_m).

This suggests introducing the diagonal matrix (Λ)_{ij} = λ_i δ_{ij} and the design matrix (Ψ)_{li} = φ_i(x_l), so that ⟨k(x) k(x)^T⟩_x = Ψ Λ² Ψ^T. One then also has

⟨C(x, x)⟩_x = tr Λ,    (2.3)

and the matrix K is expressed as

K = σ² I + Ψ Λ Ψ^T,

with I being the identity matrix. Collecting these results, we get

ε(D) = tr Λ − tr (σ² I + Ψ Λ Ψ^T)^{−1} Ψ Λ² Ψ^T.
Now one can apply the Woodbury formula (Press, Teukolsky, Vetterling, & Flannery, 1992), which states that (A + UV^T)^{−1} = A^{−1} − A^{−1} U (I + V^T A^{−1} U)^{−1} V^T A^{−1} for matrices A, U, V of appropriate size. Setting A = Λ^{−1}, U = V = σ^{−1} Ψ and using the cyclic invariance of the trace, one then finds ε(n) = ⟨ε(D)⟩_D with³

ε(D) = tr (Λ^{−1} + σ^{−2} Ψ^T Ψ)^{−1} = tr Λ (I + σ^{−2} Λ Ψ^T Ψ)^{−1}.    (2.4)
The advantage of this (still exact) representation of the generalization error is that the average over the test input x has already been carried out and that the remaining dependence on the training inputs is contained entirely 3
¡1
If the covariance function has zero eigenvalues, the inverse ¤ does not exist, and the second form of 2 (D) given in equation 2.4 must be used; similar alternative forms, though not explicitly written, exist for all following results.
Learning Curves for Gaussian Process Regression
1401
in the matrix ª T ª . It also includes as a special case the well-known result for linear regression (see, e.g., Sollich, 1994); ¤ ¡1 and ª T ª can be interpreted as suitably generalized versions of the weight decay (matrix) and input correlation matrix. Starting from equation 2.4, one can now derive approximate expressions for the learning curve 2 ( n ) . The most naive approach is to neglect entirely the uctuations in ª T ª over different data sets and replace it by its average, ¬ P « which is simply h( ª T ª ) ij iD D nlD 1 w i ( xl ) wj ( xl ) D D ndij . This leads to 2 OV ( n) D tr( ¤
¡1
C s ¡2 nI ) ¡1 .
(2.5)
While this is not, in general, a good approximation, it was shown by Opper and Vivarelli (1999) to be a lower bound (called the OV bound below) on the learning curve. It becomes tight in the large noise limit s 2 ! 1 at constant elements of the matrix s ¡2 ª T ª then become n / s 2 : The uctuations of thep p ¡2 2 vanishingly small (of order ns D ( n / s ) / n ! 0), and so replacing ª T ª by its average is justied. By a similar argument, one also expects that (for any xed s 2 > 0) the OV bound will become increasingly tight as n increases. To derive better approximations, it is useful to see how the matrix G D ( ¤ ¡1 C s ¡2 ª T ª ) ¡1 changes when a new example is added to the training set. Using the Woodbury formula again, this change can be expressed as
G ( n C 1) ¡ G (n ) D [G D ¡
¡1 (
n ) C s ¡2 ÃÃ T ]¡1 ¡ G ( n )
G ( n) ÃÃ T G ( n ) s 2 C Ã T G (n) Ã
(2.6)
in terms of the vector à with elements (à ) i D w i ( xn C 1 ). To get exact learning curves, one would have to average this update formula over both the new training input xn C 1 and all previous ones. This is difcult, but progress can be made by neglecting correlations of numerator and denominator in equation 2.6, averaging them separately instead. Also treating n as a continuous variable, this yields the approximation @G ( n ) @n
«
¬ n) D ¡ 2 , s C tr G ( n )
G
2(
(2.7)
where we have introduced the « ¬notation G D hG i. If we also neglect uctuations in G , approximating G 2 D G 2 , we have @G / @n D ¡G 2 / (s 2 C tr G ). Since at n D 0, the solution G D ¤ is diagonal, it will remain so for all n, and one can rewrite this equation for G (n ) as @G ¡1 / @n D (s 2 C tr G ) ¡1 I. G ¡1 ( n ) thus differs from G ¡1 (0) D ¤ ¡1 only by a multiple of the identity matrix and can be represented as G ¡1 ( n ) D ¤ ¡1 C s ¡2 n0 I , where n0 has to
1402
Peter Sollich and Anason Halees
obey s ¡2 dn0 / dn D [s 2 C tr( ¤ the equation
¡1
C s ¡2 n0 I ) ¡1 ] ¡1 . Integrating, one nds for n0
n0 C tr ln( I C s ¡2 n0 ¤ ) D n, and the generalization error 2 D tr G is given by ¡1
2 UC ( n ) D tr( ¤
C s ¡2 n0 I ) ¡1
(2.8)
By comparison with equation 2.5, n0 can be thought of as an effective number of training examples. The subscript UC in equation 2.8 stands for “upper continuous” (treating n as continuous) approximation. A better approximation with a lower value is obtained by retaining uctuations in G . As in the case of the linear perceptron (Sollich, 1994), this can be achieved by introducing an auxiliary offset parameter v into the denition of G , according to
G
¡1
¡1 C s ¡2 ª D vI C ¤
T
ª .
(2.9)
One can then write D ¡ tr G
2
E D
@ @v
tr hG i D
@2
(2.10)
@v
and obtain from equation 2.7 the partial differential equation, @2
@n
¡
1 @2 D 0 s 2 C 2 @v
(2.11)
This can be solved for 2 ( n, v) using the methods of characteristic curves (see appendix A). Resetting the auxiliary parameter v to zero yields the lower continuous (LC) approximation to the learning curve, which is given by the self-consistency equation: ³ 2 LC ( n ) D tr
¤
¡1
C
n s2 C 2
I
´¡1
.
(2.12)
LC
It is easy to show that ε_LC ≤ ε_UC. One can also check that both approximations converge to the exact result, equation 2.5, in the large noise limit (as defined above). Encouragingly, we see that the LC approximation reflects our intuition about the difference between the initial and asymptotic regimes of the learning curve. For ε ≫ σ², we can simplify equation 2.12 to

ε_LC(n) = tr (Λ^{−1} + n/ε_LC I)^{−1},
´ ¡1
,
where, as expected, the noise level σ² has dropped out. In the opposite limit, ε ≪ σ², on the other hand, we have

ε_LC(n) = tr (Λ^{−1} + n/σ² I)^{−1},    (2.13)
n ²¡1 I , s2
(2.13)
which—again as expected—retains the noise level σ² as an important parameter. Equation 2.13 also shows that ε_LC approaches the OV lower bound (from above) for sufficiently large n. We conclude this section with a brief qualitative discussion of the expected n-dependence of ε_LC (which will turn out to be the more accurate of our two approximations). Obviously, this n-dependence depends on the spectrum of eigenvalues λ_i; below, we always assume that these are arranged in decreasing order. Consider, then, first the asymptotic regime ε ≪ σ², where ε_LC and ε_OV become identical. One shows easily that for eigenvalues decaying as a power law for large i, λ_i ~ i^{−r}, the asymptotic learning curve scales as ε_LC ~ (n/σ²)^{−(r−1)/r}.⁴ This is in agreement with known exact results (Silverman, 1985; Ritter, 1996). In the initial regime ε ≫ σ², on the other hand, one can take σ² → 0 and finds then a faster decay of the generalization error, ε_LC ~ n^{−(r−1)}.⁵ We are not aware of exact results pertaining to this regime, except for the OU case in d = 1, which has r = 2 and therefore ε_LC ~ n^{−1}, in agreement with an exact calculation (Manfred Opper, private communication, March 2001).

3 Comparisons with Bounds and Numerical Simulations
We now compare the LC and UC approximations with existing bounds, and with the "true" learning curves as obtained by numerical simulations. A lower bound on the generalization error was given by Michelli and Wahba (1981) as

ε(n) ≥ ε_MW(n) = Σ_{i=n+1}^{∞} λ_i.    (3.1)
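Equation 3.1 is simply the tail sum of the eigenvalue spectrum; a sketch (the spectrum here is an assumed illustration):

```python
import numpy as np

def eps_mw(n, lambdas):
    """MW lower bound, eq. 3.1: sum of the eigenvalues beyond the first n."""
    # sort in decreasing order, then sum the tail
    return float(np.sum(np.sort(lambdas)[::-1][n:]))

lambdas = 1.0 / np.arange(1, 1001) ** 2   # assumed lambda_i ~ i^{-2}
bounds = [eps_mw(n, lambdas) for n in (0, 10, 100)]
```

For this spectrum the n = 0 value is just tr Λ (about π²/6 for an untruncated sum), and the bound decreases as the n largest eigenvalues are removed.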
⁴ Among the scenarios studied in the next section, the OU covariance function provides a concrete example of this kind of behavior. In d = 1 dimension, it has, from equation B.2, λ_i ~ i^{−2} and thus r = 2. The RBF covariance function, on the other hand, has eigenvalues decaying faster than any power law, corresponding to r → ∞ and thus ε_LC ~ σ²/n (up to logarithmic corrections) in the asymptotic regime.

⁵ For covariance functions with eigenvalues λ_i decaying faster than a power law of i, the behavior in the initial regime is nontrivial. For the RBF covariance function in d = 1 with uniform inputs, for example, we find (for large n) ε_LC ~ n exp(−c n²) with some constant c.

This bound is derived for the noiseless case by allowing generalized observations (projections of θ*(x) along the first n eigenfunctions of C(x, x′))
and so is unlikely to be tight for the case of "real" observations at discrete input points. Also, given that the bound is derived from the σ² → 0 limit, it can be useful only in the initial (small n, ε ≫ σ²) regime of the learning curve. There, it confirms a conclusion of our intuitive discussion above: The learning curve has a lower limit below which it will not drop even for σ² → 0. Plaskota (1990) generalized the MW approach to the noisy case and obtained the following improved lower bound,

ε(n) ≥ ε_Pl(n) = min_{{g_i}} [ Σ_{i=1}^{n} λ_i σ²/(g_i + σ²) + Σ_{i=n+1}^{∞} λ_i ],    (3.2)
where the minimum is over all nonnegative g_1, …, g_n obeying Σ_{i=1}^{n} g_i = S ≡ Σ_{l=1}^{n} C(x_l, x_l). (For the purposes of numerical evaluation, the minimization can be carried out largely analytically, leaving only a search over an integer variable determining the number of nonzero g_i; Plaskota, 1990; Sollich, 2001.) Plaskota derived his bound only for covariance functions for which the prior variance C(x, x) is independent of x. We call these uniform covariance functions; due to the general identity ⟨C(x, x)⟩_x = tr Λ, they obey C(x, x) = tr Λ (and hence S = n tr Λ). By following through Plaskota's proof, we have shown that the bound actually extends to general covariance functions in the form stated above (Sollich, 2001). The Plaskota bound is close to the MW bound in the small n regime (where one can effectively take σ² → 0); for larger n, it becomes substantially larger. It therefore has the potential to be useful for both small and large n. Note that, in contrast to all other bounds discussed in this article, the MW and Plaskota bounds are in fact single data set (worst-case) bounds. They apply to ε(D) for any data set D of the given size n rather than just to the average ε(n) = ⟨ε(D)⟩.⁶ Opper (1997) used information-theoretic methods to obtain a different lower bound, but we will not consider this because the more recent OV bound, equation 2.5, is always tighter. Note that the OV bound incorrectly suggests that ε decreases to zero for σ² → 0 at fixed n. It therefore becomes void for small n (where ε ≫ σ²) and is expected to be of use only in the asymptotic regime of large n. There is also an upper bound due to Opper (UO; Opper, 1997),

ε_Q(n) ≤ ε_UO(n) = (σ^{−2} n)^{−1} tr ln(I + σ^{−2} n Λ) + tr (Λ^{−1} + σ^{−2} n I)^{−1}.    (3.3)
⁶ Our generalized version of the Plaskota bound depends on the specific data set only through the value of S = Σ_{l=1}^{n} C(x_l, x_l). To obtain an average case bound, one would need to average over the distribution of S.

Here ε_Q is a modified version of ε that (in the rescaled version that we are using) becomes identical to ε in the limit of small generalization errors
(ε ≪ σ²), but never gets larger than 2σ²; for small n in particular, ε(n) can therefore actually be much larger than ε_Q(n) and its bound, equation 3.3. For this reason, and because in our simulations we never get very far into the asymptotic regime ε ≪ σ², we do not display the UO bound in the figures that follow. The UO bound is complemented by an upper bound due to Williams and Vivarelli (WV; Williams & Vivarelli, 2000), which never decays below values around σ² and is therefore mainly useful in the initial regime ε ≫ σ². It applies for one-dimensional inputs x and stationary covariance functions—for which C(x, x′) = C_s(x − x′) is a function of x − x′ alone—and reads

ε(n) ≤ ε_WV(n) = C_s(0) − 1/(C_s(0) + σ²) ∫_0^1 da f_n(a) C_s²(a)    (3.4)

with

f_n(a) = 2(1 − a)^n H(1 − a) + 2(n − 1)(1 − 2a)^n H(1 − 2a),    (3.5)
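For a concrete covariance function, equations 3.4 and 3.5 can be evaluated by simple numerical integration. A sketch for the stationary RBF case (our own function names; the discretization is an illustrative choice):

```python
import numpy as np

def f_n(a, n):
    """Eq. 3.5; H is the Heaviside step function (1 for z > 0, else 0)."""
    H = lambda z: (z > 0).astype(float)
    return (2 * (1 - a) ** n * H(1 - a)
            + 2 * (n - 1) * (1 - 2 * a) ** n * H(1 - 2 * a))

def eps_wv(n, cs, sigma2, m=20001):
    """WV upper bound, eq. 3.4, by trapezoidal integration over a in [0, 1]."""
    a = np.linspace(0.0, 1.0, m)
    y = f_n(a, n) * cs(a) ** 2
    integral = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(a)))
    return cs(0.0) - integral / (cs(0.0) + sigma2)

cs = lambda d: np.exp(-d ** 2 / (2 * 0.1 ** 2))   # stationary RBF, C_s(0) = 1
vals = [eps_wv(n, cs, 0.05) for n in (2, 10, 100)]
```

As the text notes, f_n is a normalized distribution over a, and for large n the bound approaches C_s(0) σ²/[C_s(0) + σ²] ≈ σ², which the n = 100 value above is already close to.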
and where the Heaviside step functions H (defined as H(z) = 1 for z > 0 and = 0 otherwise) in the two terms in equation 3.5 imply that only values of a up to 1 and 1/2, respectively, contribute to the integral in equation 3.4. The function f_n(a) is a normalized distribution over a, which for n → ∞ becomes peaked around a = 0, implying that the asymptotic value of the bound is ε_WV(n → ∞) = C_s(0) σ²/[C_s(0) + σ²] ≈ σ² for σ² ≪ C_s(0). The derivation of the bound is based on the insight that ε(x, D) always decreases as more examples are added; it can therefore be upper bounded for any given x by the smallest ε(x, D′) that would result from training on any data set D′ comprising only a single example from the original training set D. The idea can be generalized to using the smallest ε(x, D′) obtainable from any two of the training examples, but this does not significantly improve the bound (Williams & Vivarelli, 2000). As stated in equations 3.4 and 3.5, the WV bound applies only to the case of a uniform input distribution over the unit interval [0, 1]. However, it is relatively straightforward to extend the approach to general (one-dimensional) input distributions P(x); only the data set average becomes technically a little more complicated. We omit the details and quote only the result: Equation 3.4 remains valid if the expression 3.5 for f_n(a) is generalized to

f_n(a) = n ⟨P(x − a)[1 − Q(x)]^{n−1} + P(x + a)[Q(x)]^{n−1}⟩_x
       + n(n − 1) ⟨H(x′ − x − 2a)[P(x + a) + P(x′ − a)][1 + Q(x) − Q(x′)]^{n−2}⟩_{x,x′},    (3.6)
Rz where Q (z ) D ¡1 dx P (x ) is the cumulative distribution function. In the simpler scenario considered by Williams and Vivarelli, this can be shown
1406
Peter Sollich and Anason Halees
to reduce to equation 3.5, while in the most general case, the numerical evaluation of the bound requires a triple integral (over $x$, $x'$, and $a$).

Finally, there is one more upper bound, due to Trecate, Williams, and Opper (TWO; Trecate et al., 1999). Based on the generalization error achieved by a "suboptimal" gaussian regressor, they showed that

$$\epsilon(n) \le \epsilon_{\rm TWO}(n) = \mathrm{tr}\,\Lambda - n\sum_i \frac{\lambda_i^2}{c_i}, \quad \text{where} \quad c_i = (n-1)\lambda_i + \sigma^2 + \left\langle C(x,x)\,\phi_i^2(x)\right\rangle_x. \qquad (3.7)$$

For a uniform covariance function, the average in the definition of $c_i$ becomes $\mathrm{tr}\,\Lambda\,\langle\phi_i^2(x)\rangle_x = \mathrm{tr}\,\Lambda$, and the bound simplifies to

$$\epsilon_{\rm TWO}(n) = \sum_i \lambda_i\, \frac{\mathrm{tr}\,\Lambda + \sigma^2 - \lambda_i}{\mathrm{tr}\,\Lambda + \sigma^2 + (n-1)\lambda_i}.$$
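For a uniform covariance function, the simplified bound depends only on the eigenvalues and the noise level, so it is cheap to evaluate. A minimal sketch in Python (the function name and interface are ours, not from the paper):

```python
import numpy as np

def two_bound(eigvals, sigma2, n):
    """TWO upper bound for a uniform covariance function:
    sum_i lam_i (tr L + s2 - lam_i) / (tr L + s2 + (n - 1) lam_i)."""
    lam = np.asarray(eigvals, dtype=float)
    t = lam.sum() + sigma2  # tr Lambda + sigma^2
    return np.sum(lam * (t - lam) / (t + (n - 1) * lam))
```

At $n = 0$ this reduces to $\mathrm{tr}\,\Lambda$, the prior variance, and the bound decreases monotonically as $n$ grows.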
We now compare the quality of these bounds and our approximations with numerical simulations of learning curves. All the theoretical expressions require knowledge of the eigenvalue spectrum of the covariance function, so we focus on situations where this is known analytically. We consider three scenarios. For the first two, we assume that inputs $x$ are drawn from the $d$-dimensional unit hypercube $[0,1]^d$ and that the input density is uniform. As covariance functions, we use the RBF function $C(x, x') = \exp[-\|x - x'\|^2/(2l^2)]$ and the OU function $C(x, x') = \exp(-\|x - x'\|/l)$, to have the extreme cases of smooth and rough functions to be learned; both contain a tunable length scale $l$. To be precise, we use slightly modified versions of the RBF and OU covariance functions (using what physicists call periodic boundary conditions), which make the eigenvalue calculations analytically tractable; the details are explained in appendix B. In the third scenario, we explore the effect of a nonuniform input distribution by considering inputs $x$ drawn from a $d$-dimensional (zero mean) isotropic gaussian distribution $P(x) \propto \exp[-\|x\|^2/(2\sigma_x^2)]$, for an RBF covariance function. Details of the eigenvalue spectrum for this case can also be found in appendix B. Note that in all three cases, the covariance function is uniform, that is, it has a constant variance $C(x,x)$; we have fixed this to unity without loss of generality. This leaves three variable parameters: the input space dimension $d$, the noise level $\sigma^2$, and the length scale $l$. We generically expect the prior variance to be significantly larger than the noise on the training data, so we consider only values of $\sigma^2 < 1$. The length scale $l$ should also obey $l < 1$; otherwise, the covariance functions $C(x, x')$ would be almost constant across the input space, corresponding to a trivial GP prior of essentially
Learning Curves for Gaussian Process Regression
1407
$x$-independent functions. We in fact choose the length scale $l$ for each $d$ in such a way as to get a reasonable decay of the learning curve within the range of $n = 0, \ldots, 300$ that can be conveniently simulated numerically. To see why this is necessary, note that each covariance dent covers a fraction of order $l^d$ of the input space, so that the number of examples $n$ needed to see a significant reduction in generalization error $\epsilon$ will scale as $(1/l)^d$. This quickly becomes very large as $d$ increases unless $l$ is increased simultaneously. (The effect of larger $l$ leading to a faster decay of the learning curve was also observed by Williams & Vivarelli, 2000.)

In the following figures, we show the lower bounds (Plaskota, OV), the nonasymptotic upper bounds (TWO and, for $d = 1$ with uniform input distribution, WV), and our approximations (LC and UC). The true learning curve as obtained from numerical simulations is also shown. For the numerical simulations, we built up training sets by randomly drawing training inputs from the specified input distribution. For each new training input, the matrix inverse $K^{-1}$ has to be recalculated. By partitioning the matrix into its elements corresponding to the old and new inputs, this inversion can be performed with $O(n^2)$ operations (see Press et al., 1992), as opposed to $O(n^3)$ if the inverse is calculated from scratch every time. With $K^{-1}$ known, the generalization error $\epsilon(D)$ was then calculated from equation 1.3 with the average over $x$ estimated by an average over randomly sampled test inputs. This process was repeated up to our chosen $n_{\max} = 300$; the results for $\epsilon(D)$ were then averaged over a number of training set realizations to obtain the learning curve $\epsilon(n)$. In all the graphs shown, the size of the error bars on the simulated learning curve is of the order of the visible fluctuations in the curve.
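The procedure just described can be sketched as follows. This is a minimal illustration, not the authors' code: it uses a plain one-dimensional RBF covariance (without the periodic boundary modification of appendix B), and all names are ours. The $O(n^2)$ update of $K^{-1}$ uses the standard block-matrix (Schur complement) identity.

```python
import numpy as np

def rbf(x1, x2, l=0.01):
    # one-dimensional RBF covariance with unit prior variance
    return np.exp(-(x1 - x2) ** 2 / (2 * l ** 2))

def learning_curve(n_max=50, sigma2=0.05, n_test=200, n_runs=10, seed=0):
    rng = np.random.default_rng(seed)
    eps = np.zeros(n_max)
    for _ in range(n_runs):
        xtest = rng.uniform(size=n_test)
        xs = []                            # training inputs drawn so far
        Kinv = None                        # inverse of K = C + sigma^2 I
        for n in range(n_max):
            x_new = rng.uniform()
            if Kinv is None:
                Kinv = np.array([[1.0 / (rbf(x_new, x_new) + sigma2)]])
            else:
                k = rbf(np.array(xs), x_new)
                kappa = rbf(x_new, x_new) + sigma2
                v = Kinv @ k
                s = kappa - k @ v          # Schur complement: O(n^2) update
                Kinv = np.block([[Kinv + np.outer(v, v) / s, -v[:, None] / s],
                                 [-v[None, :] / s, np.array([[1.0 / s]])]])
            xs.append(x_new)
            # Bayes error eps(x, D) = C(x,x) - k(x)^T K^{-1} k(x), averaged over test inputs
            Kx = rbf(np.array(xs)[:, None], xtest[None, :])
            eps[n] += np.mean(1.0 - np.sum(Kx * (Kinv @ Kx), axis=0))
    return eps / n_runs
```

Averaging over `n_runs` training set realizations gives the simulated learning curve $\epsilon(n)$; since adding an example can only reduce the posterior variance at any $x$, the resulting curve is non-increasing.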
In Figure 3, we show the results for an OU covariance function with inputs from $[0,1]^d$, for $d = 1$ (top) and $d = 2$ (bottom). One observes that the lower bounds (Plaskota and OV) are rather loose in both cases. The TWO upper bound is also far from tight; the WV upper bound is better where it can be defined (for $d = 1$). Our approximations, LC and UC, are closer to the true learning curve than any of the bounds and in fact appear to bracket it. Similar comments apply to Figure 4, which displays the results for an RBF covariance function with inputs from $[0,1]$. Because functions from an RBF prior are much smoother than those from an OU prior, they are easier to learn, and the generalization error $\epsilon$ shows a more rapid decrease with $n$. This makes visible, within the range of $n$ shown, the anticipated change in behavior as $\epsilon$ crosses over from the initial ($\epsilon \gg \sigma^2$) to the asymptotic ($\epsilon \ll \sigma^2$) regime. The LC approximation and, to a less quantitative extent, the Plaskota bound both capture this change. By contrast, the OV bound (as expected from its general properties discussed above) shows the right qualitative behavior only in the asymptotic regime. Figure 5 displays corresponding results in higher dimension ($d = 4$), at two different noise levels $\sigma^2$. One observes in particular that the OV
[Figure 3 appears here; legend: Sim, UC, LC, WV, OV, TWO, Plas; axes $\epsilon$ versus $n$.]

Figure 3: Learning curve for a GP with OU covariance function and inputs uniformly drawn from $x \in [0,1]^d$, at noise level $\sigma^2 = 0.05$. (Top) $d = 1$, length scale $l = 0.01$. (Bottom) $d = 2$, $l = 0.1$.
lower bound becomes looser as $\sigma^2$ decreases; this is as expected since for $\sigma^2 \to 0$, the bound actually becomes void ($\epsilon_{\rm OV} \to 0$). The Plaskota bound also appears to get looser for lower $\sigma^2$, though not as dramatically. (Note that the kinks in the Plaskota curve are not an artifact. For larger $d$, the multiplicities of the different eigenvalues can be quite large; the value of $\epsilon_{\rm Pl}$ can become dominated by one such block of degenerate eigenvalues, and kinks occur where the dominant block changes.) The TWO upper bound, finally, is only weakly affected by the value of $\sigma^2$ and quite loose throughout.
[Figure 4 appears here; legend: Sim, UC, LC, WV, OV, TWO, Plas, MW; axes $\epsilon$ versus $n$.]

Figure 4: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from $x \in [0,1]^d$, at noise level $\sigma^2 = 0.05$, dimension $d = 1$, length scale $l = 0.01$. We show the MW bound here to demonstrate how it is first close to the Plaskota bound but then misses the change in behavior where the generalization error crosses over into the asymptotic regime ($\epsilon \ll \sigma^2$).
All results shown so far pertain to uniform input distributions (over $[0,1]^d$). We now move to the last of our three scenarios: a GP with an RBF covariance function and inputs drawn from a gaussian distribution (see appendix B for details). In Figure 6 we see that in $d = 1$, the (generalized) WV bound is still reasonably tight, while the LC approximation now provides a less good representation of the overall shape of the learning curve than for the case of uniform input distributions. However, as in all previous examples, the LC and UC approximations still bracket the true learning curve (and come closer to it than the bounds). One is thus led to speculate whether the approximations we have derived are actually bounds. Figure 7 shows this not to be the case, however. In $d = 4$, the true learning curve drops visibly below the LC approximation in the small $n$ regime, and so the latter cannot be a lower bound. The low noise case ($\sigma^2 = 0.001$) displayed here illustrates once more that the OV lower bound ceases to be useful for small noise levels.

In summary, of the approximations that we have derived, the LC approximation performs best. Although we know on theoretical grounds that it will be accurate for large noise levels $\sigma^2$, the examples shown above demonstrate that it produces predictions close to the true learning curves even for the more realistic case of noise levels that are low compared to the prior variance. As a general trend, agreement appears to be better for the case of uniform input distributions.
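For reference, the LC approximation itself requires only the eigenvalues: one solves the self-consistency relation $\epsilon = \mathrm{tr}(\Lambda^{-1} + gI)^{-1}$ with $g = n/(\epsilon + \sigma^2)$, for example by fixed-point iteration. A sketch (the interface and iteration scheme are our choices, not the authors'):

```python
import numpy as np

def lc_approximation(eigvals, n, sigma2, tol=1e-12, max_iter=100_000):
    """LC approximation: eps = sum_i (1/lam_i + g)^{-1} with g = n/(eps + sigma2)."""
    lam = np.asarray(eigvals, dtype=float)
    eps = lam.sum()                        # exact value at n = 0: eps = tr Lambda
    for _ in range(max_iter):
        g = n / (eps + sigma2)
        eps_new = np.sum(1.0 / (1.0 / lam + g))
        if abs(eps_new - eps) < tol:
            break
        eps = eps_new
    return eps
```

Starting from $\epsilon = \mathrm{tr}\,\Lambda$, the iterates decrease monotonically toward the fixed point.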
Figure 5: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from $x \in [0,1]^d$, for dimension $d = 4$ and length scale $l = 0.3$. (Top) Noise level $\sigma^2 = 0.05$. (Bottom) $\sigma^2 = 0.001$. Note that for lower $\sigma^2$, the OV bound becomes looser as expected (it approaches zero for $\sigma^2 \to 0$).
It is interesting at this stage to make a connection to the recent work of Malzahn and Opper (2001); some details of their approach were kindly provided by Manfred Opper in private communication, March 2001. Malzahn and Opper devised an elegant way of approaching the learning curve problem from the point of view of statistical physics, calculating the relevant partition function (which is an average over data sets) using a so-called gaussian variational approximation. The result they find for the Bayes error is identical to the LC approximation under the condition that $\epsilon(x) = \langle\epsilon(x, D)\rangle_D$, the $x$-dependent generalization error averaged over all data sets, is independent of $x$. Otherwise, they find a result of the same functional form, $\epsilon(n) = \mathrm{tr}(\Lambda^{-1} + gI)^{-1}$, but the self-consistency equation for $g$ is more complicated than the simple relation $g = n/(\epsilon + \sigma^2)$ obtained from the LC approximation, equation 2.12. The LC approximation would thus be expected to perform less well for such "nonuniform" scenarios. This agrees qualitatively with our above findings. For the scenario with a gaussian input distribution, the LC approximation is of poorer quality than for the cases with uniform input distributions over $[0,1]^d$.$^7$

[Figure 6 appears here.]

Figure 6: Learning curve for a GP with RBF covariance function (length scale $l = 0.01$) and inputs drawn from a gaussian distribution in $d = 1$ dimension; noise level $\sigma^2 = 0.05$. Note that the LC approximation is not as good a representation of the overall shape of the learning curve here as for the previous examples with uniform input distributions. The curve labeled WV shows our generalized version of the Williams-Vivarelli bound (see equation 3.6).

4 Improving the Approximations
In the previous section we saw that in our test scenarios, the LC approximation, equation 2.12, generally provides the closest theoretical approximation to the true learning curves. This may appear somewhat surprising, given
$^7$ It is easy to see that in these cases, $\epsilon(x)$ is indeed independent of $x$. The absence of effects from the boundaries of the hypercube comes from the periodic boundary conditions that we are using.
Figure 7: Learning curve for a GP with RBF covariance function (length scale $l = 0.3$) and inputs drawn from a gaussian distribution in $d = 4$ dimensions. (Top) Noise level $\sigma^2 = 0.05$. (Bottom) $\sigma^2 = 0.001$. The OV lower bound is extremely loose for this smaller noise level. Note that in contrast to all previous examples, there is a range of $n$ here where the true learning curve lies below the LC approximation.
that we made two rather drastic approximations in deriving equation 2.12. We treated the number of training examples $n$ as a continuous variable, and we decoupled the average of the right-hand side of equation 2.6 into separate averages over numerator and denominator. We now investigate whether the LC prediction for the learning curves can be further improved by removing these approximations.

We begin with the effect of $n$, the number of training examples, taking only discrete (rather than continuous) values. Starting from equation 2.6, averaging numerator and denominator separately as before and introducing the auxiliary variable $v$ as in equations 2.9 and 2.10, we obtain

$$\epsilon(n+1) - \epsilon(n) = \frac{1}{\sigma^2 + \epsilon}\frac{\partial\epsilon}{\partial v} \qquad (4.1)$$

instead of equation 2.11. It is possible to interpolate between these two equations by writing

$$\frac{1}{\delta}\left[\epsilon(n+\delta) - \epsilon(n)\right] = \frac{1}{\sigma^2 + \epsilon}\frac{\partial\epsilon}{\partial v}. \qquad (4.2)$$

Then $\delta = 1$ corresponds to equation 4.1, which is the equation we wish to solve (discrete $n$), while in the limit $\delta \to 0$ we retrieve equation 2.11. To proceed, we treat $\delta$ as a perturbation parameter and assume that the solution of equation 4.2 can be expanded as

$$\epsilon = \epsilon_0 + \delta\epsilon_1 + O(\delta^2),$$

where $\epsilon_0 \equiv \epsilon_{\rm LC}$. Expanding both sides of equation 4.2 to first order in $\delta$ yields

$$\frac{\partial\epsilon_0}{\partial n} + \frac{\delta}{2}\frac{\partial^2\epsilon_0}{\partial n^2} + \delta\frac{\partial\epsilon_1}{\partial n} + O(\delta^2) = \left(\frac{1}{\sigma^2 + \epsilon_0} - \delta\frac{\epsilon_1}{(\sigma^2 + \epsilon_0)^2}\right)\left(\frac{\partial\epsilon_0}{\partial v} + \delta\frac{\partial\epsilon_1}{\partial v}\right).$$

Comparing the coefficients of the $O(\delta^0)$ terms gives us back equation 2.11 for $\epsilon_0$, while from the $O(\delta)$ terms we get

$$\frac{\partial\epsilon_1}{\partial n} = \frac{1}{\sigma^2 + \epsilon_0}\frac{\partial\epsilon_1}{\partial v} - \frac{\epsilon_1}{(\sigma^2 + \epsilon_0)^2}\frac{\partial\epsilon_0}{\partial v} - \frac{1}{2}\frac{\partial^2\epsilon_0}{\partial n^2}. \qquad (4.3)$$

This can again be solved using characteristics (see appendix A), with the result

$$\epsilon_1 = (\sigma^2 + \epsilon_0)\,\frac{n(a_2^2 - a_3)}{(1 - na_2)^2}, \qquad a_k = (\sigma^2 + \epsilon_0)^{-k}\,\mathrm{tr}\left(\Lambda^{-1} + \frac{n}{\sigma^2 + \epsilon_0}\,I\right)^{-k}. \qquad (4.4)$$
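Equation 4.4 is straightforward to evaluate once $\epsilon_0$ has been obtained from the LC self-consistency relation. A sketch (the fixed-point solver is our own choice of method, and all names are ours):

```python
import numpy as np

def lc1_correction(eigvals, n, sigma2):
    lam = np.asarray(eigvals, dtype=float)
    # solve the LC self-consistency eps0 = sum_i (1/lam_i + n/(eps0 + sigma2))^{-1}
    eps0 = lam.sum()
    for _ in range(100_000):
        eps_new = np.sum(1.0 / (1.0 / lam + n / (eps0 + sigma2)))
        if abs(eps_new - eps0) < 1e-14:
            break
        eps0 = eps_new
    # a_k = (sigma2 + eps0)^{-k} tr(Lambda^{-1} + n/(sigma2 + eps0) I)^{-k}, eq. 4.4
    s = sigma2 + eps0
    a2 = s ** -2 * np.sum((1.0 / lam + n / s) ** -2.0)
    a3 = s ** -3 * np.sum((1.0 / lam + n / s) ** -3.0)
    eps1 = s * n * (a2 ** 2 - a3) / (1.0 - n * a2) ** 2
    return eps0, eps1
```

Consistent with the discussion below, the correction `eps1` comes out zero at $n = 0$ and negative for $n \ge 1$.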
[Figure 8 appears here; legend: $\epsilon_0 = \epsilon_{\rm LC}$ and $-100\,\epsilon_1$; axes $\epsilon$ versus $n$.]

Figure 8: Solid line: LC approximation $\epsilon_{\rm LC}$ for the learning curve of a GP with RBF covariance function and gaussian inputs. Parameters are as in Figure 7 (top): dimension $d = 4$, length scale $l = 0.3$, noise level $\sigma^2 = 0.05$. Dashed line: first-order correction $\epsilon_1$ arising from the discrete nature of the number of training examples $n$. Note that $\epsilon_1$ has been multiplied by $-100$ to make it positive and visible on the scale of the values of $\epsilon_{\rm LC}$.
Setting $\delta$ back to 1 to have the case of discrete $n$ in equation 4.2, we then have

$$\epsilon_{\rm LC1} = \epsilon_0 + \epsilon_1 \equiv \epsilon_{\rm LC} + \epsilon_1$$
as the improved LC approximation that takes into account the effects of discrete $n$ (up to linear order in a perturbation expansion in $\delta$). We see that the correction term, equation 4.4, is zero at $n = 0$, as it must be, since $\epsilon_{\rm LC}$ gives the exact result $\epsilon = \mathrm{tr}\,\Lambda$ there. It can also be shown that $\epsilon_1 < 0$ for all nonzero $n$. This can be understood as follows: The decrease of $\epsilon(n)$ becomes smaller (in absolute value) as $n$ increases. Comparing equations 2.11 and 4.1, we see that the continuous-$n$ approximation effectively averages the decrease term over the range $n, \ldots, n+1$ rather than evaluating it at $n$ itself; it therefore produces a smaller decrease in $\epsilon(n)$. The true decrease for discrete $n$ is larger, and so one expects the correction $\epsilon_1$ to be negative, in agreement with our calculation.

In Figure 8 we show $\epsilon_{\rm LC}$ and $\epsilon_1$ for one of the scenarios considered earlier; the results are typical also of what we find for other cases. The most striking observation is the smallness of $\epsilon_1$. Its absolute value is of the order of 1% of $\epsilon_{\rm LC}$ or less, and consequently $\epsilon_{\rm LC}$ and $\epsilon_{\rm LC1}$ would be indistinguishable on the scale of the plot. On the one hand, this is encouraging. Given that $\epsilon_1$ is already so small, one would expect higher orders in a perturbation
expansion in $\delta$ to yield even smaller corrections. Thus, $\epsilon_{\rm LC1}$ is likely to be very close to the result that one would find if the discrete nature of $n$ were taken into account exactly. On the other hand, we also conclude that treating $n$ as discrete is not sufficient to turn the LC approximation into a lower bound on the learning curve; in Figure 6, for example, the curve for $\epsilon_{\rm LC1}$ would lie essentially on top of the one for $\epsilon_{\rm LC}$ and so still be significantly above the true learning curve for small $n$.

It is clear at this stage that in order to improve the LC approximation significantly, one would have to address the decoupling of the numerator and denominator averages in equation 2.6. Generally, if $a$ and $b$ are random variables, one can evaluate the average of their ratio perturbatively as

$$\left\langle \frac{a}{b} \right\rangle = \left\langle \frac{a}{\langle b\rangle + \Delta b} \right\rangle = \frac{\langle a\rangle}{\langle b\rangle} + \frac{\langle a\rangle}{\langle b\rangle^3}\left\langle (\Delta b)^2 \right\rangle - \frac{1}{\langle b\rangle^2}\left\langle \Delta a\,\Delta b \right\rangle + \cdots$$

up to second order in the fluctuations. (This idea was used by Sollich, 1994, to calculate finite-$N$ corrections to the $N \to \infty$ limit of a linear learning problem.) To apply this to the average of the right-hand side of equation 2.6 over the new training input $x_{n+1}$ and all previous ones, one would set

$$a = G(n)\,\psi\psi^T G(n), \qquad b = \sigma^2 + \psi^T G(n)\,\psi.$$

One then sees that averages such as $\langle ab\rangle$, required in $\langle\Delta a\,\Delta b\rangle = \langle ab\rangle - \langle a\rangle\langle b\rangle$, involve fourth-order averages $\langle \phi_i(x)\phi_j(x)\phi_k(x)\phi_l(x)\rangle_x$ of the components of $\psi$. In contrast to the second-order averages $\langle \phi_i(x)\phi_j(x)\rangle = \delta_{ij}$, such fourth-order statistics of the eigenfunctions do not have a simple, covariance function-independent form. Even if these statistics were known, however, one would end up with averages over the entries of the matrix $G$ that cannot be reduced to $\epsilon = \mathrm{tr}\,\langle G\rangle$ (for example, by derivatives with respect to auxiliary parameters). Separate equations for the change of these averages with $n$ would then be required, generating new averages and eventually an infinite hierarchy that cannot be closed. We thus conclude that a perturbative approach is of little use in improving the LC approximation beyond the decoupling of averages. The approach of Malzahn and Opper (2001) thus looks more hopeful as far as the derivation of systematic corrections to the approximation is concerned.

5 How Good Can Bounds and Approximations Be?
In this final section, we ask whether there are limits of principle on the quality of theoretical predictions (either bounds or approximations) for GP learning curves. Of course, this question is meaningless unless we specify what information the theoretical curves are allowed to exploit. Guided by the insight that all predictions discussed above depend (at least for uniform covariance functions) on the eigenvalues of the covariance function
only (and of course the noise level $\sigma^2$), we ask: How tight can bounds and approximations be if they use only this eigenvalue spectrum as input?

To answer this question, it is useful to have a simple scenario with an arbitrary eigenvalue spectrum for which learning curves can be calculated exactly. Consider the case where the input space consists of $N$ discrete points $x_a$; the input distribution is arbitrary, $P(x_a) = p_a$ with $\sum_a p_a = 1$. Now take the covariance function to be degenerate in the sense that there are no correlations between different points: $C(x_a, x_b) = c_a\delta_{ab}$. The eigenvalue equation, 2.2, then becomes simply

$$\sum_b C(x_a, x_b)\,\phi(x_b)\,p_b = c_a p_a\,\phi(x_a) = \lambda\,\phi(x_a),$$

so that the $N$ different eigenvalues are $\lambda_a = c_a p_a$. The eigenfunctions are $\phi_a(x_b) = p_a^{-1/2}\delta_{ab}$, where the prefactor follows from the normalization condition $\sum_c p_c\,\phi_a(x_c)\phi_b(x_c) = \delta_{ab}$. Note that by choosing the $c_a$ and $p_a$ appropriately, the $\lambda_a$ can be adjusted to any desired value in this setup. The same still holds even if we require the covariance function to be uniform, that is, $c_a$ to be independent of $a$. For this case, a simple connection can also be made to the covariance functions discussed in previous sections. If one imagines the points $x_a$ arranged in some $d$-dimensional space, then the present covariance function can be viewed as an RBF covariance function $C(x, x') = \mathrm{const}\cdot\exp[-\|x - x'\|^2/(2l^2)]$ in which the correlation length scale $l$ has been taken to zero, so that different input points have entirely uncorrelated outputs.

A set of $n$ training inputs is, in this scenario, fully characterized by how often it contains each of the possible inputs $x_a$; we call these numbers $n_a$. The generalization error is easy to work out from equation 2.4 using

$$(\Psi^T\Psi)_{ab} = \sum_c n_c\,\phi_a(x_c)\phi_b(x_c) = (n_a/p_a)\,\delta_{ab}.$$
This shows that $\Psi^T\Psi$ is a diagonal matrix, and thus from equation 2.4,

$$\epsilon(D) = \mathrm{tr}\left(\Lambda^{-1} + \sigma^{-2}\Psi^T\Psi\right)^{-1} = \sum_a \left(\lambda_a^{-1} + \sigma^{-2} n_a/p_a\right)^{-1} = \sum_a \lambda_a\,\frac{\sigma^2}{\sigma^2 + n_a\lambda_a/p_a}. \qquad (5.1)$$
This has the expected form. The contribution of each eigenvalue is reduced according to the ratio of the noise level $\sigma^2$ and the signal $n_a\lambda_a/p_a = n_a c_a$ received at the corresponding input point. To average this over all training sets of size $n$, one notices that $n_a$ has a binomial distribution, so that

$$\epsilon(n) = \sum_a \lambda_a \sum_{n_a=0}^{n} \binom{n}{n_a}\, p_a^{n_a}(1-p_a)^{n-n_a}\, \frac{\sigma^2}{\sigma^2 + n_a\lambda_a/p_a}.$$
Writing $(\sigma^2 + n_a\lambda_a/p_a)^{-1} = \sigma^{-2}\int_0^1 dr\, r^{n_a\lambda_a/(p_a\sigma^2)}$, we can perform the sum over $n_a$ and obtain

$$\epsilon(n) = \int_0^1 dr \sum_a \lambda_a \left(1 - p_a + p_a\, r^{\lambda_a/(p_a\sigma^2)}\right)^n \qquad (5.2)$$
as the final result; the integral over $r$ can easily be performed numerically for a given set of eigenvalues $\lambda_a$ and input probabilities $p_a$. Note that, having done the calculation for a finite number $N$ of discrete input points (and therefore of eigenvalues), we can now also take $N$ to infinity and therefore analyze scenarios (such as the ones studied in section 3) with an infinite number of nonzero eigenvalues.

A simple limiting case will now tell us about the quality of eigenvalue-dependent upper bounds on learning curves. Assume that one of the $p_a$ is close to 1, whereas all the others are close to 0. From equation 5.2, one then sees that only the contribution from the eigenvalue $\lambda_a$ with $p_a \approx 1$ is reduced as $n$ increases,$^8$ while all other ones remain unaffected, so that

$$\epsilon(n) \approx \mathrm{tr}\,\Lambda - \lambda_a + \frac{\lambda_a\sigma^2}{\sigma^2 + n\lambda_a} = \mathrm{tr}\,\Lambda - \frac{n\lambda_a^2}{\sigma^2 + n\lambda_a} \ge \mathrm{tr}\,\Lambda - \lambda_a. \qquad (5.3)$$
If $\lambda_a \ll \mathrm{tr}\,\Lambda$, then we can make the reduction in generalization error arbitrarily small. It thus follows that there is no nontrivial upper bound on learning curves that takes only the eigenvalue spectrum of the covariance function as input. (Accordingly, the two nonasymptotic upper bounds, WV and TWO, discussed in section 3 both contain other information, via the weighted averages of $C_s^2(x) = C^2(0, x)$ in equation 3.4 and the average of $C(x,x)\,\phi_i^2(x)$ in equation 3.7.) In particular, this implies that our UC approximation cannot be an upper bound (even though the results for all scenarios investigated above suggested that it might be). Furthermore, our result shows that lower bounds on the generalization error (e.g., the OV or Plaskota bounds) can be arbitrarily loose. A similar observation holds for upper bounds on the (data set-averaged) training error $\epsilon_t$, defined as $\epsilon_t = \langle\epsilon_t(D)\rangle_D$, where

$$\epsilon_t(D) = \frac{1}{n}\sum_{l=1}^{n}\left(\hat{h}(x_l, D) - y_l\right)^2.$$
Opper and Vivarelli (1999) showed that their $\epsilon_{\rm OV}$ actually also provides an upper bound for this quantity (so that the two errors sandwich the bound, $\epsilon_t \le \epsilon_{\rm OV} \le \epsilon$). In the present case, it is easy to calculate $\epsilon_t$ explicitly; we omit details and quote just the result:

$$\epsilon_t \approx \frac{\lambda_a\sigma^2}{\sigma^2 + n\lambda_a} \le \lambda_a.$$

$^8$ This holds if $n$ is not too large, more precisely if $np_b \ll 1$ for all the "small" $p_b$ ($b \ne a$).
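The $r$-integral in equation 5.2 is smooth on $[0, 1]$ and can be evaluated with a simple quadrature rule. A minimal sketch (midpoint rule; all names are ours):

```python
import numpy as np

def degenerate_learning_curve(eigvals, probs, sigma2, n, grid=2000):
    """Exact averaged learning curve of the degenerate scenario, equation 5.2:
    eps(n) = int_0^1 dr sum_a lam_a (1 - p_a + p_a r^{lam_a/(p_a sigma2)})^n."""
    lam = np.asarray(eigvals, dtype=float)
    p = np.asarray(probs, dtype=float)
    r = (np.arange(grid) + 0.5) / grid            # midpoint rule on [0, 1]
    expo = lam / (p * sigma2)
    integrand = np.sum(
        lam[:, None] * (1.0 - p[:, None] + p[:, None] * r[None, :] ** expo[:, None]) ** n,
        axis=0)
    return integrand.mean()
```

With the uniform prior variance choice $p_a = \lambda_a/\mathrm{tr}\,\Lambda$, this is the quantity plotted as the curve Deg in Figure 9.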
[Figure 9 appears here; legend: Sim, UC, LC, TWO, Deg; axes $\epsilon$ versus $n$.]

Figure 9: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from $x \in [0,1]^d$. Parameters are as in Figure 5 (top): dimension $d = 4$, length scale $l = 0.3$, and noise level $\sigma^2 = 0.05$. The curves Sim (true learning curve, from numerical simulations), UC, LC, and TWO are also as in Figure 5 (top). The curve labeled Deg shows the exact learning curve for the degenerate scenario (outputs for different inputs are uncorrelated) with exactly the same spectrum of eigenvalues $\lambda_i$ of the covariance function (and uniform prior variance $C(x, x)$). The curves Sim and Deg differ significantly, showing that learning curves cannot be predicted reliably based on eigenvalues alone.
By taking $\lambda_a \ll \mathrm{tr}\,\Lambda$, one then sees that the upper bound $\epsilon_{\rm OV}$ can be made arbitrarily loose for any fixed $n$ (and that the ratio of training and generalization error can be made arbitrarily small). One may object that the above limit of most of the $p_a$ tending to zero is unrealistic because it implies that the corresponding prior variances $c_a = \lambda_a/p_a$ would become very large. Let us therefore now restrict the prior variance to be uniform, $c_a = c$. It then follows that $\lambda_a = cp_a$ and hence $p_a = \lambda_a/\mathrm{tr}\,\Lambda$. With this assumption, only the $\lambda_a$ and $\sigma^2$ remain as parameters affecting the learning curve, equation 5.2. The results for an eigenvalue spectrum from one of the situations covered in section 3 are shown in Figure 9. The main conclusion to be drawn is that the learning curves for the present scenario are quite different from the ones we found earlier, even though the eigenvalue spectra and noise levels are, by construction, precisely identical. This demonstrates that theoretical predictions for learning curves that take into account only the eigenvalue spectrum of a covariance function cannot universally match the true learning curves with a high degree of accuracy; the quality of approximation will vary depending on details of the covariance function and input distribution that are not encoded in the spectrum.
Note that Figure 9 also provides a concrete example for the fact that the UC approximation is not in general an upper bound on the true learning curve; in fact, it here underestimates the true $\epsilon(n)$ quite significantly. We can also use the scenario to assess whether, as a bound on the generalization error resulting from a single training set, the Plaskota bound, equation 3.2, could be significantly improved. We focus on the case of uniform covariance functions, where equation 5.1 becomes

$$\epsilon(D) = \sum_a \lambda_a\,\frac{\sigma^2}{\sigma^2 + n_a\,\mathrm{tr}\,\Lambda}. \qquad (5.4)$$
For any assignment of the $n_a$, there is at least one training set of size $n = \sum_a n_a$ for which the generalization error is given by equation 5.4. Minimizing numerically over the $n_a$ for each given $n$, we find the curves shown in Figure 10, where the Plaskota bounds for the same eigenvalue spectra are also shown. The curves are quite close to each other, implying that the Plaskota bound cannot be significantly tightened as a single data set bound (assuming, as throughout, that the improved bound would again be based on only the covariance function's eigenvalue spectrum). In the limit $\sigma^2 \to 0$, the bound, which then reduces to the MW bound (see equation 3.1), cannot be tightened at all, as setting $n_a = 1$ for $a = 1, \ldots, n$ and $n_a = 0$ for $a \ge n+1$ in equation 5.4 shows.

Within the simple degenerate scenario introduced in this section, we finally comment briefly on a recently proposed universal relation (Malzahn & Opper, 2001; Manfred Opper, private communication, March 2001). Malzahn and Opper suggest considering an empirical estimate of the (Bayes) generalization error, which is obtained by replacing the average over all inputs $x$ by one over the $n$ training inputs $x_i$:

$$\epsilon_{\rm emp}(D) = \frac{1}{n}\sum_{i=1}^{n} \epsilon(x_i, D).$$
Within the approximations of their calculation, the data set average of this quantity is then universally linked to a modified version of the true generalization error:

$$\left\langle \epsilon_{\rm emp}(D) \right\rangle_D = \sigma^2 \left\langle \frac{\epsilon(x)}{\sigma^2 + \epsilon(x)} \right\rangle_x. \qquad (5.5)$$

Note that the average over data sets is on the inside of the fraction on the right-hand side, through the definition of $\epsilon(x) = \langle\epsilon(x, D)\rangle_D$. Within our degenerate scenario, we can calculate both sides of equation 5.5 explicitly but find no obvious relation between the two sides. However, if we move the data set average on the right-hand side to the outside, we do (after a
[Figure 10 appears here; log-scale $\epsilon$ axis; legend: unif $\sigma^2 = 0.05$, unif $\sigma^2 = 0.001$, Gauss $\sigma^2 = 0.05$, Gauss $\sigma^2 = 0.001$.]

Figure 10: Comparison of the Plaskota bound (solid lines) and the lowest generalization error achievable for single data sets of size $n$ within the degenerate scenario (dashed lines). The eigenvalue spectra used to construct the curves are those for an RBF covariance function with length scale $l = 0.3$, in $d = 4$ dimensions, and for the input distributions (uniform over $[0,1]^d$, or gaussian) shown in the legend; the noise level $\sigma^2$ is also given there. Note that for a given $n$, the curves become closer for lower $\sigma^2$; this is as expected since for $\sigma^2 \to 0$, the Plaskota bound can be saturated for a specific data set (see text).
brief calculation, the details of which we omit) find a simple result:

$$\left\langle \epsilon_{\rm emp}(D_{n+1}) \right\rangle_{D_{n+1}} = \sigma^2 \left\langle\left\langle \frac{\epsilon(x, D_n)}{\sigma^2 + \epsilon(x, D_n)} \right\rangle_x\right\rangle_{D_n}. \qquad (5.6)$$
As indicated by the subscripts, the left-hand side of this relation is to be evaluated for data sets of size $n + 1$ rather than $n$. The result, equation 5.6, is remarkable in that it holds for any eigenvalue spectrum and any input distribution (within the degenerate scenario considered here). We take this as a hopeful sign that some universal link between true and empirical generalization errors, along the lines derived by Malzahn and Opper (2001) within their approximation, may indeed exist.
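Within the degenerate scenario, equation 5.6 can be checked exactly by enumerating all count configurations $(n_a)$ and weighting them with their multinomial probabilities. A small sketch with three input points and uniform prior variance $c$, so that $\epsilon(x_a, D) = c\sigma^2/(\sigma^2 + n_a c)$ (all names are ours):

```python
import numpy as np
from itertools import product
from math import comb

def avg_over_counts(n, p, f):
    """Average f(counts) over the multinomial distribution of n draws with probabilities p."""
    total = 0.0
    for na in product(range(n + 1), repeat=len(p)):
        if sum(na) != n:
            continue
        prob, rem = 1.0, n
        for k, pk in zip(na, p):
            prob *= comb(rem, k) * pk ** k
            rem -= k
        total += prob * f(np.array(na))
    return total

def check_eq_5_6(n=4, c=1.0, sigma2=0.05, p=(0.5, 0.3, 0.2)):
    post_var = lambda na: c * sigma2 / (sigma2 + na * c)   # eps(x_a, D)
    # left-hand side: empirical error, averaged over data sets of size n + 1
    lhs = avg_over_counts(n + 1, p,
                          lambda na: np.sum(na / (n + 1) * post_var(na)))
    # right-hand side: sigma^2 << eps(x, D)/(sigma^2 + eps(x, D)) >_x >_D, size n
    rhs = avg_over_counts(n, p,
                          lambda na: sigma2 * np.sum(np.array(p) * post_var(na)
                                                     / (sigma2 + post_var(na))))
    return lhs, rhs
```

The two sides agree to machine precision, in line with equation 5.6.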
6 Conclusion
In summary, we have derived an exact representation of the average generalization error $\epsilon$ of GPs used for regression, in terms of the eigenvalue decomposition of the covariance function. Starting from this, we obtained two different approximations (LC and UC) to the learning curve $\epsilon(n)$. Both become exact in the large noise limit; in practice, one generically expects the opposite case ($\sigma^2/C(x,x) \ll 1$), but comparison with simulation results shows that even in this regime, the new approximations perform well. The LC approximation in particular represents the overall shape of the learning curves very well, both for "rough" (OU) and "smooth" (RBF) gaussian priors, and for small as well as for large numbers of training examples $n$. It is not perfect, but it does generally get substantially closer to the true learning curves than existing bounds (two of which, due to Plaskota and to Williams and Vivarelli, we generalized to a wider range of scenarios). For situations with nonuniform input distributions, the predictions of the LC approximation tend to be less accurate, and we linked this observation to recent work by Malzahn and Opper (2001) on the effects of nonuniformity across input space. Their result, which reduces to the LC approximation for sufficiently uniform scenarios, may in other cases provide better approximations, but this has to be traded off against the higher computational cost that would be involved in actually evaluating the predictions.

We next discussed how the LC approximation could be improved. The effects of discrete $n$ can be incorporated to leading order but were seen to be relatively minor; on the other hand, the second approximation involved in the derivation (decoupling of averages) appears difficult to improve on within our framework. Finally, we investigated a simple "degenerate" GP learning scenario, where the outputs corresponding to different inputs are uncorrelated.
This provided us with a means of assessing whether there are limits on the quality of approximations and bounds that take into account only the eigenvalue spectrum of the covariance function. We found that such limits indeed exist. There can be no nontrivial upper bound on the learning curve of this form, and approximations are necessarily of limited quality because different covariance functions with the same eigenvalue spectrum can produce rather different learning curves. We also found that as a bound on the generalization error for single data sets (rather than its average over data sets), the Plaskota bound is close to being tight. Whether a tight lower bound on the average learning curve exists remains an open question; one plausible candidate worth investigating would be the average generalization error of our degenerate scenario, minimized over all possible input distributions for a fixed eigenvalue spectrum.

There are a number of open problems. One is whether a subclass of GP learning scenarios can be defined for which the covariance function's eigenvalue spectrum is sufficient for predicting the learning curves accurately.
1422
Peter Sollich and Anason Halees
Alternatively, one could ask what (minimal) extra information beyond the eigenvalue spectrum needs to be taken into account to arrive at accurate learning curves for all possible GP regression problems. Finally, one may wonder whether the eigenvalue decomposition we have chosen, which explicitly depends on the input distribution, is really the optimal one. On the one hand, recent work (see, e.g., Williams & Seeger, 2000) appears to answer this question in the affirmative. On the other hand, the variability of learning curves among GP covariance functions with the same eigenvalue spectrum suggests that the eigenvalues alone do not provide sufficient information for accurate predictions. One may therefore speculate that eigendecompositions with respect to other input distributions (e.g., maximum entropy ones) might not suffer from this problem. We leave these challenges for future work.

Appendix A: Solving for the LC Approximation
In this appendix we describe how to solve equations 2.11 and 4.3 for the LC approximation and its first-order correction, using the method of characteristic curves. The method applies to partial differential equations of the form a ∂f/∂x + b ∂f/∂y = c, where f = f(x, y) and a, b, c can be arbitrary functions of x, y, f. Viewing the solution as a surface in (x, y, f)-space, one can show (John, 1978) that if the point (x₀, y₀, f₀) belongs to the solution surface, then so does the entire characteristic curve (x(t), y(t), f(t)) defined by

dx/dt = a,   dy/dt = b,   df/dt = c,   (x(0), y(0), f(0)) = (x₀, y₀, f₀).
The solution surface can then be recovered by combining an appropriate one-dimensional family of characteristic curves. Denote the generalization error predicted by the LC approximation as ε₀(n, v), with v the auxiliary parameter introduced in equations 2.9 and 2.10. It is the solution of equation 2.11,

∂ε₀/∂n − [1/(σ² + ε₀)] ∂ε₀/∂v = 0,

subject to the initial condition ε₀(n = 0, v) = tr(Λ⁻¹ + vI)⁻¹. These give us a family of solution points that the characteristic curves have to pass through: (n(0) = 0, v(0) = v₀, ε₀(0) = tr(Λ⁻¹ + v₀I)⁻¹). The equations for the characteristic curves are

dn/dt = 1,   dv/dt = −1/(σ² + ε₀),   dε₀/dt = 0
and can be integrated to give

n = n(0) + t = t,
v = v(0) − t/(σ² + ε₀) = v₀ − t/(σ² + ε₀),   (A.1)
ε₀ = ε₀(0) = tr(Λ⁻¹ + v₀I)⁻¹.
Eliminating t (the curve parameter) and v₀ (which parameterizes the family of initial points) gives the required solution

ε₀ = tr{Λ⁻¹ + [v + n/(σ² + ε₀)]I}⁻¹.

The LC approximation, equation 2.12, is obtained by setting v = 0. For the first-order correction ε₁, we have to solve equation 4.3,

∂ε₁/∂n − [1/(σ² + ε₀)] ∂ε₁/∂v + [ε₁/(σ² + ε₀)²] ∂ε₀/∂v = −(1/2) ∂²ε₀/∂n²,

with the initial condition (explained in the main text) ε₁(n = 0, v) = 0. Hence, a suitable family of initial solution points is (n(0) = 0, v(0) = v₀, ε₁(0) = 0). The characteristic curves must obey

dn/dt = 1,   dv/dt = −1/(σ² + ε₀),
dε₁/dt = −[ε₁/(σ² + ε₀)²] ∂ε₀/∂v − (1/2) ∂²ε₀/∂n².
The solutions for n(t) and v(t) are the same as before, and as a result, ε₀ is again constant along the characteristic curves. For the derivatives of ε₀ that appear, one finds after some algebra,

∂ε₀/∂v = −(σ² + ε₀)² a₂/(1 − na₂),
∂²ε₀/∂n² = −2(σ² + ε₀)(a₂² − a₃)/(1 − na₂)³.

Here we have used the definitions from equation 4.4. Because both a₂ and a₃ depend on n and v only through the combination v + n/(σ² + ε₀), they are also constant along the characteristic curves. Using also that n = t from equation A.1, the equation for ε₁ becomes

dε₁/dt − a₂ε₁/(1 − ta₂) = (σ² + ε₀)(a₂² − a₃)/(1 − ta₂)³.

This linear differential equation is easily integrated; using the initial condition ε₁(0) = 0, one finds

ε₁ = (σ² + ε₀) t(a₂² − a₃)/(1 − ta₂)².

Eliminating t again via t = n finally gives the solution, equation 4.4.
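The self-consistency equation for the LC approximation (at v = 0) is easy to evaluate numerically once the spectrum is truncated. The Python sketch below solves ε₀ = tr{Λ⁻¹ + [n/(σ² + ε₀)]I}⁻¹ by fixed-point iteration; the geometric toy spectrum, the noise level, and the function name `lc_approximation` are illustrative assumptions, not taken from the text.

```python
import numpy as np

def lc_approximation(lam, n, sigma2, tol=1e-12, max_iter=1000):
    """Fixed-point solution of eps0 = tr{Lambda^-1 + [n/(sigma2 + eps0)] I}^-1.

    lam: 1-D array of covariance eigenvalues (a finite truncation of the
    spectrum); n: number of training examples; sigma2: noise variance.
    """
    eps0 = lam.sum()                      # exact value at n = 0 is tr Lambda
    for _ in range(max_iter):
        new = np.sum(1.0 / (1.0 / lam + n / (sigma2 + eps0)))
        if abs(new - eps0) < tol:
            break
        eps0 = new
    return float(eps0)

# Toy spectrum: geometrically decaying eigenvalues, normalized so tr Lambda = 1.
lam = 0.5 ** np.arange(1, 30)
lam = lam / lam.sum()
curve = [lc_approximation(lam, n, sigma2=0.05) for n in (0, 1, 10, 100)]
```

Starting the iteration from ε₀ = tr Λ (the exact n = 0 value) gives a monotonically decreasing, convergent sequence for a spectrum like this, so the resulting learning curve decreases with n as expected.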
Appendix B: Eigenvalue Spectra for the Example Scenarios
We explain first how the covariance functions with periodic boundary conditions for x ∈ [0, 1]^d are constructed. Consider first the case d = 1. The periodic RBF covariance function is defined as

C(x, x′) = Σ_{r=−∞}^{∞} c(x − x′ − r),   (B.1)
where c(x − x′) = exp[−|x − x′|²/(2l²)] is the original covariance function. For the periodic OU case, we use instead c(x − x′) = exp(−|x − x′|/l). One sees that for sufficiently small l (l ≪ 1), only the r = 0 term makes a significant contribution, except when x and x′ are within ≈ l of opposite ends of the input space (so that either x − x′ + 1 or x − x′ − 1 is of order l). We therefore expect the periodic covariance functions and the conventional nonperiodic ones to yield very similar learning curves, as long as the length scale of the covariance function is smaller than the size of the input domain. The advantage of having a periodic covariance function is that its eigenfunctions are simple Fourier waves and the eigenvalues can be calculated by Fourier transformation. This can be seen as follows. For the assumed uniform input distribution on [0, 1], the defining equation for an eigenfunction φ(x) with eigenvalue λ is, from equation 2.2,

⟨C(x, x′)φ(x′)⟩_{x′} = ∫₀¹ dx′ C(x, x′)φ(x′) = λφ(x).
Inserting equation B.1 and assuming that φ(x) is continued periodically outside the interval [0, 1], this becomes

Σ_{r=−∞}^{∞} ∫₀¹ dx′ c(x − x′ − r)φ(x′) = Σ_{r=−∞}^{∞} ∫_r^{r+1} dx′ c(x − x′)φ(x′ − r)
  = ∫_{−∞}^{∞} dx′ c(x − x′)φ(x′) = λφ(x).

It is well known that the solutions of this eigenfunction equation are Fourier waves φ_q(x) = e^{2πiqx} for integer (positive or negative) q, with corresponding eigenvalues

λ_q = ∫_{−∞}^{∞} dx c(x) e^{−2πiqx}.
The eigenvalues λ_q are real since c(x) = c(−x) (this follows from the requirement that the covariance function C(x, x′) must be symmetric). The eigenfunctions for q ≠ 0 are complex in the form given but can be made explicitly real by linearly transforming the pair φ_q(x) and φ_{−q}(x) into √2 cos(2πqx) and √2 sin(2πqx). All of the above generalizes immediately to higher input dimension d. One defines

C(x, x′) = Σ_r c(x − x′ − r),
where r now runs over all d-dimensional vectors with integer components; the argument x − x′ − r of c(·) is now a d-dimensional real vector. The eigenvalues of this periodic covariance function are then given by

λ_q = ∫ dx c(x) e^{−2πi q·x}.

They are indexed by d-dimensional integer vectors q; the integration is over all real-valued d-dimensional vectors x, and q·x is the conventional dot product between vectors. Explicitly, one derives that for the periodic RBF covariance function,

λ_q = (2π)^{d/2} l^d e^{−(2πl)² ||q||² / 2}.

For the periodic OU covariance function, on the other hand, one has

λ_q = κ_d l^d [1 + (2πl)² ||q||²]^{−(d+1)/2},   (B.2)

with κ_d = 2, 2π, 8π for d = 1, 2, 3, respectively; for general d, κ_d = π^{(d−1)/2} 2^d Γ((d+1)/2), where Γ(z) = (z − 1)! is the gamma function. All the bounds and approximations in principle require traces over the whole eigenvalue spectrum, corresponding to sums over an infinite number of terms. Numerically, we perform the sums over all eigenvalues up to some suitably large maximal value q_max of ||q||. The remaining small-eigenvalue tail of the spectrum is then treated by approximating ||q|| as a continuous variable q̃ and integrating over it from q_max to infinity, with the appropriate weighting for the number of vectors q in a shell q̃ ≤ ||q|| ≤ q̃ + dq̃. To check that this procedure gave accurate results, we always verified that the numerically calculated tr Λ agreed with the known value of C(x, x). The third scenario we consider is that of a conventional RBF kernel C(x, x′) = exp[−||x − x′||²/(2l²)] with a nonuniform input distribution that we assume to be an isotropic zero-mean gaussian, P(x) ∝ exp[−||x||²/(2σ_x²)]. The eigenfunctions and eigenvalues are worked out by Zhu, Williams, Rohwer, and Morciniec (1998) for the case d = 1; the eigenvalues are labeled by a nonnegative integer q and given by

λ_q = (1 − b) b^q,
where

b⁻¹ = 1 + r/2 + √(r²/4 + r),   r = l²/σ_x².

As expected, only the ratio r of the length scales l and σ_x enters, since the overall scale of the inputs is immaterial. (To avoid this trivial invariance, we fixed σ_x² = 1/12; this specific value gives the same variance for each component of x as for the uniform distribution over [0, 1]^d used in the other scenarios.) For d > 1, this result generalizes immediately because both the covariance function and the input distribution factorize over the different input components (Opper & Vivarelli, 1999; Williams & Seeger, 2001). The eigenvalues are therefore just products of the eigenvalues for each component and indexed by a d-dimensional vector q of nonnegative integers:

λ_q = (1 − b)^d b^s,   s = Σ_{i=1}^d q_i.   (B.3)

One sees that the eigenvalues will come in blocks: all vectors q with the same s = Σ_i q_i give the same λ_q. Numerically, we therefore only have to store the different eigenvalues and their multiplicities, which can be shown to be (d + s − 1)!/[s!(d − 1)!] (see footnote 9). With this trick, so many eigenvalues can be treated by direct summation that a separate treatment of the neglected eigenvalues (via an integral, as above) is unnecessary.
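Two of the checks described in this appendix — that the numerically computed tr Λ reproduces C(x, x), and the block multiplicity formula (d + s − 1)!/[s!(d − 1)!] — can be reproduced in a few lines of Python. The parameter values (l = 0.1 for the periodic RBF case; l = 0.2 and σ_x² = 1/12 for the gaussian-input case) and the truncation limits are illustrative assumptions.

```python
import math
import numpy as np

# Periodic RBF eigenvalues in d = 1: lambda_q = sqrt(2*pi)*l*exp(-(2*pi*l*q)**2 / 2).
l = 0.1
q = np.arange(-60, 61)
lam_rbf = math.sqrt(2 * math.pi) * l * np.exp(-((2 * math.pi * l * q) ** 2) / 2)
trace_rbf = lam_rbf.sum()      # should match C(x, x) = sum_r c(r), which is ~1 for l << 1

def multiplicity(d, s):
    """Number of d-vectors of nonnegative integers q with sum(q) = s."""
    return math.comb(d + s - 1, d - 1)

# Gaussian-input RBF scenario, eq. B.3: eigenvalues (1-b)^d * b^s come in blocks.
r = 0.2 ** 2 / (1.0 / 12)                          # r = l^2 / sigma_x^2
b = 1.0 / (1 + r / 2 + math.sqrt(r ** 2 / 4 + r))
d = 3
trace_gauss = sum(multiplicity(d, s) * (1 - b) ** d * b ** s for s in range(200))
# trace_gauss should equal C(x, x) = 1, since sum_s multiplicity(d, s)*b^s = (1-b)^(-d)
```

The first trace check works because, by Poisson summation, the sum of the Fourier-transform eigenvalues equals Σ_r c(r), which for l ≪ 1 is 1 up to exponentially small wrap-around terms.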
Acknowledgments
We thank Chris Williams, Manfred Opper, and Dörte Malzahn for stimulating discussions and the Royal Society for financial support through a Dorothy Hodgkin Research Fellowship.
9. There is a nice combinatorial way of seeing this. Imagine a row of d + s + 1 holes. Each hole is either empty or contains one ball, except the holes at the two ends, which have one ball placed in them permanently. Now consider d − 1 identical balls distributed across the d + s − 1 free holes, and let q_i (i = 1, ..., d) be the number of empty holes between balls i − 1 and i (where balls 0 and d are the fixed balls in the holes at the left and right end, respectively). The q_i are nonnegative integers, and their sum equals the number of holes left empty, giving Σ_i q_i = (d + s − 1) − (d − 1) = s. The number of different assignments of the q_i is thus identical to the number of different arrangements of d − 1 identical balls across d + s − 1 holes; hence the result.
References

Barber, D., & Williams, C. K. I. (1997). Gaussian processes for Bayesian classification via hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 340–346). Cambridge, MA: MIT Press.
Goldberg, P. W., Williams, C. K. I., & Bishop, C. M. (1998). Regression with input-dependent noise: A gaussian process treatment. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 493–499). Cambridge, MA: MIT Press.
John, F. (1978). Partial differential equations (3rd ed.). New York: Springer-Verlag.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Malzahn, D., & Opper, M. (2001). Learning curves for gaussian processes regression: A framework for good approximations. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 273–279). Cambridge, MA: MIT Press.
Micchelli, C. A., & Wahba, G. (1981). Design problems for optimal surface interpolation. In Z. Ziegler (Ed.), Approximation theory and applications (pp. 329–348). Orlando, FL: Academic Press.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG-TR-93-1). Toronto: University of Toronto.
Opper, M. (1997). Regression with gaussian processes: Average case performance. In K.-Y. M. Wong, I. King, & D.-Y. Yeung (Eds.), Theoretical aspects of neural computation: A multidisciplinary perspective (pp. 17–23). New York: Springer.
Opper, M., & Vivarelli, F. (1999). General bounds on Bayes errors for regression with gaussian processes. In M. Kearns, S. A. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 302–308). Cambridge, MA: MIT Press.
Plaskota, L. (1990). On the average case complexity of linear problems with noisy information. Journal of Complexity, 6, 199–230.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Ritter, K. (1996). Almost optimal differentiation using noisy data. Journal of Approximation Theory, 86(3), 293–309.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47(1), 1–52.
Sollich, P. (1994). Finite-size effects in learning and generalization in linear perceptrons. Journal of Physics A, 27, 7771–7784.
Sollich, P. (1999a). Approximate learning curves for gaussian processes. In ICANN99—Ninth International Conference on Artificial Neural Networks (pp. 437–442). London: Institution of Electrical Engineers.
Sollich, P. (1999b). Learning curves for gaussian processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 344–350). Cambridge, MA: MIT Press.
Sollich, P. (2001). Generalization of Plaskota's bound for gaussian process learning curves (Tech. Rep.). London: King's College London. Available on-line: www.mth.kcl.ac.uk/~psollich/publications/.
Stein, M. L. (1989). Comment on the paper by Sacks, J., et al.: Design and analysis of computer experiments. Statistical Science, 4, 432–433.
Trecate, G. F., Williams, C. K. I., & Opper, M. (1999). Finite-dimensional approximation of gaussian processes. In M. Kearns, S. A. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 218–224). Cambridge, MA: MIT Press.
Williams, C. K. I. (1997). Computing with infinite networks. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 295–301). Cambridge, MA: MIT Press.
Williams, C. K. I. (1998). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan (Ed.), Learning and inference in graphical models (pp. 599–621). Norwell, MA: Kluwer Academic.
Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 514–520). Cambridge, MA: MIT Press.
Williams, C. K. I., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 682–688). Cambridge, MA: MIT Press.
Williams, C. K. I., & Vivarelli, F. (2000). Upper and lower bounds on the learning curve for gaussian processes. Machine Learning, 40(1), 77–102.
Wong, E. (1971). Stochastic processes in information and dynamical systems. New York: McGraw-Hill.
Zhu, H., Williams, C. K. I., Rohwer, R. J., & Morciniec, M. (1998). Gaussian regression and optimal finite dimensional linear models. In C. M. Bishop (Ed.), Neural networks and machine learning. New York: Springer-Verlag.

Received May 10, 2001; accepted September 28, 2001.
LETTER
Communicated by Nicol Schraudolph
A Global Optimum Approach for One-Layer Neural Networks

Enrique Castillo
[email protected]
Department of Applied Mathematics and Computational Sciences, University of Cantabria and University of Castilla-La Mancha, 39005 Santander, Spain

Oscar Fontenla-Romero
[email protected]
Bertha Guijarro-Berdiñas
[email protected]
Amparo Alonso-Betanzos
[email protected]
Department of Computer Science, Faculty of Informatics, University of A Coruña, 15071 A Coruña, Spain
The article presents a method for learning the weights in one-layer feedforward neural networks that minimizes either the sum of squared errors or the maximum absolute error, measured in the input scale. This leads to the existence of a global optimum that can be easily obtained by solving linear systems of equations or linear programming problems, using much less computational power than standard methods require. Another version of the method allows computing a large set of estimates for the weights, providing robust (mean or median) estimates for them, together with the associated standard errors, which give a good measure of the quality of the fit. Later, the standard one-layer neural network algorithms are improved by learning the neural functions instead of assuming them known. A set of example applications is used to illustrate the methods. Finally, a comparison with other high-performance learning algorithms shows that the proposed methods are at least 10 times faster than the fastest standard algorithm used in the comparison.
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1429–1449 (2002).

1 Introduction

Single-layer networks were widely studied in the 1960s, and their history has been reviewed in several places (Widrow & Lehr, 1990). For a one-layer linear network, the weight values minimizing the sum-of-squares error function can be found in terms of the pseudo-inverse of a matrix (Bishop, 1995). Nevertheless, if a nonlinear activation function, such as a sigmoid or hyperbolic tangent, is used, or if a different error function is considered, a closed-form solution is no longer possible. However, if the activation function is differentiable, as is the case for the mentioned functions, the derivatives of the error function with respect to the weight parameters can be easily evaluated. These derivatives can then be used in a variety of gradient-based optimization algorithms for efficiently finding the minimum of the error function. The performance of these algorithms depends on several design choices, such as the selection of the step size and momentum terms, most of which are currently made by rules of thumb and trial and error. In addition to this problem, it is possible for the one-layer neural network to get stuck in a local minimum, as shown by Brady, Raghavan, and Slawny (1989), Sontag and Sussmann (1989), and Budinich and Milotti (1992). The existence of this problem was mathematically demonstrated by Sontag and Sussmann (1989), who provided a neural network example without hidden layers and with a sigmoid transfer function for which the sum of squared errors has a local minimum that is not a global minimum. They observed that the existence of local minima is due to the fact that the error function is the superposition of functions whose minima are at different points. In the case of linear neurons, all of these functions are convex, so no difficulties appear, because a sum of convex functions is also convex. However, sigmoidal units lead to nonconvex functions, so it is not guaranteed that their sums will have a unique minimum. Moreover, it was shown (Auer, Herbster, & Warmuth, 1996) that the number of such minima can grow exponentially with the input dimension d.
In particular, they proved that, for the squared error and the logistic neural function, the error function of a single neuron for n training examples may have ⌊n/d⌋^d local minima. In fact, this holds for any error and transfer functions whose composition is continuous and has a bounded range. The absence of local minima under certain separability or linear independence conditions (Budinich & Milotti, 1992; Gori & Tesi, 1992), or under modifications of the least-squares error norm (Sontag & Sussmann, 1991), has also been proven. However, the problem of characterizing these extrema for the standard least-squares error, using only a finite number of arbitrary inputs, has not been previously solved, nor is there a clear understanding of the conditions that cause the appearance of local minima. Coetzee and Stonick (1996) proposed a method for the a posteriori evaluation, in single-layer perceptrons, of whether a weight solution is unique or globally optimal, and for a priori scaling of desired vector values to ensure uniqueness through analysis of the input data. Although these results are potentially useful for evaluating optimality and uniqueness, the minima can be characterized only after training is complete. In this article, a new training algorithm is proposed in order to avoid the problem of local minima in a supervised one-layer neural network with
nonlinear activation functions. In section 2, we describe the rationale of the problem. In section 3, the method for learning the weights of the network, leading to a globally optimal solution, is presented. This approach is later enhanced by learning the neural functions as well. Section 4 deals with an alternative learning method that allows studying the variability of the different weights and indirectly testing the quality of the network. Section 5 gives some examples of applications to illustrate how the proposed methods can be used in practice. In section 6, the proposed methods are compared with other high-performance standard methods in terms of learning speed. Finally, section 7 gives some conclusions and recommendations.

2 Motivation
Consider the neural network in Figure 1a, where it is assumed that the nonlinear neural functions f₁, f₂, ..., f_J are invertible. The set of equations relating inputs and outputs is given by

y_js = f_j(w_j0 + Σ_{i=1}^I w_ji x_is);   j = 1, 2, ..., J;   s = 1, 2, ..., S,   (2.1)
where w_j0 and w_ji, i = 1, 2, ..., I, are the threshold value and the weights associated with neuron j (for j = 1, 2, ..., J), and S is the number of data points. System 2.1 has J × S equations in J × (I + 1) unknowns. However, since the number of data points is large (S ≫ I + 1), in practice this set of equations in the w_ji is not compatible and consequently has no solution. Thus, the usual approach is to consider some errors δ_js; equation 2.1 is transformed into

δ_js = y_js − f_j(w_j0 + Σ_{i=1}^I w_ji x_is);   j = 1, 2, ..., J;   s = 1, 2, ..., S,   (2.2)
and, to estimate (learn) the weights, the sum of squared errors,

Q₁ = Σ_{s=1}^S Σ_{j=1}^J δ²_js = Σ_{s=1}^S Σ_{j=1}^J (y_js − f_j(w_j0 + Σ_{i=1}^I w_ji x_is))²,   (2.3)
is minimized. It is important to note that, due to the presence of the neural functions f_j, the function appearing in equation 2.3 is nonlinear in the w_ji weights. Thus, minimization of Q₁ is not guaranteed to reach the global optimum. Existing methods for learning the weights in neural networks have two shortcomings. First, the function Q₁ to be optimized has several (usually many) local optima. This implies that the users cannot know whether they
Figure 1: (a) Original one-layer feedforward neural network. (b–d) J independent neural networks leading to the J independent systems of linear equations in equation 3.4.
are in the presence of a global optimum, nor how far from it the local optimum resulting from the learning process is. In fact, several users of the same computer program can obtain, for the same data and model, different solutions if they use different starting sets of weights in the learning procedure. Second, the learning process requires a huge amount of computational power compared with other models. Since, for each j, the weights w_j0 and w_ji, i = 1, 2, ..., I, are related only to y_js and appear only in S of the J × S equations in equation 2.2, it is evident that the problem of learning the weights can be separated into J independent
problems (one for each j). This means that the neural network in Figure 1a can be split into J independent, simpler neural networks (see Figures 1b, 1c, and 1d). Thus, the problem becomes much simpler. In what follows, we deal with only one of these problems (for a fixed j).

3 Proposed Method for Learning the Weights Based on Linear Systems of Equations
Motivated by the problems indicated in the previous section, an alternative approach for learning the weights is proposed. The system of equations 2.2, which measures the errors in the output scale (the units of the y_js), can be written as

ε_js = w_j0 + Σ_{i=1}^I w_ji x_is − f_j⁻¹(y_js);   s = 1, 2, ..., S;   j = 1, 2, ..., J,   (3.1)
which measures the errors in terms of the input scale (units of the x_is). We then have two possibilities for learning the weights:

1. The first option is to minimize the sum of squared errors,

Q_2j = Σ_{s=1}^S ε²_js = Σ_{s=1}^S (w_j0 + Σ_{i=1}^I w_ji x_is − f_j⁻¹(y_js))²,   (3.2)
which leads to the system of linear equations

∂Q_2j/∂w_jp = 2 Σ_{s=1}^S (w_j0 + Σ_{i=1}^I w_ji x_is − f_j⁻¹(y_js)) x_ps = 0   if p > 0,
∂Q_2j/∂w_jp = 2 Σ_{s=1}^S (w_j0 + Σ_{i=1}^I w_ji x_is − f_j⁻¹(y_js)) = 0   if p = 0,   (3.3)

which can be written as

Σ_{i=1}^I (Σ_{s=1}^S x_is x_ps) w_ji = Σ_{s=1}^S (f_j⁻¹(y_js) − w_j0) x_ps;   p = 1, 2, ..., I,
Σ_{i=1}^I (Σ_{s=1}^S x_is) w_ji = Σ_{s=1}^S (f_j⁻¹(y_js) − w_j0).   (3.4)
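Because the errors are measured in the input scale, option 1 amounts to ordinary linear least squares on the transformed targets f_j⁻¹(y_js). A minimal NumPy sketch for a single output unit, using f⁻¹ = arctan as in the paper's examples (the data generation and the function name are illustrative assumptions):

```python
import numpy as np

def fit_input_scale(X, y, f_inv=np.arctan):
    """Global least-squares weights for one unit j: solves the linear
    problem behind equations 3.2-3.4 via a stacked design matrix."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # columns [1, x_1s, ..., x_Is]
    z = f_inv(y)                                   # targets measured in input scale
    w, *_ = np.linalg.lstsq(A, z, rcond=None)
    return w                                       # w[0] = w_j0, w[1:] = w_ji

# Recover known weights from noiseless data generated with f = tan.
rng = np.random.default_rng(1)
X = rng.uniform(-0.5, 0.5, size=(200, 3))
w_true = np.array([0.2, -0.4, 0.3, 0.1])
y = np.tan(w_true[0] + X @ w_true[1:])
w_hat = fit_input_scale(X, y)
```

For a full-rank design matrix, `lstsq` returns the unique global minimizer of Q_2j, which is the point of the method: no local minima can occur.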
2. Alternatively, we can minimize the maximum absolute error,

ε = min_w max_s |w_j0 + Σ_{i=1}^I w_ji x_is − f_j⁻¹(y_js)|,   (3.5)
which can be stated as the following linear programming problem: minimize ε subject to

w_j0 + Σ_{i=1}^I w_ji x_is − ε ≤ f_j⁻¹(y_js),
−w_j0 − Σ_{i=1}^I w_ji x_is − ε ≤ −f_j⁻¹(y_js);   s = 1, 2, ..., S;   (3.6)
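In practice one would hand problem 3.6 to an LP solver; purely for illustration, the Python sketch below instead follows the remark about active constraints directly and enumerates the vertices of the feasible polytope (triples of active constraints) for a one-input unit. This brute force is only viable for tiny S; the data and the function name are assumptions, not taken from the text.

```python
import numpy as np
from itertools import combinations

def minimax_line(x, z):
    """Chebyshev (minimax) fit z ~ w0 + w1*x, i.e. problem 3.6 with I = 1.

    Enumerates basic solutions (three active constraints) instead of
    calling an LP solver; returns the vertex (w0, w1, eps) minimizing eps."""
    S = len(x)
    # Rows encode: +(w0 + w1 x_s) - eps <= z_s  and  -(w0 + w1 x_s) - eps <= -z_s
    A = np.vstack([np.column_stack([np.ones(S), x, -np.ones(S)]),
                   np.column_stack([-np.ones(S), -x, -np.ones(S)])])
    c = np.concatenate([z, -z])
    best = None
    for idx in combinations(range(2 * S), 3):
        M, r = A[list(idx)], c[list(idx)]
        if abs(np.linalg.det(M)) < 1e-12:
            continue                        # degenerate triple, no vertex
        sol = np.linalg.solve(M, r)         # the three constraints hold with equality
        if np.all(A @ sol <= c + 1e-9) and (best is None or sol[2] < best[2]):
            best = sol                      # feasible vertex with smaller eps
    return best

x = np.array([0.0, 1.0, 2.0, 3.0])
z = np.array([0.0, 1.0, 2.0, 4.0])
w0, w1, eps = minimax_line(x, z)
```

Since an LP attains its optimum at a vertex of the feasible region, taking the minimum of ε over all feasible vertices reproduces the global optimum that a simplex-type solver would find.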
The global optimum can be easily obtained by well-known linear programming techniques. In addition, it is easy to find the set of all possible solutions leading to the global optimum. This can be achieved by testing which constraints in equation 3.6 are active and solving the corresponding systems of linear equations to find the extreme points of the solution polytope. Consequently, we obtain the global optimum by solving a linear system of equations or a linear programming problem, methods that require much less computational power than minimizing Q₁ in equation 2.3.

3.1 Learning the Neural Functions. An important improvement is obtained if we learn the f_j⁻¹ functions instead of assuming that they are known. More precisely, they can be assumed to be linear convex combinations of a set
{φ₁(x), φ₂(x), ..., φ_R(x)} of invertible basic functions,

f_j⁻¹(x) = Σ_{r=1}^R a_jr φ_r(x);   j = 1, 2, ..., J,   (3.7)
where {a_jr; r = 1, 2, ..., R} is the set of coefficients associated with function f_j⁻¹, which has to be chosen for the resulting function f_j⁻¹ to be invertible. Without loss of generality, it can be assumed to be increasing. Then, as before, we have two options:

1. Minimize, with respect to w_ji, i = 0, 1, ..., I, and a_jr, r = 1, 2, ..., R, the function

Q_2j = Σ_{s=1}^S ε²_js = Σ_{s=1}^S (w_j0 + Σ_{i=1}^I w_ji x_is − Σ_{r=1}^R a_jr φ_r(y_js))²,   (3.8)

subject to

Σ_{r=1}^R a_jr φ_r(y_js₁) ≤ Σ_{r=1}^R a_jr φ_r(y_js₂);   ∀ y_js₁ < y_js₂,
which forces the candidate functions f_j⁻¹ to be increasing, at least in the corresponding intervals.

2. Alternatively, for each j = 1, 2, ..., J, we can minimize the maximum absolute error,

ε = min_{w,a} max_s |w_j0 + Σ_{i=1}^I w_ji x_is − Σ_{r=1}^R a_jr φ_r(y_js)|,   (3.9)

which can be stated as the following linear programming problem: minimize ε subject to

w_j0 + Σ_{i=1}^I w_ji x_is − Σ_{r=1}^R a_jr φ_r(y_js) − ε ≤ 0,
−w_j0 − Σ_{i=1}^I w_ji x_is + Σ_{r=1}^R a_jr φ_r(y_js) − ε ≤ 0,
Σ_{r=1}^R a_jr φ_r(y_js₁) ≤ Σ_{r=1}^R a_jr φ_r(y_js₂);   ∀ y_js₁ < y_js₂,
Σ_{r=1}^R a_jr φ_r(y₀) = 1,   (3.10)
which has a global optimum that is easily obtainable. In order to avoid transformations with a large derivative, it is convenient to limit the first derivative of the f_j⁻¹(y_js) transformation. This can be done by adding the extra constraints

df_j⁻¹(y)/dy = Σ_{r=1}^R a_jr φ_r′(y) ≥ b;   j = 1, 2, ..., J,   (3.11)

where b is a lower bound on the derivative of the inverse function; this corresponds to an upper bound 1/b on the derivative of the original function, since the derivative of the inverse function is the reciprocal of the derivative of the original one. Once the f_j⁻¹ have been obtained, to use the network it is necessary to work with their inverses, which can be computed using, for example, the bisection method. To avoid problems with the dimensions of the data, it is strongly suggested that the data be transformed to the unit hypercube.

4 Alternative Learning Method: Variability Study
If we write equation 2.1 as

w_j0 + Σ_{i=1}^I w_ji x_is = f_j⁻¹(y_js);   s = 1, 2, ..., S,   (4.1)
since the number of unknowns is I + 1, we realize that a given set with I + 1 data points is enough (apart from degenerate linear systems) for learning the weights {w_ji; i = 0, 1, ..., I}. The proposed alternative consists of learning the weights with a selected (deterministically or randomly) class of subsets, each containing v ≥ I + 1 data points, and determining the variability of the different weights based on the resulting values. To proceed, repeat steps 1 and 2 below a large number of times, that is, from r = 1 to r = m with m large, and then proceed to steps 3 and 4:

Step 1: Select, at random, a subset D_r of v ≥ I + 1 data points from the initial set of S data points.

Step 2: Use the systems of equations 3.4 or 3.6, or solve the linear programming problems in equations 3.8 and 3.10, to obtain the corresponding set of weights {w_jir; i = 0, 1, ..., I}.

Step 3: Calculate the means μ_ji, medians M_ji, and standard deviations σ_ji of the sets {w_jir}, ∀i.

Step 4: Return ŵ_ji = μ_ji or ŵ_ji = M_ji as the estimates of w_ji, and σ_ji as a measure of the precision of the estimate of w_ji.
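The steps above can be sketched in a few lines of NumPy, with step 2 done by least squares (the linear system 3.4) and, again, f⁻¹ = arctan as in the paper's examples. The subset size, the data generation, and all names are illustrative assumptions; on noiseless data every subset recovers the same weights, so the standard deviations come out near zero.

```python
import numpy as np

def variability_study(X, y, m=200, v=None, f_inv=np.arctan, seed=0):
    """Steps 1-4 of the variability study for one output unit."""
    rng = np.random.default_rng(seed)
    S, I = X.shape
    v = I + 1 if v is None else v
    A = np.hstack([np.ones((S, 1)), X])        # bias column for w_j0
    z = f_inv(y)                               # targets in the input scale
    W = np.empty((m, I + 1))
    for r in range(m):
        sub = rng.choice(S, size=v, replace=False)              # step 1
        W[r], *_ = np.linalg.lstsq(A[sub], z[sub], rcond=None)  # step 2
    # steps 3-4: means/medians as estimates, standard deviations as precision
    return W.mean(axis=0), np.median(W, axis=0), W.std(axis=0)

rng = np.random.default_rng(2)
X = rng.uniform(-0.5, 0.5, size=(300, 3))
w_true = np.array([0.2, -0.4, 0.3, 0.1])
y = np.tan(w_true[0] + X @ w_true[1:])          # noiseless data with f = tan
mean_w, med_w, sd_w = variability_study(X, y, v=20)
```

With noisy data the standard deviations σ_ji become informative, which is precisely the quality measure the section proposes.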
5 Example Applications
In this section, we illustrate the proposed methods by applying them to well-known data sets. The first two cases, in sections 5.1 and 5.2,¹ correspond to classification problems and are used to verify the soundness of the learning methods proposed in sections 3 and 4. The last two examples, in sections 5.3 and 5.4,² are regression problems employed to compare the approach proposed in section 3, the improved method presented in section 3.1 that allows the learning of the neural functions, and the Levenberg-Marquardt backpropagation method (Hagan & Menhaj, 1994). In order to estimate the true error of the neural networks, leave-one-out cross-validation was used, except in the case of example 5.3.

5.1 The Fisher Iris Data Example. Our first example uses the Iris data (Fisher, 1936), perhaps one of the best-known databases in the pattern recognition literature. Fisher's article is a classic in the field and continues to be referenced frequently. The data set contains three classes of 50 instances

¹ Data obtained from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
² Data obtained from the Working Group on Data Modeling Benchmarks of the IEEE Neural Networks Council Standards Committee (http://neural.cs.nthu.edu.tw/jang/benchmark).
A Global Optimum Approach for One-Layer Neural Networks
1437
Table 1: Estimated Weights for Iris Data and Associated Standard Deviations.

Weight    Value         Mean       Median     SD
w10        0.9704812     0.9642     0.9567    0.0614
w11       -0.0273781    -0.0263    -0.0262    0.0149
w12       -0.0498872    -0.0495    -0.0500    0.0170
w13        0.0722752     0.0713     0.0717    0.0128
w14        0.0979362     0.0996     0.0985    0.0199

Note: Equation 3.6 and the method in section 4 are used.
each, where each class refers to a type of iris plant (Setosa, Versicolour, and Virginica). One class is linearly separable from the other two, which are not linearly separable from each other. The attribute information used to determine the type of plant (1 for Setosa, 2 for Versicolour, and 3 for Virginica) consists of sepal length in centimeters, sepal width in centimeters, petal length in centimeters, and petal width in centimeters. We have used the neural network described in section 2 and Figure 1 and the learning method described in section 3. The neural function has been assumed to be f^(-1)(x) = arctan(x); this neural function was also used in all the following examples. The resulting value of the mean squared error was 0.0337, and the obtained weights are shown in the second column of Table 1. The dot plot in Figure 2B shows the value of the output for each of the 150 test examples employed in the leave-one-out cross-validation and the cutoff discriminating values (dashed lines) for the three groups. The resulting classification rule is as follows:

1. If tan(w_10 + Σ_{i=1}^{4} w_1i x_is) ≤ 1.4, the plant s is in class Setosa.

2. If 1.4 < tan(w_10 + Σ_{i=1}^{4} w_1i x_is) ≤ 2.365, the plant s is in class Versicolour.

3. If 2.365 < tan(w_10 + Σ_{i=1}^{4} w_1i x_is), the plant s is in class Virginica.

With these criteria, 98.67% of the plants are classified correctly. The confusion matrix in Table 2 shows the number of cases misclassified. Next, the alternative learning method described in section 4 was used. One hundred subsets of 50 randomly selected data points were used to learn the set of weights {w_1i ; i = 0, 1, 2, 3, 4}. Their means, medians, and standard deviations and their associated empirical cumulative distribution functions are shown in Table 1 and Figure 2A. A comparison of the different weight estimates shows that they are very similar. On the other hand, the low values
Enrique Castillo, et al.
Figure 2: (A) Empirical cumulative distribution functions of the weights. (B) Plot for the iris data example.
Table 2: Confusion Matrix for the Iris Data.

System            True Classification
Classification    Setosa   Versicolour   Virginica
Setosa              50          0            0
Versicolour          0         49            1
Virginica            0          1           49
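The section 5.1 classification rule translates directly into code. The weights below are the Table 1 values, and the cutoffs 1.4 and 2.365 are the dashed discriminating lines of Figure 2B; the function name and the sample measurements in the usage note are ours.

```python
import math

# Weights from the second column of Table 1: w10, w11, w12, w13, w14.
W = [0.9704812, -0.0273781, -0.0498872, 0.0722752, 0.0979362]

def classify_iris(x):
    """x = (sepal length, sepal width, petal length, petal width), in cm."""
    u = math.tan(W[0] + sum(w * xi for w, xi in zip(W[1:], x)))
    if u <= 1.4:
        return "Setosa"
    if u <= 2.365:
        return "Versicolour"
    return "Virginica"
```

For a typical Setosa measurement such as (5.1, 3.5, 1.4, 0.2), the network output tan(·) lands near the class code 1 and the rule returns Setosa.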
Table 3: Attributes Used in Breast Cancer Data.

Attribute                      Range
Clump thickness                1–10
Uniformity of cell size        1–10
Uniformity of cell shape       1–10
Marginal adhesion              1–10
Single epithelial cell size    1–10
Bare nuclei                    1–10
Bland chromatin                1–10
Normal nucleoli                1–10
Mitoses                        1–10
of the standard deviations suggest that the model is adequate for the Iris data set.

5.2 Breast Cancer Data. The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from William H. Wolberg (Bennett & Mangasarian, 1992). It consists of 699 instances (458, or 65.5%, benign and 241, or 34.5%, malignant), 16 of which have some missing data. We have used only the data with complete information: 683 instances. The meanings of the nine attributes used to determine the class (0 for benign, 1 for malignant) are shown in Table 3. The dot plot in Figure 3B shows the value of the output for each of the 683 test examples employed in the leave-one-out cross-validation and the cutoff discriminating values (dashed lines) for the two classes. The resulting classification criterion is as follows:

1. If tan(w_10 + Σ_{i=1}^{9} w_1i x_is) ≤ 0.28, the cancer s is classified as malignant.

2. If tan(w_10 + Σ_{i=1}^{9} w_1i x_is) > 0.28, the cancer s is classified as benign.

The number of cases classified correctly is 96.92%. The confusion matrix in Table 4 shows the number of misclassified data for each of the classes. Using the methods described in sections 3 and 4, the weight estimates with their standard deviations, shown in Table 5, have been obtained for 100 sub-
Figure 3: (A) Empirical cumulative distribution functions of the weights. (B) Plot for the breast cancer data.
Table 4: Confusion Matrix for Breast Cancer Data.

System            True Classification
Classification    Benign   Malignant
Benign              432         9
Malignant            12       230
Table 5: Estimated Weights for Breast Cancer Data and Associated Standard Deviations.

Weight   Value     Mean      Median    SD
w10      1.0530    1.0415    1.0411    0.0128
w11      0.0069    0.0112    0.0109    0.0030
w12      0.0048   -0.0028   -0.0015    0.0074
w13      0.0034    0.0089    0.0074    0.0070
w14      0.0018    0.0036    0.0030    0.0041
w15      0.0022    0.0031    0.0029    0.0047
w16      0.0099    0.0089    0.0089    0.0035
w17      0.0042    0.0025    0.0025    0.0047
w18      0.0041    0.0060    0.0059    0.0043
w19      0.0002    0.0033    0.0035    0.0054

Note: Equation 3.6 and section 4 were used.
sets of 300 randomly selected data points. Figure 3A contains the associated empirical cumulative distributions of the weights. The estimates are very similar.

5.3 Modeling a Three-Input Nonlinear Function. In this example, we use the learning algorithm proposed in section 3 (see equation 3.3) and the improved method proposed in section 3.1 to model the nonlinear function
y = z² + z + sin(z),  where z = 3x_1 + 2x_2 - x_3.

We carried out 100 simulations employing 800 training data and 800 test data uniformly sampled from the input range [0, 1] × [0, 1] × [0, 1]. The values of the nonlinear function y were normalized in the interval [0.05, 0.95]. In the method with learnable neural functions, we employ the following polynomial family,

{φ_1(x), φ_2(x), . . . , φ_R(x)} = {x, x², . . . , x^R},    (4.2)
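The experimental data for this example can be generated as follows; min-max scaling over the generated sample is our assumption for the normalization into [0.05, 0.95], and the function name is ours.

```python
import numpy as np

def make_three_input_data(n=800, rng=0):
    """Sample y = z^2 + z + sin(z), z = 3*x1 + 2*x2 - x3, on [0, 1]^3."""
    x = np.random.default_rng(rng).uniform(0.0, 1.0, size=(n, 3))
    z = 3 * x[:, 0] + 2 * x[:, 1] - x[:, 2]
    y = z ** 2 + z + np.sin(z)
    # Normalize the targets into the interval [0.05, 0.95].
    y = 0.05 + 0.9 * (y - y.min()) / (y.max() - y.min())
    return x, y
```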
Table 6: Mean Error and Variability for the Three-Input Nonlinear Function.

                                       Mean Error   Variability
NN with known neural functions         1.155e-02    1.729e-04
NN with learnable neural functions     5.438e-03    6.509e-06
NN trained with Levenberg-Marquardt    1.189e-02    1.746e-04
as basic functions in equation 3.7. Several values of R and β (see equation 3.11) were tried. The results presented in this example were achieved with R = 7 and β = 0.4. Table 6 shows the mean error and the variability obtained for the test data over these 100 simulations. The results obtained by the Levenberg-Marquardt method are also shown in this table in order to compare our methods with a standard one. We tested the statistical significance of the differences between the mean errors of the systems in Table 6 using a t-test. As can be seen, the neural network with known neural functions and that trained using Levenberg-Marquardt obtained similar performance, whereas at a 99% significance level, the neural network with learnable functions outperformed the other two approaches. Figure 4 shows a graph of the real output against the desired output (for one of the simulations) and the first 100 samples of the curve predicted by the networks in Table 6.

5.4 Box-Jenkins Furnace Data. This example deals with the problem of modeling a gas furnace that was first presented by Box and Jenkins (1970). The modeled system consists of a gas furnace in which air and methane are combined to form a mixture of gases containing carbon dioxide (CO2). The furnace output, the CO2 concentration, is measured in the exhaust gases at the outlet of the furnace. The data set corresponds to a time series consisting of 296 pairs of observations of the form (u(t), y(t)), where u(t) represents the methane gas feed rate at time step t and y(t) is the concentration of CO2 in the gas outlet. The sampling time interval is 9 seconds. The goal is to predict y(t) based on {y(t-1), y(t-2), y(t-3), y(t-4), u(t-1), u(t-2), u(t-3), u(t-4), u(t-5), u(t-6)}. This reduces the number of effective data points to 290. Also, before training began, the output y(t) was normalized in the interval [0.05, 0.95].
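Constructing the regression matrix from the 296 observations can be sketched as below. With lags up to y(t-4) and u(t-6), the first predictable index is t = 6 (0-based), which leaves the 290 effective points mentioned above; the function name is ours.

```python
import numpy as np

def box_jenkins_lag_matrix(u, y):
    """Build inputs {y(t-1..4), u(t-1..6)} and targets y(t)."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    rows, targets = [], []
    for t in range(6, len(y)):            # u(t-6) is the deepest lag
        rows.append([y[t - k] for k in range(1, 5)] +
                    [u[t - k] for k in range(1, 7)])
        targets.append(y[t])
    return np.asarray(rows), np.asarray(targets)
```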
The methods proposed in section 3 (neural network with known neural functions, using equation 3.1) and section 3.1 (neural network with learnable neural functions) were employed to predict the CO2 concentration. In the latter, several values of R and β were tried. The results obtained for test data, using the polynomial family in equation 4.2 with R = 15 and β = 0.4, are shown in Table 7. Again, results obtained by the Levenberg-Marquardt method are included in this table. To test the significance of the difference between the mean errors of the systems in Table 7, a t-test was used. As in the previous
Figure 4: Comparison of the three-input nonlinear function. Neural network with given neural functions: (A) real against desired output and (B) observed (solid line) and predicted (dashed line) curves. Neural network with learnable neural functions: (C) real against desired output and (D) observed (solid line) and predicted (dashed line) curves. Levenberg-Marquardt method: (E) real against desired output and (F) observed (solid line) and predicted (dashed line) curves.
Table 7: Mean Error and Variability, Box-Jenkins Furnace Data.

                                       Mean Error   Variability
NN with known neural functions         2.084e-02    2.617e-04
NN with learnable neural functions     1.667e-02    1.794e-04
NN trained with Levenberg-Marquardt    2.090e-02    2.626e-04
example, at a 99% significance level, significant performance differences between the neural network with learnable functions and the other two approaches were found. Therefore, we can affirm that the improved method proposed in section 3.1 outperforms the first one introduced (in section 3), and so learning the neural functions enhances the performance of the system. Figure 5 shows a plot of the real output against the desired output and the source data versus the time series predicted by our two neural systems and the Levenberg-Marquardt method.

6 A Learning Speed Comparison
This section contains a comparative study of the learning speed of the method proposed in section 3 (see equation 3.4) and several high-performance algorithms. To this end, the examples described in sections 5.1, 5.2, 5.3, and 5.4 are used. The following algorithms were used in the experiments:

Levenberg-Marquardt backpropagation (LM) (Hagan & Menhaj, 1994)
Fletcher-Powell conjugate gradient (CGF) (Fletcher & Reeves, 1964; Hagan, Demuth, & Beale, 1996)
Polak-Ribière conjugate gradient (CGP) (Fletcher & Reeves, 1964; Hagan et al., 1996)
Conjugate gradient with Powell-Beale restarts (CGB) (Powell, 1977; Beale, 1972)
Scaled conjugate gradient (SCG) (Moller, 1993)
Quasi-Newton with Broyden, Fletcher, Goldfarb, and Shanno updates (BFGS) (Dennis & Schnabel, 1983)
One-step secant backpropagation (OSS) (Battiti, 1992)
Resilient backpropagation (RP) (Riedmiller & Braun, 1993)

For each of the data sets, 100 trials with randomly initialized weights were carried out using a Silicon Graphics Origin 200 computer. The networks were trained using the mean squared error function in all cases. The training was stopped when the error goal (4e-4, 3e-4, 7.7e-3, and 1.7e-2 for examples
Figure 5: Comparison of results for Box-Jenkins data. Neural network with known neural functions: (A) real against desired output and (B) observed (solid line) and predicted (dashed line) time series. Neural network with learnable neural functions: (C) real against desired output and (D) observed (solid line) and predicted (dashed line) time series. Levenberg-Marquardt method: (E) real against desired output, and (F) observed (solid line) and predicted (dashed line) time series.
Table 8: Training Time for Iris Data.

Algorithm   Mean Time (s)   Ratio     Minimum Time (s)   Maximum Time (s)   SD
SLE         0.0054          1.00000   0.0053             0.0059             0.000074
LM          0.1538          28.4800   0.1255             0.1837             0.012203
CGF         0.7929          146.830   0.1133             1.4683             0.269170
CGP         0.8885          164.540   0.1260             1.6335             0.320160
CGB         0.7166          132.700   0.1131             1.2714             0.211600
SCG         0.6719          124.430   0.1190             1.6152             0.200100
BFGS        1.3441          248.910   0.7065             1.7931             0.266610
OSS         7.5704          1401.93   1.4436             22.398             4.333000
RP          7.2496          1342.52   1.1705             19.511             4.351700
Table 9: Training Time for Breast Cancer Data.

Algorithm   Mean Time (s)   Ratio    Minimum Time (s)   Maximum Time (s)   SD
SLE         0.0245          1.0000   0.0238             0.0265             0.00038
LM          0.2351          9.6000   0.1989             0.3119             0.02264
CGF         1.3737          56.070   0.1219             12.592             1.62990
CGP         1.6820          68.650   0.1338             20.211             2.86940
CGB         1.1796          48.150   0.1226             7.3879             0.86438
SCG         0.7870          32.120   0.1296             1.9263             0.19669
BFGS        2.7466          112.11   0.6902             8.3371             0.84858
OSS         39.230          1601.2   2.7709             173.39             56.0295
RP          1.6540          67.510   0.6553             2.1685             0.31739
5.1, 5.2, 5.3, and 5.4, respectively) was attained, the maximum number of epochs (2000) was reached, or the gradient fell below a minimum threshold (1e-10). Tables 8 through 11 summarize the training speed results of the networks for all of the algorithms used. The tables contain the mean, minimum, and maximum times and the standard deviation obtained over 100 simulations. The fastest algorithm is used as the reference for the time ratio parameter, so that it reflects how many times slower, on average, each algorithm is than the fastest one. The method proposed in this article is identified by the acronym SLE (system of linear equations). As can be seen in the tables, even in the worst case, the SLE method is about 10 times faster than the next fastest algorithm. Finally, it is also important to note that the large speed differences between our proposed method and the others are due to the iterative nature of the latter, while the former requires only solving a linear system of equations or a programming problem.
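The source of SLE's speed advantage can be made concrete for a single-output network with the arctan neural function of section 5: measuring the error on the input scale turns training into the linear model arctan(y_s) ≈ w_0 + Σ_i w_i x_is, solved by one least-squares call instead of an iterative search. The helper names are ours, and this sketch covers only the sum-of-squared-errors variant (the minimax variant would be a linear program).

```python
import numpy as np

def sle_fit(X, y):
    """One-layer SLE-style training with f^{-1} = arctan: one linear solve."""
    Xb = np.hstack([np.ones((len(X), 1)), X])     # bias column for w_0
    w, *_ = np.linalg.lstsq(Xb, np.arctan(y), rcond=None)
    return w

def sle_predict(X, w):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.tan(Xb @ w)
```

Because the system is linear in the weights, the solve is non-iterative and has no local minima, which is the behavior the timing tables reflect.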
Table 10: Training Time for Nonlinear Function Data.

Algorithm   Mean Time (s)   Ratio    Minimum Time (s)   Maximum Time (s)   SD
SLE         0.0057          1.0000   0.0055             0.0066             0.00015
LM          0.1768          31.020   0.1642             0.1901             0.01122
CGF         0.8346          146.42   0.5026             1.3558             0.19879
CGP         0.8814          154.63   0.4800             1.8231             0.31403
CGB         0.7844          137.61   0.4426             1.3517             0.22015
SCG         0.5832          102.32   0.3701             0.8441             0.10821
BFGS        1.1666          204.67   0.7801             1.4787             0.13563
OSS         3.3554          588.67   1.6546             7.6841             1.03380
RP          3.1387          550.65   1.0870             4.5230             0.98790
Table 11: Training Time for Box-Jenkins Data.

Algorithm   Mean Time (s)   Ratio    Minimum Time (s)   Maximum Time (s)   SD
SLE         0.0243          1.0000   0.0224             0.0963             0.0073
LM          0.3178          13.080   0.2306             1.0212             0.0892
CGF         6.7133          276.27   2.5028             10.770             1.4806
CGP         13.658          562.05   4.2101             98.329             23.169
CGB         6.0728          249.91   4.2211             8.4621             0.9492
SCG         6.2298          256.37   3.8638             11.047             1.2715
BFGS        3.5819          147.40   2.4899             4.2659             0.3173
OSS         106.25          4372.4   69.439             113.09             9.6851
RP          31.572          1299.3   31.464             32.032             0.0785
7 Conclusion
In this article a new learning method for single-layer neural networks has been presented. This approach, which measures the errors in terms of the input scale instead of the output scale, allows the global optimum of the error surface to be reached, avoiding local minima. Two alternative training procedures were proposed for learning the weights. The first, which employs the sum of squared errors as a cost function, is based on a system of linear equations; the second, which minimizes the maximum absolute error function, is based on a linear programming problem. It is worth mentioning that, due to the techniques employed to minimize the objective function, these approaches exhibit very fast convergence. The proposed methods were then extended so that learning the neural functions is also possible. To this end, these functions were assumed to be linear combinations of invertible basic functions. In all the models presented in this article, a polynomial combination has been used, but many others can be considered (e.g., a Fourier expansion). Finally, the examples show that the proposed
methods present good performance on well-known data sets. Moreover, the approach that allows learning the neural functions appears superior to the one with fixed neural functions. The comparison with other learning algorithms shows that the proposed method clearly outperforms the others tested. Finally, an important issue to address is the correspondence between minima of the standard error (see equation 2.3) and the novel formulation of the error (see equation 3.2). This correspondence is exact when the error is zero, that is, when the estimated model coincides with the real model underlying the data. Under other conditions, this correspondence does not hold and needs further analysis, which will be addressed in future work.

Acknowledgments
We thank the Universities of Cantabria and Castilla-La Mancha, the Dirección General de Investigación Científica y Técnica (DGICYT) (project PB98-0421), Iberdrola, and the Xunta de Galicia (project PGIDT99COM10501 and the predoctoral grants programme 2000/01) for partial support of this research.

References

Auer, P., Herbster, M., & Warmuth, M. K. (1996). Exponentially many local minima for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 316–322). Cambridge, MA: MIT Press.
Battiti, R. (1992). First and second order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4(2), 141–166.
Beale, E. M. L. (1972). A derivation of conjugate gradients. In F. A. Lootsma (Ed.), Numerical methods for nonlinear optimization. London: Academic Press.
Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis, forecasting and control. San Francisco: Holden Day.
Brady, M., Raghavan, R., & Slawny, J. (1989). Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
Budinich, M., & Milotti, E. (1992). Geometrical interpretation of the backpropagation algorithm for the perceptron. Physica A, 185, 369–377.
Coetzee, F. M., & Stonick, V. L. (1996). On uniqueness of weights in single-layer perceptrons. IEEE Transactions on Neural Networks, 7(2), 318–325.
Dennis, J. E., & Schnabel, R. B. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice Hall.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Fletcher, R., & Reeves, C. M. (1964). Function minimization by conjugate gradients. Computer Journal, 7, 149–154.
Gori, M., & Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1), 76–85.
Hagan, M. T., Demuth, H. B., & Beale, M. H. (1996). Neural network design. Boston: PWS Publishing.
Hagan, M. T., & Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6), 989–993.
Moller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533.
Powell, M. J. D. (1977). Restart procedures for the conjugate gradient method. Mathematical Programming, 12, 241–254.
Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks. San Francisco.
Sontag, E., & Sussmann, H. J. (1989). Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3, 91–106.
Sontag, E., & Sussmann, H. J. (1991). Back propagation separates where perceptrons do. Neural Networks, 4, 243–249.
Widrow, B., & Lehr, M. A. (1990). Thirty years of adaptive neural networks: Perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415–1442.

Received May 11, 2001; accepted August 22, 2001.
LETTER

Communicated by Jürgen Schmidhuber

MLP in Layer-Wise Form with Applications to Weight Decay

Tommi Kärkkäinen
[email protected]
Department of Mathematical Information Technology, University of Jyväskylä, P.O. Box 35 (Agora), FIN-40351 Jyväskylä, Finland
A simple and general calculus for the sensitivity analysis of a feedforward MLP network in a layer-wise form is presented. Based on the local optimality conditions, some consequences for the least-mean-squares learning problem are stated and further discussed. Numerical experiments with the formulation and comparison of different weight decay techniques are included.
1 Introduction
There are many assumptions and intrinsic conditions behind well-known computational techniques that are sometimes hidden from a practitioner. In this work, we try to shed light on some of the issues related to multilayered perceptron networks (MLPs). The point of view here is mainly based on the theory and practice of optimization, with special emphasis on sensitivity analysis (computation of the gradient), scale balancing, convexity, and smoothness. First, we propose a simplified way to derive the error backpropagation formulas for MLP training in a layer-wise form. The basic advantage of the layer-wise formalism is that the optimality system is presented in a compact form that can be readily exploited in an efficient computer realization. Moreover, due to the clear description of the optimality conditions, we are able to derive some consequences and interpretations concerning the final structure of the trained network. These results have direct applications to different weight decay techniques, which are presented and tested through numerical experiments. In section 2, we introduce the algebraic formalism and the original learning problem. In section 3, we compute the optimality conditions for the network learning and derive and discuss some of their consequences. Finally, in section 4 we present numerical experiments studying different regularization techniques and make some observations based on the computational results.

Neural Computation 14, 1451–1480 (2002)  © 2002 Massachusetts Institute of Technology
2 Preliminaries
Action of the multilayered perceptron in a layer-wise form is given by (Rojas, 1996; Hagan & Menhaj, 1994)

o^0 = x,    o^l = F^l(W^l ô^(l-1))  for l = 1, . . . , L.    (2.1)
We have placed the layer number (starting from zero for the input) as an upper index. By ^ we indicate the addition of bias terms, and F^l(·) denotes the usual componentwise activation on the lth level. The dimensions of the weight matrices are given by dim(W^l) = n_l × (n_{l-1} + 1), l = 1, . . . , L, where n_0 is the length of an input vector x, n_L the length of the output vector o^L, and n_l, 0 < l < L, determine the sizes (numbers of neurons) of the hidden layers. We notice that the activation of m neurons can be equivalently represented by using a diagonal function matrix F = F(·) = Diag{f_i(·)}_{i=1}^{m} of the form

F = [ f_1(·)  . . .    0     ]
    [   .     . . .    .     ]    (2.2)
    [   0     . . .   f_m(·) ]

A function matrix, supplied with the natural way to define the matrix-vector product y = F(v), y_i = Σ_{j=1}^{m} f_ij(v_j), yields the usual activation approach. However, any enlargement of the function matrix to contain nondiagonal function entries as well offers interesting possibilities for generalizing the basic MLP architecture. This topic is beyond the scope of the work presented here, although both the sensitivity analysis and its consequences in what follows remain valid in this case too.

Using given training data {x_i, y_i}_{i=1}^{N}, x_i ∈ R^{n_0} and y_i ∈ R^{n_L}, the unknown weight matrices {W^l}_{l=1}^{L} are determined as a solution of the optimization problem

min_{ {W^l}_{l=1}^{L} } J({W^l}),    (2.3)

where

J({W^l}) = (1/2N) Σ_{i=1}^{N} ‖N({W^l})(x_i) - y_i‖²    (2.4)

is the least-mean-squares (LMS) cost functional for a simplified architecture of MLP containing only a linear transformation in the final layer. That is, f_i^L(x) = x for all 1 ≤ i ≤ n_L, and o_i^L = N({W^l})(x_i) = W^L ô_i^(L-1). The relation
between this choice and the original form, equation 2.1, will be considered in sections 3.3 and 3.4. In equation 2.4, ‖·‖ denotes the usual Euclidean norm, which is induced by the l2 inner product (v, w) = w^T v. Preprocessing of the training data {x_i, y_i} has been considered extensively, for example, by LeCun, Bottou, Orr, and Müller (1998). From the optimization point of view, an essential property of any problem is to have all unknowns on the same scale, because a gradient with components of different orders of magnitude can yield nonbalanced updates in the training algorithm (Nocedal & Wright, 1999). This usually leads to nonconvergence or to a large deviation of weights, decreasing the fault tolerance of the trained network (Cavalieri & Mirabella, 1999). Therefore, we require that all activation functions {F^l} have the same range and that all of the training data are prescaled into this range. Then all layers of the network treat vectors with components of the same order of magnitude. Notice, however, that using tanh activation and the proposed prescaling, the average of the minimum and maximum values of each feature in {x_i} is transformed to zero. Then all records containing the zero value become insensitive to the corresponding column in the weight matrix W^1. Hence, for features having a symmetric and peak-like distribution, the prescaling may result in the nonuniqueness of W^1 (and of the subsequent matrices W^l, 1 < l ≤ L). Naturally, one can study the existence of such problems through the distribution (histogram) of the individual prescaled features.
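The layer-wise action of equation 2.1, with the linear final layer assumed in equation 2.4, can be sketched as follows; tanh is used as the common hidden activation, in line with the prescaling discussion above.

```python
import numpy as np

def mlp_forward(weights, x):
    """Equation 2.1: o^0 = x, o^l = F^l(W^l ô^(l-1)), linear final layer.

    weights = [W^1, ..., W^L] with dim(W^l) = n_l x (n_{l-1} + 1).
    """
    o = np.asarray(x, dtype=float)
    for l, W in enumerate(weights, start=1):
        o_hat = np.concatenate([[1.0], o])   # the ^ operator: prepend bias
        v = W @ o_hat
        o = v if l == len(weights) else np.tanh(v)
    return o
```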
An essential part of any textbook on neural networks consists of the derivation of error-backpropagation formulas for network training (Bishop, 1995; Reed & Marks, 1999; Rojas, 1996). What makes such sensitivity analysis messy is the consecutive application of the chain rule in an index jungle. Here, we describe another approach that uses the Lagrangian treatment of equality-constrained optimization problems (Bertsekas, 1982) and circumvents these technicalities, allowing one to derive the necessary optimality conditions in a straightforward way. The idea behind our treatment is to work on the same level of abstraction, the layer-wise form, that is used for the initial description of the MLP.

There are many techniques for solving optimization problems like equation 2.3 without the need for precise formulas for the derivatives. One class of techniques consists of the so-called gradient-free optimization methods as presented, for example, in Osman and Kelly (1996) and Bazaraa, Sherali, and Shetty (1993). The computational efficiency of these approaches, however, cannot compete with methods using gradient information during the search procedure. Another interesting possibility for avoiding explicit sensitivity analysis is to use automatic differentiation (AD), which enables the computation of derivatives as a (user-hidden) by-product of the cost function evaluation (Griewank, 2000). AD allows fast and straightforward development and simulation of different architectures and learning problems for MLPs, but at the expense of increased storage requirements and computing time compared to the fully exploited analytical approach (see the discussion after theorem 2 in section 3.2). Moreover, a clear and precise description of the optimality conditions is necessary to be able to analyze the properties of trained networks (cf. sections 3.3 and 3.4).

3.1 MLP with One Hidden Layer. To simplify the presentation, we start with an MLP with only one hidden layer. Then any local solution (W^1*, W^2*) of the minimization problem, equation 2.3, is characterized by the conditions

∇_(W^1,W^2) J(W^1*, W^2*) = [ ∇_{W^1} J(W^1*, W^2*) ; ∇_{W^2} J(W^1*, W^2*) ] = [ O ; O ].    (3.1)

Here, ∇_{W^l} J = (∂J/∂w^l_ij)_{i,j}, l = 1, 2, are given in a similar matrix form as the
unknown weight matrices. Hence, our next task is to derive the derivatives with respect to these data structures (cf. appendix B in Diamantaras & Kung, 1996; Hagan & Menhaj, 1994). For this analysis, we presuppose that all activation functions in the function matrix F = F^1 are differentiable. We start the derivation by stating some simple lemmas, whose proofs are given in appendix 4.2. We note that the proposed approach is not restricted to the LMS error function. For other differentiable cost functionals, like softmax and cross-entropy, one can use exactly the same technique for a simple derivation of the necessary optimality conditions in a layer-wise form.

Lemma 1. Let v ∈ R^{m_1} and y ∈ R^{m_2} be given vectors. The gradient matrix ∇_W J(W) ∈ R^{m_2 × m_1} for the functional J(W) = (1/2)‖Wv - y‖² is of the form

∇_W J(W) = [Wv - y] v^T.

Lemma 2. Let W ∈ R^{m_2 × m_1} be a given matrix, y ∈ R^{m_2} a given vector, and F = Diag{f_i(·)}_{i=1}^{m_1} a given diagonal function matrix. The gradient vector ∇_u J(u) ∈ R^{m_1} for the functional J(u) = (1/2)‖W F(u) - y‖² reads as

∇_u J(u) = (W F'(u))^T [W F(u) - y] = Diag{F'(u)} W^T [W F(u) - y].

Lemma 3. Let W̄ ∈ R^{m_2 × m_1} be a given matrix, F = Diag{f_i(·)}_{i=1}^{m_1} a given diagonal function matrix, and v ∈ R^{m_0}, y ∈ R^{m_2} given vectors. The gradient matrix ∇_W J(W) ∈ R^{m_1 × m_0} for the functional J(W) = (1/2)‖W̄ F(Wv) - y‖² is of the form

∇_W J(W) = Diag{F'(Wv)} W̄^T [W̄ F(Wv) - y] v^T.
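The matrix calculus in these lemmas is easy to validate numerically. The sketch below checks lemma 3 against central finite differences, taking componentwise tanh for F (our choice for illustration; any differentiable activation works).

```python
import numpy as np

def grad_lemma3(W, W_bar, v, y):
    """Lemma 3 with F = tanh: Diag{F'(Wv)} W_bar^T [W_bar F(Wv) - y] v^T."""
    u = W @ v
    r = W_bar @ np.tanh(u) - y                     # residual W_bar F(Wv) - y
    # Diag{F'(u)} acts as elementwise multiplication by 1 - tanh(u)^2.
    return np.outer((1.0 - np.tanh(u) ** 2) * (W_bar.T @ r), v)

def J(W, W_bar, v, y):
    return 0.5 * np.sum((W_bar @ np.tanh(W @ v) - y) ** 2)
```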
Now we are ready to state the actual result for the perceptron with one hidden layer.

Theorem 1. Gradient matrices ∇_{W^2} J(W^1, W^2) and ∇_{W^1} J(W^1, W^2) for the cost functional, equation 2.4, are of the form

i. ∇_{W^2} J(W^1, W^2) = (1/N) Σ_{i=1}^{N} [W^2 F̂(W^1 x̂_i) - y_i] [F̂(W^1 x̂_i)]^T = (1/N) Σ_{i=1}^{N} e_i [F̂(W^1 x̂_i)]^T,

ii. ∇_{W^1} J(W^1, W^2) = (1/N) Σ_{i=1}^{N} Diag{F'(W^1 x̂_i)} (W_1^2)^T [W^2 F̂(W^1 x̂_i) - y_i] x̂_i^T = (1/N) Σ_{i=1}^{N} Diag{F'(W^1 x̂_i)} (W_1^2)^T e_i x̂_i^T.

In formula ii, W_1^2 is the submatrix (W^2)_{i,j}, i = 1, . . . , n_2, j = 1, . . . , n_1, which is obtained from W^2 by removing the first column W_0^2 containing the bias nodes.

Proof. Formula i is a direct consequence of lemma 1. Moreover, due to the definition of the extension operator ^ we have, for all 1 ≤ i ≤ N,

W^2 F̂(W^1 x̂_i) - y_i = [W_0^2  W_1^2] [1 ; F(W^1 x̂_i)] - y_i = W_0^2 + W_1^2 F(W^1 x̂_i) - y_i.    (3.2)

Using equation 3.2 and lemma 3 componentwise for the cost functional, equation 2.4, shows formula ii and ends the proof.

3.2 MLP with Several Hidden Layers. Next, we generalize the previous analysis to the case of several hidden layers.
Lemma 4. Let W̄ ∈ R^{m_3 × m_2} and W̃ ∈ R^{m_2 × m_1} be given matrices, F̃ = Diag{f̃_i(·)}_{i=1}^{m_1} and F̄ = Diag{f̄_i(·)}_{i=1}^{m_2} given diagonal function matrices, and v ∈ R^{m_0}, y ∈ R^{m_3} given vectors. The gradient matrix ∇_W J(W) ∈ R^{m_1 × m_0} for the functional J(W) = (1/2)‖W̄ F̄(W̃ F̃(Wv)) - y‖² is of the form

∇_W J(W) = Diag{F̃'(Wv)} W̃^T Diag{F̄'(W̃ F̃(Wv))} W̄^T [W̄ F̄(W̃ F̃(Wv)) - y] v^T.
Theorem 2. Gradient matrices ∇_{W^l} J({W^l}), l = L, . . . , 1, for the cost functional, equation 2.4, read as

∇_{W^l} J({W^l}) = (1/N) Σ_{i=1}^{N} δ_i^l [ô_i^(l-1)]^T,    (3.3)

where

δ_i^L = e_i = W^L ô_i^(L-1) - y_i,
δ_i^l = Diag{(F^l)'(W^l ô_i^(l-1))} (W_1^(l+1))^T δ_i^(l+1).    (3.4)

Proof. Apply lemma 4 inductively.
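Equations 3.3 and 3.4 are the entire backward loop. Below is a sketch with tanh hidden activations and the linear final layer of equation 2.4, so that (F^l)' = 1 - (o^l)² can be computed from the stored layer outputs; the function name is ours.

```python
import numpy as np

def mlp_gradients(weights, X, Y):
    """Gradients of the LMS cost, equation 2.4, via theorem 2.

    weights = [W^1, ..., W^L]; X is (N, n_0), Y is (N, n_L); hidden
    layers use tanh, the final layer is linear.
    """
    N = len(X)
    grads = [np.zeros_like(W) for W in weights]
    for x, y in zip(X, Y):
        # Forward loop: store the extended outputs ô^0, ..., ô^(L-1).
        o_hat = [np.concatenate([[1.0], x])]
        for W in weights[:-1]:
            o_hat.append(np.concatenate([[1.0], np.tanh(W @ o_hat[-1])]))
        # Backward loop, equations 3.3 and 3.4.
        delta = weights[-1] @ o_hat[-1] - y        # delta^L = e_i
        for l in range(len(weights) - 1, -1, -1):
            grads[l] += np.outer(delta, o_hat[l]) / N
            if l > 0:
                o = o_hat[l][1:]                   # o^l, without the bias 1
                # tanh derivative from stored outputs; [:, 1:] drops the
                # bias column, i.e., uses W_1^(l+1).
                delta = (1.0 - o ** 2) * (weights[l][:, 1:].T @ delta)
    return grads
```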
The denition of d li contains the backpropagation of the output error in a layer-wise form. The compact presentation of the optimality system can be readily exploited in the implementation. More precisely, computation of 0 ( ) the activation derivatives ( F l ) ( W l oO i l¡1 ) can, when using the usual logistic sigmoid or hyperbolic tangent functions, be realized using layer outputs o li without additional storage. Hence, in the forward loop, it is enough to store the outputs of different layers for the backward gradient loop, where these same vectors can be overwritten by d li when the whole operation in equation 3.4 is implemented using a single loop. In modern workstations, such combination of operations yielding minimal amount of loops through the memory can decrease the computing time signicantly (K¨arkk¨ainen & Toivanen, 2001). 3.3 Some Corollaries. Next, we derive some corollaries of the layerwise optimality conditions. All of these results follow from having a linear transformation with biases in the nal layer of MLP. Moreover, whether one applies on-line or batch training makes no difference if the optimization problem 2.3 is solved with high accuracy, because then both methods yield ¤ ¤ rW l J (fW l g) ’ O for the local solution fW l g. Corollary 1.
For a locally optimal MLP network satisfying the conditions in Theorem 2:

i. The average error $\frac{1}{N}\sum_{i=1}^{N} e_i^*$ is zero.

ii. The correlation between the error vectors and the action of layer $L-1$ is zero.

Proof. The optimality condition $\nabla_{W^L} J(\{W^{l,*}\}) = \frac{1}{N}\sum_{i=1}^{N} e_i^*\, [\hat{o}_i^{(L-1),*}]^T = O$ (with the abbreviation $\hat{o}_i^{(L-1)} = \hat{o}_i^{(L-1),*}$) in Theorem 2 can be written in the
MLP with Applications to Weight Decay
nonextended form as

$$\frac{1}{N}\sum_{i=1}^{N} e_i^* \left[\,1 \;\; (o_i^{(L-1)})^T\,\right] = O. \qquad (3.5)$$

By taking the transpose of this, we obtain

$$\frac{1}{N}\sum_{i=1}^{N} \begin{bmatrix} 1 \\ o_i^{(L-1)} \end{bmatrix} (e_i^*)^T = \frac{1}{N}\sum_{i=1}^{N} \begin{bmatrix} (e_i^*)^T \\ o_i^{(L-1)}\, (e_i^*)^T \end{bmatrix} = O. \qquad (3.6)$$
This readily shows both results.

In the next corollary, we recall some conditions concerning the final weight matrix $W^L$. We note that the assumption on nonsingularity below can be relaxed to pseudo-invertibility, which has been widely used to generate new training algorithms for MLP networks (Di Martino, Fanelli, & Protasi, 1996; Wang & Chen, 1996).

Corollary 2. If the autocorrelation matrix $A = \frac{1}{N}\sum_{i=1}^{N} \hat{v}_i \hat{v}_i^T$ of the (extended) final layer inputs $\hat{v}_i = \hat{o}_i^{(L-1)}$ is nonsingular, $W^L$ can be recovered from the formula

$$W^L = B A^{-1} \quad \text{for} \quad B = \frac{1}{N}\sum_{i=1}^{N} y_i\, \hat{v}_i^T. \qquad (3.7)$$

Furthermore, if $A^*$ for a minimizer $\{W^{l,*}\}_{l=1}^L$ of equation 2.4 is nonsingular, then $W^{L,*}$ is unique.

Proof. $W^L$ satisfying the system
$$\frac{1}{N}\sum_{i=1}^{N} [W^L \hat{o}_i^{(L-1)} - y_i]\, [\hat{o}_i^{(L-1)}]^T = O \qquad (3.8)$$

is independent of $i$. Hence, equation 3.8 can be written as

$$W^L\, \frac{1}{N}\sum_{i=1}^{N} \hat{v}_i \hat{v}_i^T = \frac{1}{N}\sum_{i=1}^{N} y_i\, \hat{v}_i^T \;\Leftrightarrow\; W^L A = B. \qquad (3.9)$$
Multiplying both sides from the right with $A^{-1}$ gives equation 3.7 and ends the proof.
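Equation 3.7 is simply the normal-equations solution of a linear least-squares problem for the final layer. A minimal sketch with hypothetical random data (the names V, A, and B follow the corollary; for zero-residual targets the recovery is exact):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_prev, n_out = 50, 4, 2       # hypothetical sizes, chosen so that N > n_prev + 1

# Extended final-layer inputs v_i = [o_i^(L-1); 1], stored as columns of V.
V = np.vstack([rng.standard_normal((n_prev, N)), np.ones((1, N))])
W_true = rng.standard_normal((n_out, n_prev + 1))
Y = W_true @ V                    # zero-residual targets for the check

A = (V @ V.T) / N                 # autocorrelation matrix of the extended inputs
B = (Y @ V.T) / N
W_L = B @ np.linalg.inv(A)        # equation 3.7: W^L = B A^{-1}
assert np.allclose(W_L, W_true)
```

Note that N > n_prev + 1 is needed here: with fewer samples A would be singular, as the rank argument below explains.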
Let us add one further observation concerning the above result. Because $\dim(\mathrm{span}(\hat{v}_i)) = 1$, each matrix $\hat{v}_i \hat{v}_i^T$ has only one nonzero eigenvalue $\lambda = \hat{v}_i^T \hat{v}_i$ with the corresponding eigenvector $\hat{v}_i$. This means that $1 \le \dim(\mathrm{span}(A)) \le \min(N, n_{L-1} + 1)$. Hence, if $N < n_{L-1} + 1$, then $A$ must be singular and $W^{L,*}$ cannot be unique. On the other hand, if $N > n_{L-1} + 1$, then all vectors $\hat{v}_i$ cannot be linearly independent. This shows that there exists a relation between the amount of learning data and an appropriate size of the last hidden layer for the LMS learning problem.

3.4 Some Consequences. We conclude this section by giving a list of observations and comments based on the previous results.
3.4.1 Final Layer with or Without Activation. If the MLP also contains a final layer activation $F^L(W^L \hat{o}_i^{(L-1)})$, then it follows from Lemma 3 (choose $\bar{W} = I$) that equation 3.3 replaced with

$$d_i^L = \mathrm{Diag}\{(F^L)'(W^L \hat{o}_i^{(L-1)})\}\, e_i = D_i\, e_i \qquad (3.10)$$

gives the corresponding sensitivity with respect to $W^L$. In this case, we have in Corollary 1, instead of equation 3.6,

$$\frac{1}{N}\sum_{i=1}^{N} \begin{bmatrix} (D_i e_i^*)^T \\ o_i^{(L-1)}\, (D_i e_i^*)^T \end{bmatrix} = O.$$

Hence, the two formulations with and without activating the final layer yield locally the same result for the zero-residual problem $e_i^* = 0$ for all $1 \le i \le N$. This slightly generalizes the corresponding result derived by Moody and Antsaklis (1996), where bijectivity of the final activation $F^L$ was also assumed.

The use of 0-1 coding for the desired output vectors in classification forces the weight vector of a 1-neuron into the so-called saturation area of a sigmoidal activation, where the derivative is nearly zero (Vitela & Reifman, 1997). For this reason, the error function with a final layer activation has a nearly flat region around such points, whereas the final linear layer has no such problems. This is also evident from equation 3.10, which holds independently of $e_i^*$ and $o_i^{(L-1)}$ for $D_i = O$. Indeed, the tests reported by Gorse, Shepherd, and Taylor (1997) (see also Figure 8.7 in Reed & Marks, 1999; de Villiers & Barnard, 1992) suggest that a network with a sigmoid in the final layer has more local minima than one with a linear final layer. Furthermore, Japkowicz, Hanson, and Gluck (2000) pointed out that the reconstruction error surface of a sigmoidal autoassociator consists of multiple local valleys, contrary to the linear one.
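The saturation effect behind equation 3.10 is easy to illustrate numerically; the pre-activation values below are hypothetical:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical pre-activations of a "1-neuron"; 0-1 coded targets push
# them toward the saturation area of the sigmoid.
a = np.array([0.0, 2.0, 6.0, 12.0])
deriv = sigmoid(a) * (1.0 - sigmoid(a))   # the diagonal factor D_i of equation 3.10
assert deriv[0] == 0.25                   # most informative around zero
assert deriv[-1] < 1e-5                   # nearly zero: D_i almost annihilates e_i
```

In the saturated regime the factor D_i nearly vanishes, so the error signal is suppressed regardless of its size, which is exactly the flat-region problem described above.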
3.4.2 The Basic Interpretation. We recall that the transformation before the last linear layer with biases had no influence whatsoever on the previous formal consequences. This suggests the following interpretation of the MLP action with the proposed architecture: the nonlinear hidden layers produce the universal approximation capability of the MLP, while the final linear layer compensates the hidden action with the desired output in an uncorrelated and error-averaging manner (cf. Japkowicz et al., 2000).

3.4.3 The Basic Consequence. The simplest model of noise in regression is to assume that the given targets are generated by

$$y_i = w(x_i) + \varepsilon_i, \qquad (3.11)$$
where $w(x)$ is the unknown stationary function and the $\varepsilon_i$'s are sampled from an underlying noise process. For equation 3.11, it follows from Corollary 1i that a locally optimal MLP network $\mathcal{N}(\{W^{l,*}\})$ satisfies

$$\frac{1}{N}\sum_{i=1}^{N} \left[ \mathcal{N}(\{W^{l,*}\})(x_i) - w(x_i) \right] = \frac{1}{N}\sum_{i=1}^{N} \varepsilon_i. \qquad (3.12)$$
Hence, this shows that every $\mathcal{N}(\{W^{l,*}\})$ treats optimally gaussian noise with zero mean (given enough samples). This result is not valid for other error functions or with a final layer activation (except in the impractical zero-residual case), but it remains valid for networks with input enlargements (e.g., Flake, 1998), output enlargements (Caruna, 1998), hidden layer modifications (e.g., the linearly augmented feedforward network as proposed by van der Smagt & Hirzinger, 1998), and convex combinations of different locally optimal networks (averaged ensembles, e.g., Liu & Yao, 1999; Horn, Naftaly, & Intrator, 1998) in all cases when the final bias is left untouched in the architecture.

3.4.4 Implicit Prior and Its Modification. Many authors writing on ANNs note that changing the prior frequency of different samples in the training data to favor the rare ones may improve the performance of the obtained network (e.g., LeCun et al., 1998; Yaeger, Webb, & Lyon, 1998). Corollary 1i gives a precise explanation of how such a modification alters the final result. For stochastic on-line learning, the prior frequency can be altered by controlling the feeding of examples, but for batch learning, one needs an explicit change in the error function for this purpose. For example, consider the classification problem with learning data from $K$ different classes $\{C_k\}_{k=1}^K$ so that $N = \sum_{k=1}^K N_k$, where $N_k$ denotes the number of samples from the class $C_k$. To have an equal prior probability $1/K$ for all classes in the classifier (instead of $N_k/N$ for the $k$th class), one needs to adjust the LMS cost functional to

$$J(\{W^l\}_{l=1}^L) = \sum_{k=1}^{K} \frac{1}{2 K N_k} \sum_{i \in C_k} \| W^L \hat{o}_i^{(L-1)} - y_i \|^2.$$
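The reweighted cost above can be sketched as follows; balanced_lms is an illustrative helper (not code from the paper) that weights the squared error of a class-k sample by 1/(2*K*N_k):

```python
import numpy as np

def balanced_lms(errors, labels):
    """LMS cost with equal class priors: a sample from class k is weighted
    by 1/(2*K*N_k) instead of the usual 1/(2*N)."""
    classes, counts = np.unique(labels, return_counts=True)
    K = len(classes)
    n_k = dict(zip(classes.tolist(), counts.tolist()))
    return sum(float(e @ e) / (2 * K * n_k[c]) for e, c in zip(errors, labels))
```

With this weighting, a class with a single sample contributes as much to the cost as a class with many samples, realizing the equal prior 1/K.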
Hence, using a nonconstant weighting of the learning data in the cost functional incorporates prior information into the learning problem. For example, a time-series prediction could be based on a larger weighting of the more recent samples, especially if one tries to simulate a slowly varying dynamical process (Park, El-Sharkawi, & Marks, 1991). It is also straightforward to improve the LMS learning problem when more knowledge on the variance of the output data is available (MacKay, 1992; van de Laar & Heskes, 1999). Finally, a locally weighted linear regression also adaptable to MLP learning is introduced by Schaal and Atkeson (1998).
3.4.5 Relation to Early Stopping. A popular way to try to improve the generalization power of an MLP is to use early stopping (cutting, termination) of the learning algorithm (Bishop, 1995; Tetko & Villa, 1997; Prechelt, 1998). In most cases, early stopping (ES) is due to a cross-validation technique, but basically it also happens every time a learning problem is solved inexactly, for example, because of a fixed number of epochs or a bad learning rate. Formulating general and robust conditions to stop different optimization algorithms prematurely is problematic and usually leads to a very large number of tests for validating different networks (Gupta & Lam, 1998). Moreover, Cataltepe, Abu-Mostafa, and Magdon-Ismail (1999) showed that if models with the same training error are chosen with equal probability, then the lowest generalization error is obtained by choosing the model corresponding to the training error minimum. This result is valid globally for linear models and locally, around the training error minimum, for nonlinear models. The success of ES is sometimes argued to be due to producing smoother results when the network is initialized with small random numbers in the almost linear region of a sigmoidal activation function. This is certainly vague, because different optimization methods follow different search paths, and nothing guarantees that even an overly hastily stopped training algorithm produces smooth networks. Hence, we believe that the role of ES is more related to equation 3.12, which for an inexact solution is not valid. Namely, for a nongaussian error distribution, ES may decrease the significance of heavy error tails and, due to this, produce a result that generalizes better. Especially in this case, ES and weight decay (WD) are not alternatives to each other, because ES may improve the learning problem and WD can favor simpler models during learning.
Notice, however, that because the usual bias-variance decomposition (Geman, Bienenstock, & Doursat, 1992) of the expected generalization error is also based on the least-squares estimation, the least-squares-based cross-validation technique may not be appropriate for a nongaussian error in the validation set. Changing the underlying gaussian assumption on noise should instead lead to the derivation of new cost functionals for the learning problem (Chen & Jain, 1994; Liano, 1996).
4 Numerical Results
Here, we describe numerical experiments based on the proposed techniques. All experiments are performed on an HP9000/J280 workstation (180 MHz PA8000 CPU), and the implementation is based on F77 (optimization and MLP realization) and Matlab for data pre- and postprocessing. (More comprehensive coverage of the computed examples is presented in Kärkkäinen, 2000.) As optimization software, we apply the limited-memory quasi-Newton subroutine L-BFGS (Byrd, Lu, & Nocedal, 1995), which uses a sparse approximation of the BFGS-formula-based inverse of the Hessian matrix and is intended for solving large nonlinear optimization problems efficiently. As a stopping criterion for the optimization, we use
$$\frac{J^k - J^{k+1}}{\max\{|J^k|,\, |J^{k+1}|,\, 1\}} \le 10^6 \cdot \text{epsmch},$$
where epsmch is the machine epsilon (approximately $10^{-16}$ in the present case). This choice reflects our intention to solve the optimization problem with (unnecessarily) high precision for test purposes. The main ingredient in the L-BFGS software is that, due to the limited-memory Hessian update, the computational complexity is only $O(n)$, where $n$ is the total number of unknowns. For ordinary quasi-Newton methods, the $O(n^2)$ consumption due to the full Hessian has been one of the main reasons preventing the application of these methods to learning problems with larger networks. There exists a large variety of tests comparing backpropagation (gradient descent with constant learning rate), conjugate gradient, and second-order methods (Gauss-Newton, Levenberg-Marquardt, Hessian approximation, and quasi-Newton) for MLP training (e.g., Bishop, 1995; Hagan & Menhaj, 1994; Magoulas, Vrahatis, & Androulakis, 1999; McKeown, Stella, & Hall, 1997; Wang & Lin, 1998). The difficulty of drawing reliable conclusions about the quality of different methods is that, in addition to how to solve it (training method), it is as (or even more) important to consider what to solve (learning problem) to connect our approach to an application represented by a finite set of samples. According to Gorse et al. (1997), a quasi-Newton method can survey a larger number of local minima than the BP and CG methods, but to truly enforce a search through different minima when using a local gradient-based method, one must either start the training from multiple initial configurations or apply some global optimization strategy as an outer iteration.

In the numerical experiments, we consider only simple examples, because the emphasis here is on testing the algorithms and different formulations rather than on complex applications of MLPs. Therefore, we also restrict ourselves to the perceptron with only one hidden layer. Concerning the discussion of one versus more hidden layers, we refer to Reed and Marks (1999) and Tamura and Tateishi (1997). Finally, notice that more complex (nonconvex) optimization problems with an increased number of local minima must be solved when training a network with several hidden layers. In order to generate less regular nonlinear transformations without affecting the scale of weights when the hidden layer is enlarged, we choose k-tanh functions of the form
$$t_k(a) = \frac{2}{1 + \exp(-2ka)} - 1, \qquad k = 1, \ldots, n_1, \qquad (4.1)$$
to activate the hidden neurons (McLean, Bandar, & O'Shea, 1998). Although we at the same time introduce some kind of ordering for the hidden layer by using a different activation function for each neuron, the symmetry problem related to the hidden neurons (e.g., Bishop, 1995) remains, as can be seen by a simple rescaling argument. The use of an even more general mixture of different activation functions in the hidden layer is suggested by Zhang and Morris (1998). Finally, increasing the nonsmoothness of the hidden activation functions, and hence of the whole MLP mapping, decreases the regularity of the learning problem and thus also the convergence rate of first- and second-order optimization methods (Nocedal & Wright, 1999).

4.1 Approximation of a Noisy Function.
Example 1. Reconstruction of the function $f(x) = \sin(2\pi x)$, $x \in I = [0, 2\pi]$, corrupted with normally distributed (quasi-)random noise. The input data $\{x_i\}$ result from the uniform discretization of the interval $I$ with step size $h = 0.1$. Points in the output data are taken as $y_i = f(x_i) + \delta e_i$ for $\delta = 0.3$ and $e_i \in N(0, 1)$. Altogether, we have in this example $N = 63$ and $n_2 = n_0 = 1$.
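A sketch of the Example 1 data set, assuming NumPy's normal generator as a stand-in for the (quasi-)random noise and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed; the paper's noise realization is unknown
h, delta = 0.1, 0.3
x = h * np.arange(63)            # N = 63 uniform grid points with step h on I = [0, 2*pi]
y = np.sin(2 * np.pi * x) + delta * rng.standard_normal(x.size)
```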
In the following experiments, we have solved the optimization problem, equation 2.3, with the learning data prescaled into $[-1, 1]$, starting from 10 random initial guesses in the range $(-1, 1)$ for the weight matrices $(W^1, W^2)$. For an overview and study of different initialization techniques, we refer to Thimm and Fiesler (1997) and LeCun et al. (1998). Let us make some comments based on Table 1:

Local minima: Even for the smallest network ($n_1 = 2$), and especially for larger ones, there exist a lot of local minima in the optimization problem. Moreover, the local minima are strict in the sense that they correspond to truly different values of the cost functional and not just to different representations (symmetries) of the same MLP transformation.

Condition i in Corollary 1: Is valid with a precision related to the stopping criterion.
Table 1: Computational Results in Example 1 Without Regularization.

          J*                         |ē*|                    Its                   CPU
n1   Min        Max     Mean    Min    Max    Mean    Min    Max    Mean    Min    Max    Mean
2    0.014 (2)  0.032   0.018   2e-8   4e-6   2e-6    41     234    126     0.23   0.52   0.39
3    0.013      0.024   0.014   4e-7   2e-5   4e-6    120    1079   427     0.42   0.80   0.60
4    0.012 (2)  0.014   0.013   1e-6   8e-6   4e-6    153    341    250     0.47   0.60   0.54
5    0.010      0.013   0.012   4e-7   2e-5   6e-6    237    804    430     0.55   0.76   0.64
6    0.010      0.013   0.011   3e-7   3e-5   8e-6    183    737    466     0.51   0.75   0.66
7    0.0086     0.013   0.010   2e-7   2e-5   7e-6    440    1541   927     0.67   0.95   0.80

Notes: The minimum, maximum, and mean values correspond to the 10 solutions of the optimization problem for the following quantities: J* is the final value of the cost functional. If the minimum or maximum value (with tolerance ε = 10^-6) is scored more than once, the number of instances is included in parentheses. |ē*| denotes the absolute value of the average output error over the learning data as stated in Corollary 1i. Its = total number of function/gradient evaluations (always computed together). CPU = time in seconds for solving the optimization problem.
Efficiency: The CPU time is negligible in this small example, but the number of iterations in the optimization varies a lot.

Generalization: Visually (Kärkkäinen, 2000), the best result is obtained
using the MLP corresponding to the minimal value of J* for $n_1 = 2$. However, the MLP corresponding to the maximal value of J* for $n_1 = 7$ also gives a good result. This illustrates the difficulty of naming the optimal network even in a simulated example. Moreover, from the large variation in the number of iterations, we conclude that when a fixed number of iterations is taken in the learning algorithm, one has no knowledge of the error between the obtained weights and the true (local) solution of the optimization problem.

4.1.1 Regularization of MLP Using Weight Decay. Weight decay (WD) is a popular technique for pruning an MLP (Goutte & Hansen, 1997; Gupta & Lam, 1998; Hintz-Madsen, Hansen, Larsen, Pederson, & Larsen, 1998; Ormoneit, 1999; Reed & Marks, 1999). In WD, the basic LMS error function is augmented with a WD term $R(W_{ij}^l)$, which imposes some restriction on the generality (universality) of the MLP transform to prevent overlearning. Here we start with the simplest possible strictly convex form with regard to the unknown weights by considering initially $R(W_{ij}^l) = \frac{\beta}{2} \sum_{l,i,j} (W_{ij}^l)^2$, where $\beta$ is the WD parameter. The choice of having only a single coefficient makes sense, because the network inputs and the outputs of the hidden layer are enforced into the same range. Moreover, for gaussian noise with a known estimate of variance, $\beta$ is actually the inverse of the Lagrange multiplier for the corresponding equality or inequality constraint and therefore single-valued (Chambolle & Lions, 1997). In general, the "best" value of $\beta$ is related to both the complexity of the MLP transformation and the (usually unknown) amount of noise contained in the learning data (Scherzer, Engl, & Kunisch, 1993). There exist, however, various techniques for obtaining an effective choice of $\beta$ (e.g., Bishop, 1995; Rohas, 1996; Rögnvaldsson, 1998).
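The quadratic WD term and its contribution to the gradient matrices can be sketched as follows (an illustrative helper; the full gradient is the LMS gradient of Theorem 2 plus this contribution):

```python
import numpy as np

beta = 1e-3   # the WD parameter; this value is the one chosen in the experiments

def wd_term(Ws):
    """Quadratic weight decay: (beta/2) * sum of all squared weights."""
    return 0.5 * beta * sum(np.sum(W**2) for W in Ws)

def wd_gradients(Ws):
    """Contribution of the WD term to each gradient matrix: simply beta * W^l."""
    return [beta * W for W in Ws]
```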
In addition to strict convexity, some particular reasons for choosing the proposed form of WD with the quadratic penalty function $p(w) = |w|^2$ are the following: This form improves the convexity of the cost functional and therefore makes the learning problem easier to solve. The proposed form forces the weights into a neighborhood of zero (similarly to prior distributions with zero mean, as suggested by Neal, 1996), thus further balancing their scale in a gradient-based optimization algorithm (Cavalieri & Mirabella, 1999). The smoothing property is due to the fact that the activation functions $t_k(a)$ are nearly linear around zero, although the size of the linear region decreases as $k$ increases. Furthermore, the first derivatives of the activation functions are most informative around zero, so that the proposed form of WD is
also helpful in preventing saturation of the weights by enforcing them into the neighborhood of this transient region (Kwon & Cheng, 1996; Vitela & Reifman, 1997).

One drawback of quadratic WD is that it produces weight matrices with groups of small components even when the choice of one large weight could be sufficient. This can yield unnecessarily large networks, even if overlearning can be prevented. On the other hand, this property increases the fault tolerance of the trained network due to so-called graceful degradation (Reed & Marks, 1999).

Let us comment on some of the difficulties of other forms of WD for individual weights suggested in the literature (Goutte & Hansen, 1997; Gupta & Lam, 1998; Saito & Nakano, 2000):

l1-formulation: Each weight $w$ is regularized using the penalization $p(w) = |w|$. This function is convex but not strictly so. A severe difficulty is that because the derivative of $p(w)$ is multivalued for $w = 0$, the resulting optimization problem is nonsmooth, that is, only subdifferentiable (Mäkelä & Neittaanmäki, 1992). In general, the function $|w|^p$ for $1 < p < 2$ belongs only to the Hölder space $C^{1,p-1}$ (Gilbarg & Trudinger, 1983), so that the assumptions for convergence of gradient descent (Lipschitz continuity of the gradient in batch mode, $C^2$ continuity for on-line stochastic iteration), CG (Lipschitz continuity of the gradient), and especially quasi-Newton methods ($C^2$ continuity) are violated (Haykin, 1994; Nocedal & Wright, 1999). As documented, for example, for the MLP by Saito and Nakano (2000) and for image restoration by Kärkkäinen, Majava, and Mäkelä (2001), this yields nonconvergence of ordinary training algorithms when the cost functional does not fulfill the required smoothness assumptions. Furthermore, even if a smoothed counterpart $\sqrt{w^2 + \epsilon}$ for $\epsilon > 0$ is introduced, this formulation is either (for small $\epsilon$) so close to the nonsmooth case that the convergence again fails, or otherwise (for larger $\epsilon$) the smoothed formulation differs substantially from the original one. To conclude, such nonsmooth optimization problems require special algorithms (Kärkkäinen & Majava, 2000a, 2000b; Kärkkäinen et al., 2001). Finally, these same difficulties also concern so-called robust backpropagation, where the error function is defined as $\sum_i \|\mathcal{N}(\{W^l\})(x_i) - y_i\|_{l_1}$ (Kosko, 1992).

Mixed WD: Each weight is regularized using the mixed penalization $p(w) = w^2/(1 + w^2)$. This form is nonconvex, producing even more local minima in the optimization problem, and it favors both small and large weights, thus destroying the balance in the scales of different weights.
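The three penalties discussed here can be compared numerically; a small illustration of their qualitative behavior:

```python
import numpy as np

w = np.linspace(-3.0, 3.0, 7)

p_quad  = w**2                 # smooth and strictly convex
p_l1    = np.abs(w)            # convex but nonsmooth at w = 0
p_mixed = w**2 / (1 + w**2)    # nonconvex and bounded by 1

# The l1 derivative sign(w) jumps at zero, while the quadratic derivative
# 2w vanishes there; the mixed penalty flattens out for large |w|, so it
# penalizes small and large weights almost equally.
assert np.all(p_mixed < 1.0)
assert p_l1[3] == 0.0 and p_quad[3] == 0.0
```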
The numerical results reported by Saito and Nakano (2000) emphasize the above difficulties. Whereas the combination of the quasi-Newton optimization algorithm and quadratic WD drastically improved both the convergence of the training algorithm and the generalization performance of the trained MLP, the l1-penalized problem was not convergent, and the mixed form of WD was very unstable.

However, Corollary 1 and the role of the bias terms as a shift from the origin raise the question: Which components of the weight matrices $\{W^l\}_{l=1}^L$ should be regularized and which not? Certainly, if condition i in Corollary 1 is required for the resulting MLP, one should exclude the bias terms of the final layer $W^L$ from WD. Similarly, for both conditions to hold, one should leave out all components of $W^L$ from WD. Hence, we apply the quadratic WD only to selected components of the weight matrices $(W^1, W^2)$. We distinguish four cases:

I. Regularize all components except the bias terms $W_0^2$ in the weight matrix $W^2$.

II. Exclude all components of $W^2$ from the regularization.

III. Exclude all bias terms of $(W^1, W^2)$ from the regularization (Holmström, Koistinen, Laaksonen, & Oja, 1997).

IV. Exclude all components of $W^2$ and the bias terms of $W^1$ from the regularization.

Remark. Let us state one further observation concerning the nonregularization of $W_0^2$. Using the error-average formula $\frac{1}{N}\sum_{i=1}^{N} e_i^* = O$ of Corollary 1 and the expression for $e_i^*$ according to equation 3.2 yields (cf. Bishop, 1995)
$$W_0^{2,*} = -\frac{1}{N}\sum_{i=1}^{N} \left[ W_1^{2,*}\, F(W^{1,*} \hat{x}_i) - y_i \right].$$
Hence, increased coercivity, and thereby uniqueness with regard to $(W_1^2, W^1)$, immediately affects the uniqueness of the bias $W_0^2$ as well.

The results corresponding to the above cases are presented in Tables 2 through 5. For all experiments, we have chosen $\beta = 10^{-3}$ according to some prior tests, which also indicated that the results obtained here were not very sensitive to the choice of $\beta$. Let us state some observations based on Tables 1 through 5:

Local minima: For regularization methods I and III, the minimal cost
function values are scored more than once when $n_1$ is small. However, there still exist a lot of local minima for larger networks.

Condition i in Corollary 1: Is valid for all regularization methods with a precision related to the stopping criterion.

Efficiency: By the number of iterations and the CPU time, regularization methods I and III improved the performance, whereas
Table 2: Computational Results in Example 1 for Regularization I.

          Jβ*                              |ē*|                    Its                  CPU
n1   Min        Max         Mean      Min    Max    Mean    Min    Max    Mean    Min    Max    Mean
2    0.022 (5)  0.025 (5)   0.023     3e-8   4e-6   1e-6    48     84     65      0.25   0.35   0.30
3    0.0195     0.0199 (2)  0.0197    3e-8   5e-6   2e-6    76     149    101     0.33   0.43   0.37
4    0.017      0.019       0.018     1e-7   4e-6   2e-6    88     208    146     0.38   0.51   0.44
5    0.017      0.018       0.017     1e-7   2e-5   7e-6    142    245    188     0.46   0.55   0.50
6    0.016      0.017 (2)   0.017     5e-7   1e-5   6e-6    145    382    244     0.47   0.63   0.55
7    0.016      0.017       0.016     2e-7   2e-5   5e-6    227    395    327     0.57   0.65   0.62

Note: Here and in the sequel, Jβ* refers to the value of the cost functional containing the regularization term.

Table 3: Computational Results in Example 1 for Regularization II.

          Jβ*                          |ē*|                    Its                   CPU
n1   Min     Max        Mean      Min    Max    Mean    Min    Max     Mean    Min    Max    Mean
2    0.015   0.016      0.015     2e-6   8e-5   2e-5    67     554     231     0.33   0.67   0.49
3    0.014   0.015      0.015     2e-7   1e-5   3e-6    131    1009    383     0.44   0.78   0.59
4    0.014   0.015      0.015     8e-8   2e-5   8e-6    257    963     414     0.48   0.79   0.61
5    0.014   0.015 (2)  0.014     2e-6   3e-5   1e-5    329    1283    832     0.61   0.87   0.75
6    0.013   0.014      0.014     3e-6   3e-5   1e-5    607    2402    1331    0.72   1.10   0.88
7    0.013   0.014      0.014     1e-7   1e-5   4e-6    1213   2472    1599    0.86   1.13   0.96
Table 4: Computational Results in Example 1 for Regularization III.

          Jβ*                              |ē*|                    Its                  CPU
n1   Min        Max        Mean       Min    Max    Mean    Min    Max    Mean    Min    Max    Mean
2    0.022 (5)  0.025 (5)  0.023      1e-7   3e-6   1e-6    49     93     71      0.25   0.36   0.30
3    0.019 (4)  0.025      0.019      3e-8   7e-6   2e-6    73     173    109     0.33   0.48   0.39
4    0.017 (2)  0.019      0.018      3e-7   6e-6   2e-6    87     172    121     0.36   0.48   0.41
5    0.017      0.019      0.017      3e-7   1e-5   4e-6    80     205    141     0.36   0.51   0.45
6    0.016      0.017      0.017      3e-7   7e-6   3e-6    104    199    153     0.40   0.51   0.47
7    0.016      0.018      0.016      2e-6   3e-5   9e-6    118    537    269     0.44   0.70   0.56

Table 5: Computational Results in Example 1 for Regularization IV.

          Jβ*                          |ē*|                    Its                   CPU
n1   Min     Max        Mean      Min    Max    Mean    Min    Max     Mean    Min    Max    Mean
2    0.015   0.016      0.015     4e-7   2e-5   1e-5    73     913     235     0.31   0.75   0.47
3    0.014   0.016      0.015     9e-7   5e-5   1e-5    110    953     352     0.39   0.77   0.56
4    0.014   0.015      0.015     2e-7   3e-5   9e-6    151    744     344     0.47   0.74   0.58
5    0.014   0.015 (2)  0.014     9e-8   1e-5   5e-6    244    2436    1176    0.55   1.09   0.82
6    0.013   0.014      0.014     2e-7   2e-5   7e-6    361    2135    1184    0.62   1.05   0.84
7    0.012   0.014      0.014     1e-8   2e-5   7e-6    321    3119    1368    0.61   1.19   0.89
methods II and IV made it worse compared to the unregularized approach.

Generalization: All regularization approaches improve the generalization compared to the unregularized problem by preventing oscillation in the final mapping generated by the MLP. When $n_1$ is increased, regularization methods I and III seem to be more stable and thus more robust than methods II and IV.

Conclusion: Of the four regularization methods tested, I and III are preferable to II and IV in every respect. One cannot distinguish between I and III, and we note that centering the input data already decreases the significance of the hidden bias.

4.2 Classification.
Example 2. We consider the well-known benchmark of classifying Iris flowers (cf. Gupta & Lam, 1998) according to measurements obtained from the UCI repository (Blake & Merz, 1998). In this example, we have $n_2 = 3$ (number of classes), $n_0 = 4$ (number of features), and initially 50 samples from each of the three classes. Due to the choice of the k-tanh activation functions, prescaling of the output data $\{y_i\}$ into the range $[-1, 1]$ destroys the linear independence between the different classes.
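The 40/10 per-class learning/test split used for each permutation can be sketched as follows; class_split is a hypothetical helper, not code from the paper:

```python
import numpy as np

def class_split(labels, n_train=40, rng=None):
    """For each class, n_train randomly permuted samples form the learning
    set and the remaining ones the test set, as in Example 2."""
    if rng is None:
        rng = np.random.default_rng()
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)
```

For the Iris data (3 classes of 50 samples) this yields the N = 120 learning samples and 30 test samples per permutation mentioned below.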
In Tables 6 through 11, we have solved the classification (optimization) problem again 10 times, using 40 random samples from each class as the learning set and the remaining 10 samples from each class as the test set. These two data sets have been formed using five different random permutations of the initial data, realizing a simple cross-validation technique. Hence, we have $N = 120$, and the remaining 30 samples form the test set for each permutation. In Tables 9 through 11, we have added noise to the inputs in the permuted learning sets. First, the means with regard to each of the four features within the three classes have been computed. After that, the class mean vectors have been multiplied componentwise with a noise vector $\delta e_i$ for $\delta = 0.3$ and $e_i \in N(0, 1)$, and this has been added to the unscaled learning data.

To this end, we derive some final observations based on the computational results, especially in Example 2:

Local minima: In all numerical tests for unregularized and regularized learning problems, the L-BFGS optimization algorithm was convergent. This suggests that by using the proposed combination of linear final layer, prescaling, and mixed activation, we are able to deal with the flat error surfaces during the training. Furthermore, because usually in all 10 test runs of one learning problem a different value of the cost functional is obtained, the different results are revealing "true"
Table 6: Computational Results in Example 2 Without Regularization for n1 = 9.

        CL                  C              Its                  CPU
P     Min  Max  Mean    CL    CT     Min    Max    Mean    Min    Max    Mean
1     0    0    0       0     2      109    534    239     0.37   0.60   0.47
2     0    0    0       0     3      81     664    290     0.30   0.65   0.49
3     0    1    1       0     1      106    1243   492     0.35   0.93   0.57
4     0    0    0       0     2      129    891    349     0.36   0.75   0.51
5     0    1    1       0     0      92     702    286     0.31   0.67   0.48
Mean  0    0.4  0.4     0     1.6    103    807    331     0.34   0.72   0.50

Notes: The first column, P, contains the permutation index. In the second column, CL, minimum, maximum, and (rounded) mean values for the number of false classifications in the learning set are given. The third column, C, includes the numbers of false classifications in the learning and test sets (denoted CL and CT) for the optimal perceptron N_P* satisfying CL(N_P*) + CT(N_P*) ≤ CL(N_P) + CT(N_P) over all networks encountered during the 10 optimization runs for the current permutation P.
Table 7: Computational Results in Example 2 for Regularization I with ε = 10^-3 and n1 = 9.

        CL                  C              Its                   CPU
P     Min  Max  Mean    CL    CT     Min    Max     Mean    Min    Max    Mean
1     0    0    0       0     2      517    1280    843     0.60   0.93   0.76
2     0    0    0       0     3      443    814     551     0.57   0.72   0.63
3     0    0    0       0     0      742    1797    1047    0.68   1.04   0.82
4     0    0    0       0     2      426    1451    792     0.56   0.91   0.71
5     0    1    0       0     0      646    1604    884     0.69   1.00   0.76
Mean  0    0.2  0       0     1.4    555    1389    823     0.62   0.92   0.74

Table 8: Computational Results in Example 2 for Regularization III with ε = 10^-3 and n1 = 9.

        CL                  C              Its                  CPU
P     Min  Max  Mean    CL    CT     Min    Max    Mean    Min    Max    Mean
1     0    0    0       0     2      369    939    543     0.56   0.77   0.61
2     0    0    0       0     3      484    923    602     0.58   0.80   0.64
3     0    2    1       0     0      418    868    677     0.56   0.78   0.67
4     0    0    0       0     3      382    643    507     0.54   0.69   0.60
5     0    3    1       0     0      413    973    619     0.56   0.82   0.64
Mean  0    1    0.4     0     1.6    413    869    590     0.56   0.77   0.63
Table 9: Computational Results in Example 2 Without Regularization for n1 = 9 and Noisy Learning Set.

        CL                  C              Its                   CPU
P     Min  Max  Mean    CL    CT     Min    Max     Mean    Min    Max    Mean
1     0    1    0       0     2      275    3008    968     0.49   1.19   0.73
2     0    1    1       0     2      240    460     339     0.47   0.57   0.52
3     0    3    1       0     1      175    2680    948     0.41   1.16   0.73
4     0    1    1       0     2      146    1655    589     0.39   0.98   0.61
5     0    2    1       0     0      284    5464    927     0.50   1.42   0.65
Mean  0    1.6  0.8     0     1.4    224    2653    754     0.45   1.06   0.65

Table 10: Computational Results in Example 2 for Regularization I with ε = 10^-3, n1 = 9, and Noisy Learning Set.

        CL                  C              Its                   CPU
P     Min  Max  Mean    CL    CT     Min    Max     Mean    Min    Max    Mean
1     0    1    0       0     2      492    1196    724     0.59   0.88   0.68
2     0    0    0       0     3      486    1059    782     0.58   0.81   0.70
3     0    0    0       0     1      579    982     739     0.62   0.78   0.69
4     0    1    0       0     2      616    1120    862     0.64   0.84   0.74
5     0    1    1       0     0      464    1211    756     0.58   0.88   0.69
Mean  0    0.6  0.2     0     1.6    527    1114    773     0.60   0.84   0.70
Minimum
0 0 0 0 0
0
P
1 2 3 4 5
mean
1.8
1 2 3 2 1
Maximum
CL
0.8
0 1 1 1 1
Mean
0
0 0 0 0 0
CL
1.6
2 3 1 2 0
CT
C
421
434 365 464 386 458
Minimum
955
1024 941 800 1007 1003
Maximum
Its
672
710 580 653 725 690
Mean
0.56
0.56 0.54 0.58 0.54 0.58
Minimum
0.78
0.80 0.76 0.71 0.83 0.79
Maximum
CPU
0.66
0.68 0.62 0.65 0.69 0.67
Mean
Table 11: Computational Results in Example 2 for Regularization III with e D 10¡3 , n1 D 9, and Noisy Learning Set.
local minima instead of just some symmetrical representations of the same MLP transformation.

Efficiency: By comparing the number of iterations and the CPU time, the unregularized problem seems to be easier to solve than the regularized one (with given β) for the clean learning data. On average, there is no real difference between the unregularized and regularized problems for the noisy learning data, but the variation in the number of iterations is significantly larger for the unregularized approach.

Comparison: There seems to be no significant difference between the quality of perceptrons resulting from the unregularized and regularized problems, with or without additional noise. This suggests that the iris learning data are quite stable and contain only a few cases with nongaussian degradations. Moreover, according to the observations in example 1, one reason for the quite similar behavior can be the size of n1, which is not very large compared to n0 and n2. Finally, we cannot favor either of the two regularization methods I and III according to these results.

Generalization: The number of false classifications in the test set is between 0 and 3, that is, between 0% and 10%, in all test runs. A single preferable choice (an ensemble, of course, is another possibility) for an MLP classifier would probably be the one with the median number of false classifications in CT, obtained by using permutation 1 or 4, even if for the fifth permutation an optimal classifier was found according to the given data. Notice that the amount of variation in CT over the different permutations provides useful information on the quality of the data.

Appendix A: Proofs of Lemmas

Proof of Lemma 1.
The functional J(W) in component-wise form reads as
\[
J(W) = \frac{1}{2} \sum_{i=1}^{m_2} \left( \sum_{j=1}^{m_1} w_{ij}\, v_j - y_i \right)^2,
\]
where i represents the row and j the column index of W, respectively. A straightforward calculation shows that
\[
\frac{\partial J}{\partial w_{ik}} = \left( \sum_{j=1}^{m_1} w_{ij}\, v_j - y_i \right) v_k
\;\Rightarrow\;
\frac{\partial J}{\partial w_{i,:}} = [W v - y]_i\, v^T
\;\Rightarrow\;
\frac{\partial J}{\partial W} = [W v - y]\, v^T.
\tag{A.1}
\]
Here we have used the Matlab-type abbreviation w_{i,:} for the ith row vector of the matrix W.
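Lemma 1's matrix gradient is easy to verify numerically. The following is a small numpy sanity check (not part of the paper; the dimensions and random data are illustrative assumptions), comparing the closed form of equation A.1 against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2 = 4, 3                      # layer widths, as in the lemma
W = rng.normal(size=(m2, m1))
v = rng.normal(size=m1)
y = rng.normal(size=m2)

def J(W):
    # J(W) = (1/2) * || W v - y ||^2, the functional of lemma 1
    return 0.5 * np.sum((W @ v - y) ** 2)

# Closed form of equation A.1: dJ/dW = (W v - y) v^T.
grad = np.outer(W @ v - y, v)

# Central finite differences, entry by entry; exact here up to rounding,
# since J is quadratic in W.
eps = 1e-6
fd = np.zeros_like(W)
for i in range(m2):
    for k in range(m1):
        E = np.zeros_like(W)
        E[i, k] = eps
        fd[i, k] = (J(W + E) - J(W - E)) / (2 * eps)

assert np.allclose(grad, fd, atol=1e-6)
```

The same finite-difference pattern can be used to check the layer-wise gradients of lemmas 2 through 4.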
Proof of Lemma 2. As above, consider the component-wise form of the functional
\[
J(u) = \frac{1}{2} \sum_{j=1}^{m_2} \left( \sum_{i=1}^{m_1} w_{ji}\, f_i(u_i) - y_j \right)^2 = \sum_{j=1}^{m_2} J_j(u),
\quad\text{where}\quad
J_j(u) = \frac{1}{2} \left[ W \mathcal{F}(u) - y \right]_j^2.
\tag{A.2}
\]
Then
\[
\frac{\partial J_j}{\partial u_k} = \left[ \sum_{i=1}^{m_1} w_{ji}\, f_i(u_i) - y_j \right] w_{jk}\, f_k'(u_k) = w_{jk}\, f_k'(u_k)\, [W \mathcal{F}(u) - y]_j.
\tag{A.3}
\]
Remember here that u is a vector with m_1 components. As in lemma 1, we get the derivative of J(u) with respect to u by treating k as the row index and j as the column index. Proceeding like this, we obtain from equation A.3
\[
\nabla_u J(u) =
\begin{pmatrix}
w_{11}\, f_1'(u_1) & \cdots & w_{m_2 1}\, f_1'(u_1) \\
\vdots & \ddots & \vdots \\
w_{1 m_1}\, f_{m_1}'(u_{m_1}) & \cdots & w_{m_2 m_1}\, f_{m_1}'(u_{m_1})
\end{pmatrix}
[W \mathcal{F}(u) - y]
= \left( W \mathcal{F}'(u) \right)^T [W \mathcal{F}(u) - y],
\tag{A.4}
\]
which is the desired result, since \((W \mathcal{F}'(u))^T = \mathrm{Diag}\{\mathcal{F}'(u)\}\, W^T\).

Proof of Lemma 3. Here we introduce the basic Lagrangian technique to simplify the calculations. As the first step, we define the extra variable u = Wv and, instead of the original problem, consider the equivalent constrained optimization problem,
\[
\min_{(u, W)} \tilde{J}(u, W) = \frac{1}{2} \| \bar{W} \mathcal{F}(u) - y \|^2
\quad \text{subject to} \quad u = W v,
\tag{A.5}
\]
where u and W are treated as independent variables linked together by the given constraint. The Lagrange functional associated with equation A.5 reads as
\[
L(u, W, \lambda) = \tilde{J}(u, W) + \lambda^T (u - W v),
\tag{A.6}
\]
where λ ∈ R^{m_1} contains the Lagrangian variables for the constraint. Noticing that the values of the functions J̃(u, W) in equation A.5 and L(u, W, λ) in equation A.6 coincide if the constraint u = Wv is satisfied, it follows that a solution of equation A.5 is equivalently characterized by the saddle-point conditions ∇L(u, W, λ) = 0. Therefore, we compute the derivatives ∇_u L, ∇_W L, and ∇_λ L (in a suitable form) for the Lagrangian.
The gradient vector ∇_u L(u, W, λ) is, due to lemma 2, of the form
\[
\nabla_u L = \nabla_u \tilde{J}(u) + \lambda = \mathrm{Diag}\{\mathcal{F}'(u)\}\, \bar{W}^T [\bar{W} \mathcal{F}(u) - y] + \lambda.
\tag{A.7}
\]
The other derivatives of the Lagrangian are given by ∇_W L = −λ v^T and ∇_λ L = u − Wv. Using the formula −λ = Diag{F'(u)} \bar{W}^T [\bar{W} F(u) − y], due to equation A.7, and substituting this into ∇_W L, we obtain
\[
\nabla_W L = \mathrm{Diag}\{\mathcal{F}'(u)\}\, \bar{W}^T [\bar{W} \mathcal{F}(u) - y]\, v^T.
\tag{A.8}
\]
Due to the equivalency of equations A.5 and A.6 with the original problem, the desired gradient matrix is given by equation A.8 when Wv is substituted for u.

Proof of Lemma 4. To simplify the calculations, we now introduce two extra variables, u = Wv and ũ = W̃ F̃(u) = W̃ F̃(Wv). As in the previous
proof, we first consider the constrained optimization problem:
\[
\min_{(u, \tilde{u}, W)} \tilde{J}(u, \tilde{u}, W) = \frac{1}{2} \| \bar{W} \bar{\mathcal{F}}(\tilde{u}) - y \|^2
\quad \text{s.t.} \quad u = W v, \;\; \tilde{u} = \tilde{W} \tilde{\mathcal{F}}(u).
\tag{A.9}
\]
The Lagrange functional associated with equation A.9 reads as
\[
L(u, \tilde{u}, W, \lambda, \tilde{\lambda}) = \tilde{J}(u, \tilde{u}, W) + \lambda^T (u - W v) + \tilde{\lambda}^T (\tilde{u} - \tilde{W} \tilde{\mathcal{F}}(u)).
\tag{A.10}
\]
Using similar techniques as in the previous proof, the derivatives for the saddle-point conditions of the Lagrangian are given by ∇_u L = −[W̃ F̃'(u)]^T λ̃ + λ, ∇_ũ L = [\bar{W} \bar{F}'(ũ)]^T [\bar{W} \bar{F}(ũ) − y] + λ̃, ∇_W L = −λ v^T, ∇_λ L = u − Wv, and ∇_λ̃ L = ũ − W̃ F̃(u). From ∇_{(u, ũ)} L = 0 we obtain
\[
\lambda = [\tilde{W} \tilde{\mathcal{F}}'(u)]^T \tilde{\lambda}
= -[\tilde{W} \tilde{\mathcal{F}}'(u)]^T [\bar{W} \bar{\mathcal{F}}'(\tilde{u})]^T [\bar{W} \bar{\mathcal{F}}(\tilde{u}) - y],
\tag{A.11}
\]
which, substituted into ∇_W L, yields
\[
\nabla_W L = \mathrm{Diag}\{\tilde{\mathcal{F}}'(u)\}\, \tilde{W}^T\, \mathrm{Diag}\{\bar{\mathcal{F}}'(\tilde{u})\}\, \bar{W}^T [\bar{W} \bar{\mathcal{F}}(\tilde{u}) - y]\, v^T.
\tag{A.12}
\]
This, together with the original expressions u = Wv and ũ = W̃ F̃(u), proves the result.

Acknowledgments
I thank Pasi Koikkalainen for many enlightening discussions concerning ANNs. The comments and suggestions of the unknown referee improved and clarified the presentation significantly, especially through the excellent reference Orr and Müller (1998).
References

Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming: Theory and algorithms (2nd ed.). New York: Wiley.
Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. New York: Academic Press.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Available on-line: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Byrd, R. H., Lu, P., & Nocedal, J. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 5, 1190–1208.
Caruna, R. (1998). A dozen tricks with multitask learning. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Cataltepe, Z., Abu-Mostafa, Y. S., & Magdon-Ismail, M. (1999). No free lunch for early stopping. Neural Computation, 11, 995–1009.
Cavalieri, S., & Mirabella, O. (1999). A novel learning algorithm which improves the partial fault tolerance of multilayer neural networks. Neural Networks, 12, 91–106.
Chambolle, A., & Lions, P. L. (1997). Image recovery via total variation minimization and related problems. Numerische Mathematik, 76, 167–188.
Chen, D. S., & Jain, R. C. (1994). A robust back propagation learning algorithm for function approximation. IEEE Transactions on Neural Networks, 5, 467–479.
de Villiers, J., & Barnard, E. (1992). Backpropagation neural networks with one and two hidden layers. IEEE Transactions on Neural Networks, 1, 136–141.
Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks: Theory and applications. New York: Wiley.
Di Martino, M., Fanelli, S., & Protasi, M. (1996). Exploring and comparing the best "direct methods" for the efficient training of MLP-networks. IEEE Transactions on Neural Networks, 7, 1497–1502.
Flake, G. W. (1998). Square unit augmented, radially extended, multilayer perceptrons. In G. B. Orr & K.-R.
Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Gilbarg, D., & Trudinger, N. S. (1983). Elliptic partial differential equations of second order. Berlin: Springer-Verlag.
Gorse, D., Shepherd, A. J., & Taylor, J. G. (1997). The new ERA in supervised learning. Neural Networks, 10, 343–352.
Goutte, C., & Hansen, L. K. (1997). Regularization with a pruning prior. Neural Networks, 10, 1053–1059.
Griewank, A. (2000). Evaluating derivatives: Principles and techniques of algorithmic differentiation. Philadelphia: SIAM.
Gupta, A., & Lam, S. M. (1998). Weight decay backpropagation for noisy data. Neural Networks, 11, 1527–1137.
Hagan, M. T., & Menhaj, M. B. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5, 989–993.
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.
Hintz-Madsen, M., Hansen, L. K., Larsen, J., Pedersen, M. W., & Larsen, M. (1998). Neural classifier construction using regularization, pruning and test error estimation. Neural Networks, 11, 1659–1670.
Holmström, L., Koistinen, P., Laaksonen, J., & Oja, E. (1997). Neural and statistical classifiers—taxonomy and two case studies. IEEE Transactions on Neural Networks, 8, 5–17.
Horn, D., Naftaly, U., & Intrator, N. (1998). Large ensemble averaging. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Japkowicz, N., Hanson, S. J., & Gluck, M. A. (2000). Nonlinear autoassociation is not equivalent to PCA. Neural Computation, 12, 531–545.
Kärkkäinen, T. (2000). MLP-network in a layer-wise form: Derivations, consequences, and applications to weight decay (Tech. Rep. No. C1). Jyväskylä, Finland: Department of Mathematical Information Technology, University of Jyväskylä.
Kärkkäinen, T., & Majava, K. (2000a). Nonmonotone and monotone active-set methods for image restoration, Part 1: Convergence analysis. Journal of Optimization Theory and Applications, 106, 61–80.
Kärkkäinen, T., & Majava, K. (2000b). Nonmonotone and monotone active-set methods for image restoration, Part 2: Numerical results. Journal of Optimization Theory and Applications, 106, 81–105.
Kärkkäinen, T., Majava, K., & Mäkelä, M. M. (2001). Comparison of formulations and solution methods for image restoration problems. Inverse Problems, 17, 1977–1995.
Kärkkäinen, T., & Toivanen, J. (2001). Building blocks for odd-even multigrid with applications to reduced systems. Journal of Computational and Applied Mathematics, 131, 15–33.
Kosko, B. (1992).
Neural networks and fuzzy systems: A dynamical systems approach to machine intelligence. Englewood Cliffs, NJ: Prentice-Hall.
Kwon, T. M., & Cheng, H. (1996). Contrast enhancement for backpropagation. IEEE Transactions on Neural Networks, 7, 515–524.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient backprop. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Liano, K. (1996). Robust error measure for supervised neural network learning with outliers. IEEE Transactions on Neural Networks, 7, 246–250.
Liu, Y., & Yao, X. (1999). Ensemble learning via negative correlation. Neural Networks, 12, 1399–1404.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Magoulas, G. D., Vrahatis, M. N., & Androulakis, G. S. (1999). Improving the convergence of the backpropagation algorithm using learning rate adaptation methods. Neural Computation, 11, 1769–1796.
Mäkelä, M. M., & Neittaanmäki, P. (1992). Nonsmooth optimization: Analysis and algorithms with applications to optimal control. Singapore: World Scientific.
McKeown, J. J., Stella, F., & Hall, G. (1997). Some numerical aspects of the training problem for feed-forward neural nets. Neural Networks, 10, 1455–1463.
McLean, D., Bandar, Z., & O'Shea, J. D. (1998). An empirical comparison of back propagation and the RDSE algorithm on continuously valued real world data. Neural Networks, 11, 1685–1694.
Moody, J. O., & Antsaklis, P. J. (1996). The dependence identification neural network construction algorithm. IEEE Transactions on Neural Networks, 7, 3–15.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Ormoneit, D. (1999). A regularization approach to continuous learning with an application to financial derivatives pricing. Neural Networks, 12, 1405–1412.
Orr, G. B., & Müller, K.-R. (Eds.). (1998). Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Osman, I. H., & Kelly, J. P. (Eds.). (1996). Meta-heuristics: Theory and applications. Norwell, MA: Kluwer.
Park, D. C., El-Sharkawi, M. A., & Marks II, R. J. (1991). An adaptively trained neural network. IEEE Transactions on Neural Networks, 2, 334–345.
Prechelt, L. (1998). Early stopping. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Reed, R. D., & Marks II, R. J. (1999). Neural smithing: Supervised learning in feedforward artificial neural networks. Cambridge, MA: MIT Press.
Rögnvaldsson, T. S. (1998). A simple trick for estimating the weight decay parameter. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Rojas, R. (1996). Neural networks: A systematic introduction. Berlin: Springer-Verlag.
Saito, K., & Nakano, R. (2000). Second-order learning algorithm with squared penalty term.
Neural Computation, 12, 709–729.
Schaal, S., & Atkeson, C. G. (1998). Constructive incremental learning from only local information. Neural Computation, 10, 2047–2084.
Scherzer, O., Engl, H. W., & Kunisch, K. (1993). Optimal a posteriori parameter choice for Tikhonov regularization for solving nonlinear ill-posed problems. SIAM Journal on Numerical Analysis, 30, 1796–1838.
Tamura, S., & Tateishi, M. (1997). Capabilities of a four-layered feedforward neural network: Four layers versus three. IEEE Transactions on Neural Networks, 8, 251–255.
Tetko, I. V., & Villa, A. E. P. (1997). Efficient partition of learning data sets for neural network training. Neural Networks, 10, 1361–1374.
Thimm, G., & Fiesler, E. (1997). High-order and multilayer perceptron initialization. IEEE Transactions on Neural Networks, 8, 349–359.
van de Laar, P., & Heskes, T. (1999). Pruning using parameter and neuronal metrics. Neural Computation, 11, 977–993.
van der Smagt, P., & Hirzinger, G. (1998). Solving the ill-conditioning in neural network learning. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Vitela, J. E., & Reifman, J. (1997). Premature saturation in backpropagation networks: Mechanism and necessary conditions. Neural Networks, 10, 721–735.
Wang, G.-J., & Chen, C.-C. (1996). A fast multilayer neural-network training algorithm based on the layer-by-layer optimizing procedures. IEEE Transactions on Neural Networks, 7, 768–775.
Wang, Y.-J., & Lin, C.-T. (1998). A second-order learning algorithm for multilayer networks based on block Hessian matrix. Neural Networks, 11, 1607–1622.
Yaeger, L. S., Webb, B. J., & Lyon, R. F. (1998). Combining neural networks and context-driven search for on-line printed handwriting recognition in the Newton. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer-Verlag.
Zhang, J., & Morris, A. J. (1998). A sequential learning approach for single hidden layer neural networks. Neural Networks, 11, 65–80.

Received November 20, 2000; accepted October 12, 2001.
LETTER
Communicated by Dana Ron
Local Overfitting Control via Leverages

Gaëtan Monari
[email protected]
Ecole Supérieure de Physique et de Chimie Industrielles de la Ville de Paris, Laboratoire d'Electronique, F 75005 Paris, France, and Usinor, DSI/DISA Sollac, F 13776 Fos-sur-Mer Cedex, France

Gérard Dreyfus
[email protected]
Ecole Supérieure de Physique et de Chimie Industrielles de la Ville de Paris, Laboratoire d'Electronique, F 75005 Paris, France

We present a novel approach to dealing with overfitting in black box models. It is based on the leverages of the samples, that is, on the influence that each observation has on the parameters of the model. Since overfitting is the consequence of the model specializing on specific data points during training, we present a selection method for nonlinear models based on the estimation of leverages and confidence intervals. It allows both the selection among various models of equivalent complexities corresponding to different minima of the cost function (e.g., neural nets with the same number of hidden units) and the selection among models having different complexities (e.g., neural nets with different numbers of hidden units). A complete model selection methodology is derived.

1 Introduction
The traditional view of overfitting refers mostly to the bias-variance trade-off introduced in Geman, Bienenstock, and Doursat (1992). A family of parameterized functions with too few parameters, with respect to the complexity of a problem, is said to have too large a bias, because it cannot fit the deterministic model underlying the data. Conversely, when the model is overparameterized, the dependence of the resulting functions on the particular training set is too large, and so is the variance of the corresponding family of parameterized functions. Therefore, overfitting is usually detected by the fact that the modeling error on a test set is much larger than the modeling error on the training data. In practice, there are two major ways of preventing overfitting:

A priori, by limiting the variance of the considered family of parameterized functions. These regularization methods include weight decay

Neural Computation 14, 1481–1506 (2002) © 2002 Massachusetts Institute of Technology
(see MacKay, 1992, for a Bayesian approach to weight decay) and similar cost function penalizations, as well as early stopping (Sjöberg & Ljung, 1992). None of them exempts the model designer from an additional estimation of the generalization performance of the selected model.

A posteriori, by estimating the generalization performance on data that have not been used to fit the model. This approach relies on data resampling and has given rise to the cross-validation methods (Stone, 1974), including leave-one-out. Statistical tests can be used after selecting candidate models, to test whether differences in the estimated performances are significant (see, e.g., Anders & Korn, 1999).

The limitations of the above methods are well known: weight decay and similar methods require the estimation of meta-parameters, and resampling methods tend to be computationally intensive.

In this article, we consider overfitting as a local phenomenon that occurs when the influence of one (or more) particular examples on the model becomes too large because of the excessive flexibility of the model. Therefore, we suggest a new approach to model selection that takes overfitting into account on the basis of the leverages of the available samples, that is, on the influence of each sample on the parameters of the model. In the next section, we briefly recall the mathematical framework of this approach. In sections 3 and 4, we show that this perspective on overfitting suggests a model selection method that takes into account both an estimation of the generalization error and the distribution of the influences of the training examples. Section 3 is devoted to model selection within a given family of parameterized functions, and section 4 focuses on the choice of the appropriate family among several candidate families. Section 5 illustrates the method with several examples.

2 Mathematical Framework
This article discusses static single-output processes with a nonrandom n-input vector x = [x_1, …, x_n]^T and an output y_p that is considered as a measurement of a random variable Y_p. We assume that an appropriate model can be written in the form
\[
Y_p = r(x) + W,
\tag{2.1}
\]
where W is a zero-mean random variable and r(x) is the unknown regression function. A data set of N input-output pairs {x^k, y_p^k}_{k=1,…,N} is assumed to be available for estimating the parameters of the model f(x, θ). In the following, all vectors are column vectors, denoted by boldface letters (e.g., the n-vectors x and {x^k}).
In Monari and Dreyfus (2000), the effect of withdrawing an example from the training set was investigated. Assume that a model with parameters θ_LS has been found by minimizing the least-squares cost function computed on the training set. Under the sole assumption that the removal of an example from the training set has a small effect on θ_LS (in contrast to standard leave-one-out, no stability assumption is required in this approach, as discussed in section 3.2 and in appendix B), a second-order Taylor development of the model output with respect to the parameters was derived. It was shown that this generates an approximate model that is locally linear with respect to the parameters and whose variables are the components of the gradient of the model output with respect to the parameters. Introducing the Jacobian matrix
\[
Z = [z^1, \ldots, z^N]^T, \quad\text{where}\quad z^i = \left. \frac{\partial f(x^i, \theta)}{\partial \theta} \right|_{\theta = \theta_{LS}},
\]
the solution subspace can be defined (in analogy with linear models) as the subspace determined by the columns of matrix Z (assuming that the latter has full rank). Then the quantity
\[
h_{ii} = z^{iT} (Z^T Z)^{-1} z^i
\tag{2.2}
\]
is the ith component of the projection, on the solution subspace, of the unit vector along axis i. The quantities {h_ii}_{i=1,…,N}, termed leverages of the training examples, satisfy the following relations:
\[
0 \le h_{ii} \le 1 \quad \forall i \in [1, \ldots, N],
\tag{2.3}
\]
\[
\sum_{i=1}^{N} h_{ii} = q,
\tag{2.4}
\]
where q is the number of adjustable parameters of the model (e.g., the number of weights of a neural network and the number of monomials of a polynomial approximation). Equations 2.3 and 2.4 stem from the fact that the quantities {h_ii}_{i=1,…,N} are the diagonal terms of the orthogonal projection matrix Z (Z^T Z)^{-1} Z^T. If axis i is orthogonal to the solution subspace, all columns have their ith component equal to zero; hence, z^i = 0 and h_ii = 0. Since the output of the model is the orthogonal projection of the process output onto the solution subspace, example i has essentially no influence on the model (it has none in the case of a linear model). If axis i lies within the solution subspace, then h_ii = 1 and example i is learned almost perfectly (it is learned perfectly in the case of a linear model). Thus, h_ii is an indication of the influence of example i on the model: the closer h_ii is to 1, the larger the influence of example i is on the model. This will be further shown below.

If all examples have the same influence on the model, all leverages {h_ii}_{i=1,…,N} equal q/N. In other words, each example uses up a fraction 1/N of the q available degrees of freedom (adjustable parameters) of the model. If one considers that overfitting results from an excessive influence of one (or more) examples on a model, then model selection aims at obtaining the model that has the best performances and in which the influences of all examples are roughly equal.

The influence of an example on the model should be reflected in the effect of its withdrawal from the training set. If an example has a large influence on the model (h_ii ≈ 1), it should be very accurately learned when it is present in the training set, but it should be badly predicted otherwise; conversely, if an example has no influence on the model (h_ii ≈ 0), it should be predicted with equal accuracy regardless of its presence or absence in the training set. Indeed, if we denote by R_i the residual of example i (i.e., the modeling error on example i when it is present in the training set: R_i = y_p^i − f(x^i, θ_LS)), an approximate expression of the prediction error R_i^{(−i)} that would occur if this example had been removed from the training set is given by
\[
R_i^{(-i)} \approx \frac{R_i}{1 - h_{ii}}.
\tag{2.5}
\]
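Relations 2.2 through 2.5 are easy to check numerically for a model that is linear in its parameters, in which case the Jacobian Z reduces to the design matrix. The following is a minimal numpy sketch (not from the article; the polynomial model and random data, mimicking the sin x / x problem described later in this section, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 50, 6                       # examples and adjustable parameters

# Data in the spirit of the sin x / x problem used in this article.
x = rng.uniform(0.0, 15.0, N)
y = np.sinc(x / np.pi) + rng.normal(0.0, 0.1, N)   # np.sinc(t) = sin(pi t)/(pi t)

# Linear-in-parameters model f(x, theta) = sum_k theta_k x^k, so the
# Jacobian Z of equation 2.2 is simply the polynomial design matrix.
Z = np.vander(x, q, increasing=True)
theta_ls, *_ = np.linalg.lstsq(Z, y, rcond=None)
R = y - Z @ theta_ls               # residuals R_i

# Leverages via SVD (the appendix A route): h_ii = sum_k U[i, k]^2.
U, sv, Vt = np.linalg.svd(Z, full_matrices=False)
h = np.sum(U ** 2, axis=1)

assert np.all((h >= 0.0) & (h <= 1.0))   # relation 2.3
assert np.isclose(h.sum(), q)            # relation 2.4

# Virtual leave-one-out prediction errors, relation 2.5.
R_loo = R / (1.0 - h)
```

For a neural network, Z would instead contain the gradients of the model output with respect to the parameters at θ_LS, typically obtained by backpropagation; the leverage formulas are unchanged.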
Details of the derivation of this relation are given in Monari and Dreyfus (2000). This approximation is founded insofar as the second-order Taylor development of the output is valid (i.e., third-order terms are negligible). Hence, the difference between the predictions of example i, depending on whether it belongs to the training set or not, is a function of its leverage h_ii.¹ This property will be taken advantage of in the selection method presented in the next sections.

If the Jacobian matrix Z is not of rank q, that is, if the manifold of R^N defined by the columns of Z is, at θ_LS, not of full rank, the corresponding models must be discarded. For such models, the number of available parameters is locally too large in view of the number of training examples, which leads to an underdetermination of these parameters. This is an extreme case of overfitting that can be detected by computing the rank of Z directly (a difficult numerical task) or indirectly by checking the validity of relations 2.3 and 2.4. The latter solution supposes that we are able to compute the {h_ii}_{i=1,…,N} regardless of whether Z^T Z is invertible or not. To address this problem, a singular value decomposition (SVD; see, e.g., Press, Teukolsky, Flannery, & Vetterling, 1988) of matrix Z can be performed, as shown in appendix A. SVD is very accurate and can always be performed
¹ Actually, when h_ii approaches 1, the residual R_i vanishes less rapidly than (1 − h_ii). This can be understood as follows: if an example has a strong influence on the model, its withdrawal from the training set causes its residual to be significantly different. Therefore, from relation 2.5, the quantity R_i^{(−i)} − R_i = (h_ii / (1 − h_ii)) R_i is not vanishingly small; hence, the estimate of the prediction error for this example, R_i^{(−i)} ≈ R_i / (1 − h_ii), does not vanish either.
even if Z is singular. In the latter case, however, the leverages are not computed accurately, hence should not be used (e.g., for computing the quantity Ep defined in the next section), because they do not comply with their basic properties 2.3 and 2.4. Moreover, some estimates of the confidence intervals on the model output (Seber & Wild, 1989) rely on the assumption that Z has full rank. Under this assumption, a confidence interval on the model output for input x can be easily computed as
\[
E(Y_p \mid x) \in f(x, \theta_{LS}) \pm t_{\alpha}^{N-q}\, s\, \sqrt{z^T (Z^T Z)^{-1} z},
\tag{2.6}
\]
where
\[
z = \left. \frac{\partial f(x, \theta)}{\partial \theta} \right|_{\theta = \theta_{LS}},
\]
t_α^{N−q} is the realization of a random variable with a Student's t-distribution with N − q degrees of freedom and a level of significance 1 − α, and s is an estimate of the residual standard deviation of the model:
\[
s = \sqrt{ \frac{1}{N-q} \sum_{i=1}^{N} R_i^2 }.
\]
This is an additional motive for rejecting models with rank deficiency.

All theoretical material presented in sections 3 and 4 will be illustrated graphically on a small problem with one input and one output. The training data were generated using the function y = sin x / x and an additive gaussian noise of variance σ² = 10⁻². Fifty examples were drawn, with a uniform distribution of x within the interval [0, 15]. Throughout this article, this data set will be referred to as the sin x / x problem. The application of our method to the selection of more complex, multivariable models will be demonstrated in section 5.

3 Selection of a Model Among Models Having the Same Architecture
For clarity, we consider that once the number of inputs and outputs is chosen, model selection is performed in two steps:

1. An architecture is chosen (i.e., a family of functions having the same complexity) within which the model is sought (e.g., the family NN_3 of neural networks with five inputs, three hidden neurons, and a linear output neuron). If the model is nonlinear with respect to the parameters, several trainings with different parameter initializations are performed, thereby generating various models of the same architecture
(e.g., for neural nets, a set of K models with three hidden neurons {NN_3^k, k = 1 to K}). For this family of functions, the most appropriate model (e.g., model NN_3^opt in family NN_3) is selected, as described below.
2. The previous step is performed for different architectures (i.e., for different families of parameterized functions, having different complexities, such as families NN_i of neural networks having the same number of inputs and outputs but different numbers i of hidden neurons). This results in a set of models (e.g., models {NN_i^opt, i = 0 to I}). The most appropriate model among these (e.g., model NN^opt) is selected as described in section 4.

In this section, we first propose a criterion for the selection of a model among models having the same complexity, and we subsequently compare our approach with the traditional use of the leave-one-out technique. The choice among models of different complexities is addressed in section 4.

3.1 A Selection Criterion. A preliminary screening of the models corresponding to the different minima must be performed by checking the rank of the corresponding Jacobian matrices. Models with rank deficiency are overfitted and should therefore be discarded.² However, the fact that for a given architecture the global minimum has rank deficiency does not mean that the considered architecture is too large; local minima may provide very good models,³ provided they are not rank deficient. Therefore, an additional selection criterion must be found. To this end, we use the results presented in the previous section. In the spirit of leave-one-out cross-validation, we define the quantity Ep as
\[
E_p = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \frac{R_i}{1 - h_{ii}} \right)^2 },
\tag{3.1}
\]
which is identical to the leave-one-out score except for the fact that the summation is performed on the estimated modeling errors given by relation 2.5, instead of being performed on the actual prediction error on each left-out example. This quantity can be compared to the traditional training mean squared error (TMSE):
\[
\mathrm{TMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} R_i^2 }.
\tag{3.2}
\]

² Moreover, a rank deficiency of Z has an adverse effect on the efficiency of some second-order training algorithms (for instance, the Levenberg-Marquardt algorithm). To overcome this, Zhou & Si (1998) suggested a training algorithm that guarantees that the Jacobian matrix is of full rank throughout training.
³ It is well known that a suboptimal model, whose parameters do not minimize the cost function, may have a smaller generalization error than the global minimum. This is the basis of the "early stopping" regularization method, whereby training is stopped before a minimum is reached.

Figure 1: Distribution of the solutions obtained with networks having four hidden neurons. Models lying outside the frame of the graph are models with rank deficiency, for which only the TMSE can be computed. Note that such is the case for the model with the smallest value of the training cost function.
Ep is larger than TMSE; it penalizes models that are strongly influenced by some examples. Therefore, this quantity is an appropriate choice criterion between models having the same architecture but corresponding to various local minima of the cost function. Note that if all examples have the same leverage q/N, then E_p = [N / (N − q)] TMSE; hence, Ep and TMSE are equal in the large-sample limit for a model without overfitting.

To illustrate this, consider the case of a neural network with four sigmoidal hidden neurons and a linear output neuron, trained on the data set indicated in section 2. Starting from 500 different weight initializations and using the Levenberg-Marquardt algorithm leads to various minima, represented in Figure 1 as follows: the solutions without rank deficiency are plotted in the (TMSE, Ep) plane, using a logarithmic scale for the ordinates, and the solutions with rank deficiency (rank(Z) < q), for which Ep cannot be computed reliably (see section 2), are plotted outside the frame of the graph. The analysis of this plot leads to several comments:

Even for this simple architecture, a very large number of different minima of the cost function are found.
1488
Gaétan Monari and Gérard Dreyfus
Figure 2: Outputs of models having four hidden neurons, selected on the basis of TMSE and Ep .
For this example, about 70% of the solutions have rank deficiency, so that they may be discarded without further consideration. This includes the deepest minimum found, shown as a gray dot in Figure 1, which is likely to be a global minimum.

For solutions without rank deficiency, the ratio of Ep to TMSE varies from almost 1 to very high values, hence the choice of a logarithmic y-scale. This shows that some minima correspond to solutions that are over-influenced by a few training examples. As expected, this overfitting is not apparent on TMSE, but it is on Ep. The solution with the smallest Ep is shown as a gray triangle.

The model outputs corresponding to the minima having, respectively, the smallest TMSE and the smallest Ep are shown in Figure 2. Note that it is not claimed that the model with the smallest Ep is not overfitted. It is claimed that among the various minima found with the weight initializations used, it is the model that, for the considered architecture, provides the best trade-off between accuracy and uniform influence of the training examples. This point will be addressed further in the next section.

Starting from a linear model and increasing gradually the number of hidden neurons, one obtains the results shown in Figure 3. It appears that Ep is essentially constant between two and four neurons. Therefore, an additional step is required to perform the final model selection, that is, to choose the appropriate number of hidden neurons; this is addressed in section 4. In this particular case, where the noise level is known, it may be inferred that models with more than three hidden neurons are probably overfitted, since their TMSE is smaller than the standard deviation of the noise.
Figure 3: Evolution of the optimal Ep and of the corresponding training mean square error as a function of the number of hidden neurons. s is the standard deviation of the noise.
The question that arises naturally is whether Ep (or the standard leave-one-out score Et as defined below) is a good estimate of the true generalization error of the model. Under the hypothesis of error stability introduced in their article, sanity-check bounds for the leave-one-out error have been derived by Kearns and Ron (1997). Obtaining narrower bounds would require some additional stability properties of the training algorithm or cost function.

3.2 Comparison with the Standard Leave-One-Out Procedure. Since our approach relies on the first principles of the leave-one-out procedure, a comparison to the original procedure is of interest. For models that are nonlinear with respect to their parameters, the most popular version of the leave-one-out, called generalized leave-one-out (Moody, 1994), consists of first finding a minimum of the cost function, with weights W, through training with the whole training set. The N subsequent trainings with one left-out example start with the set of initial weights W. Then, using the resulting N prediction errors on left-out examples, an estimation of the generalization error of the model is computed as

E_t = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( R_i^{(-i)} \right)^2}.   (3.3)
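The generalized leave-one-out loop can be sketched as follows. This is a sketch: `train` and `predict` are placeholders for the user's model and training algorithm, not functions from the paper.

```python
import numpy as np

def generalized_loo(train, predict, X, y):
    """Generalized leave-one-out score E_t of relation 3.3: train once on
    the whole set to obtain the weights W, then perform N retrainings with
    one example left out, each warm-started at W.  `train(X, y, w0)` must
    return trained weights (w0 = initial weights, or None); `predict(w, X)`
    must return model outputs."""
    w_full = train(X, y, None)                    # full-set minimum (weights W)
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        w_i = train(X[keep], y[keep], w_full)     # warm start at W
        errs[i] = y[i] - predict(w_i, X[i:i + 1])[0]
    return np.sqrt(np.mean(errs ** 2))
```

For a model that is linear in its parameters, the left-out residual is exactly R_i / (1 − h_ii), so Et coincides with the PRESS-style score; this provides a cheap sanity check of the routine.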
This method assumes that the withdrawal of one example from the training set does not result in a jump from one minimum of the cost function to another one. We formalize this as follows (see appendix B for more details): consider a training procedure using a deterministic minimization algorithm;⁴ assume that a minimum of the cost function has been found

⁴ This excludes such algorithms as simulated annealing and probabilistic hill climbing, which may overcome local minima.
Figure 4: Application of relation 2.5 to a neural model of sin(x)/x with two hidden neurons, whose output is shown in Figure 7.
with a vector of parameters θLS. Assume that example i is withdrawn from the training set and that a training is performed with θLS as initial values of the parameters, leading to a new parameter vector θLS^(−i). Further assume that example i is subsequently reintroduced into the training set and that a new training is performed with initial values of the parameters θLS^(−i). In the following, the minimum of the cost function corresponding to a parameter vector θLS will be said to be stable for the usual leave-one-out procedure if and only if, for each i ∈ [1, . . . , N], starting from θLS, the procedure described above retrieves θLS. It has been known since Breiman (1996) that this is a major stability problem. In practice, it turns out that if an overparameterized model is used, most minima of the cost function are unstable with respect to the leave-one-out procedure. Therefore, for all such minima, the computation of the leave-one-out score Et and its comparison to Ep are meaningless.

For minima that are stable with respect to the leave-one-out procedure, the validity of approximating Et by Ep depends on the validity of the Taylor expansions used to derive relation 2.5. Checking this validity can be performed by estimating the curvature of the cost function (see Antoniadis, Berruyer, & Carmona, 1992); alternatively, one can perform the leave-one-out procedure for a model selected on the basis of Ep and compare Et and Ep. This is illustrated in Figure 4, for an approximation of sin(x)/x with two hidden neurons, for the minima that are stable and with full rank. The time required for the computation of the quantities {hii} from relation 2.2 and of the leave-one-out score Ep at the end of each training is
negligibly small as compared to the time required for training; therefore, our procedure divides the computation time by N, as compared to the standard leave-one-out, where N is the number of examples.

3.3 Conclusion. We have shown in this section that Ep is an efficient basis for selecting a model within a family of parameterized functions; furthermore, it automatically eliminates the solutions that are rank deficient. Once a model has been selected on this basis for several families of parameterized functions, one has to select the appropriate architecture. This is addressed in the next section.

4 Selection of the Optimal Architecture
Having selected, for each architecture (e.g., for each number of hidden neurons), the minimum with the smallest value of Ep, we have to define a procedure for choosing the best architecture. It is known that among models with approximately the same performances, one should favor the model with the smallest architecture. In practice, however, the trade-off between parsimony and performance may be difficult to perform. Referring to Figure 3, the choice between two, three, or four hidden neurons is not easy in the absence of further information. In this section, we show that the leverages may be used as an additional element in the process of choosing a model among candidate models having different architectures.

4.1 Leverage Distribution. In a purely black box modeling context, all data points of the training set are equally important; therefore, we suggest that among models with approximately the same generalization error estimates (Ep), one should select the model for which the influence of the training examples is most evenly shared,⁵ that is, for which the distribution of the leverages hii is narrowest around q/N. For example, the models with two and four hidden neurons of Figure 3 have significantly different leverage distributions, as shown in Figure 5. As expected from relation 2.3, the leverages {hii}i=1,...,N are centered on the corresponding value of q/N. However, the distribution is broader for the model with four hidden neurons. The width of the distribution is due to the fact that, for x ∈ [8; 13], the number of examples is too small given the number of parameters of the model; this results in confidence intervals on the model output that are significantly larger in the corresponding region of input space than elsewhere. In order to characterize the leverage distribution of a given model, it is appropriate to use the mean value of the quantities {√hii}i=1,...,N, since the latter are involved in the computation of the confidence intervals. Indeed,
⁵ Unless it is explicitly desired, for some reason arising from prior knowledge of the modeled process, that one or more examples be of special importance.
Figure 5: Model outputs (top and middle) and leverage distributions (bottom) for the models with two and four hidden neurons selected on the basis of Ep .
Figure 6: Performance indices of the solutions selected on the basis of Ep, including μ (right scale), and tested on a separate data set.
the application of relation 2.6 to example i of the training set yields

E(Y_p \mid x_i) \in f(x_i, \theta_{LS}) \pm t_\alpha^{N-q} \, s \sqrt{h_{ii}}.   (4.1)
Therefore, we define the quantity

\mu = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\frac{N h_{ii}}{q}}.   (4.2)

Using Cauchy's inequality and relations 2.3 and 2.4, one can derive the following properties for μ, regardless of N and q:

\mu \le 1, \qquad \mu = 1 \iff h_{ii} = \frac{q}{N} \quad \forall i \in [1, \ldots, N].   (4.3)
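Relation 4.2 is a one-liner in numpy; the function name `mu` is ours, chosen for illustration.

```python
import numpy as np

def mu(h, q):
    """Normalized leverage-distribution index of relation 4.2:
    mu = (1/N) * sum_i sqrt(N * h_ii / q).  By Cauchy's inequality and
    sum_i h_ii = q (relation 2.3), mu <= 1, with equality if and only
    if every h_ii equals q/N."""
    h = np.asarray(h, dtype=float)
    return np.mean(np.sqrt(len(h) * h / q))
```

For a uniform leverage distribution (h_ii = q/N for all i), `mu` returns exactly 1; any spread in the leverages pushes it below 1.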
Therefore, μ is a normalized parameter that characterizes the distribution of the leverages around their mean value. Regardless of N and q, the closer μ is to 1, the more evenly shared the influences are among the training examples. Hence, μ can be used as an additional indication of the overfitting of a model. To illustrate this, Figure 6 shows, in addition to the curves of Figure 3, the values of μ and an estimate of the generalization error, obtained on a separate representative test set (100 examples). To this end, we define the generalization mean square error (GMSE) as

\mathrm{GMSE} = \sqrt{\frac{1}{\mathrm{card}(\text{Test set})} \sum_{x \in \text{Test set}} \left( y_p|x - f(x, \theta_{LS}) \right)^2}.   (4.4)
The analysis of Figure 6 leads to the following comments: The distinction between the leverage distributions of the two models depicted in Figure 5 can actually be performed by μ (0.98 versus 0.96). The model with the highest value of μ is actually the model with the smallest value of GMSE. For larger architectures, both μ and GMSE get worse. Models with more than two hidden neurons do not reach the performance of that with two hidden neurons. However, the increase of the GMSE turns out to be limited, which demonstrates that selecting the minima on the basis of Ep is an effective means of limiting overfitting.

As a conclusion, considering several architectures with approximately the same values of Ep, the most desirable solution is the solution with the highest value of μ in the context of black box modeling, that is, if there is no reason, arising from prior knowledge, to devote a large fraction of the available degrees of freedom to one or to a few specific training samples.

4.2 Model Selection and Extension of the Training Set. In the previous section, we considered the situation where one wishes to select a model built from a data set that is available once and for all. It is often the case, however, that additional samples may be gathered in order to improve the model. Then the following question must be addressed: In what regions of input space would it be useful to gather new data? We show in the following that the approach to model selection described in the previous section can be helpful in this respect. To this end, one has to define a maximal acceptable value for the confidence interval. Using the expression of that interval (relation 2.6), one can choose, for instance, the following threshold:

t_\alpha^{N-q} \, s \sqrt{z^{T} (Z^{T} Z)^{-1} z} < t_\alpha^{N-q} \, s.   (4.5)

This guarantees that the confidence on the model output f(x, θLS) is not poorer than that on the measured output Yp|x. To start, consider the model with two hidden neurons, which has been selected, using μ, as the best possible model given the available training set (see Figure 6). According to the threshold 4.5, this model should not be used outside the interval x ∈ [−0.3, 16.5] (see Figure 7). In order to use the model outside this input space area, additional data should be gathered. This is not surprising, since the interval [−0.3, 16.5] is roughly the interval from which the training set was drawn. However, if it is desirable to improve the model within this interval, the confidence interval on the model with two hidden neurons does not provide any information for detecting where additional examples would be required. To that effect, we have to choose slightly more overfitted models,
Figure 7: Application of threshold 4.5 to the confidence interval on the model output (two hidden neurons).

Figure 8: Application of threshold 4.5 to the confidence interval on the model output (four hidden neurons).
such as the model with four hidden neurons shown in Figure 5, and consider their confidence intervals. Then the application of threshold 4.5 shows areas of input space, both within and outside the interval x ∈ [−0.3, 16.5], where additional examples would be helpful (see Figure 8): the large confidence interval between 9 and 11 shows that the corresponding hii are large, that is, that a large fraction of the degrees of freedom of the four-hidden-neuron network was used to fit the training examples within this interval. Therefore, gathering new data in this area would indicate whether the wiggle of the output in this area is significant or whether it is an artifact due to the small number of samples. Hence, considering slightly oversize models for selection is a means for detecting areas of input space where gathering additional data would be appropriate. Once additional examples are gathered, the selection procedure is performed again.
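Since the factor t_α^(N−q) s appears on both sides, threshold 4.5 reduces to z^T (Z^T Z)^{-1} z < 1, which is easy to evaluate on a grid of query points. The following is a sketch (the function name `within_threshold` is ours):

```python
import numpy as np

def within_threshold(Z, z_query):
    """Threshold 4.5 reduces to z^T (Z^T Z)^{-1} z < 1: where it holds,
    the confidence interval on the model output is no wider than that on
    the measured output.  Z is the N x q training Jacobian; z_query holds
    the gradient vectors z at the query points, one per row."""
    A = np.linalg.inv(Z.T @ Z)
    g = np.einsum('ij,jk,ik->i', z_query, A, z_query)
    return g < 1.0
```

Query points where this returns False mark the regions of input space where gathering additional data would be appropriate.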
Figure 9: Performances of the candidate models with one to five hidden neurons.
4.3 Discussion. For didactic reasons, the selection method discussed in sections 3 and 4 was split into two phases. For a given training set, architectures of increasing sizes were considered:

1. For each architecture, perform several trainings, starting with various random weights.
2. For each model, compute the leverages {hii}i=1,...,N of the training examples and check the validity of relations 2.2 and 2.3. Discard the model in case of rank deficiency; otherwise, compute Ep.
3. Keep the model with the smallest value of Ep, and compute its parameter μ.
4. Among the candidate models exhibiting a satisfactory Ep, select the model with the highest value of μ.

Thus, the method relies essentially on two quantities that can be compared for a given problem, regardless of the model size: Ep, which is an estimate of the generalization error, and μ, an index of the degree of overfitting of the model. Hence, one should consider, for various architectures, all candidate models as pairs (Ep, μ), and perform model selection directly on the basis of the position of each model in the Ep–μ plane. For instance, on the sin(x)/x problem, the selection should be performed by considering Figure 9, which represents about 1000 candidate models for five different architectures. According to the required trade-off between Ep and μ, the final model should be selected within the indicated area. Models outside this area should not be chosen, for one can always select a better
one (i.e., with a smaller Ep and a higher μ). Therefore, architectures with one and five hidden neurons can definitely be discarded. As explained in section 4.2, the model designer should perform his final choice depending on his objectives. If he wants the best possible model given this particular training set, he should favor, within this dotted area, the solutions with the highest possible value of μ. Conversely, if he has the opportunity of gathering additional training examples and wants to select the most relevant ones, he should favor slightly overfitted models; within this dotted area, he should favor Ep. The confidence intervals for the prediction of such a model indicate the areas, in input space, where gathering data would be desirable. With this new data, the selection procedure may be performed again, which may lead to less complex models.

5 Validation on Numerical Examples
In this section, we illustrate the use of our method on two academic examples. We demonstrate on a teacher-student problem that the method we propose gives very good results, in contrast to the usual leave-one-out approach. For comparison purposes, we model a simulated process investigated previously in Anders and Korn (1999), and we outline briefly an industrial application.

5.1 Comparison to the Standard Leave-One-Out Approach on a Teacher-Student Problem. We consider the following problem: a database
of 800 examples is generated by a neural network with five inputs, one hidden layer of five neurons with sigmoidal (tanh) nonlinearity, and a linear output neuron:

E(Y_p \mid x) = a_0 + \sum_{i=1}^{5} a_i \tanh\left( b_{i0} + \sum_{j=1}^{5} b_i^j x_j \right).   (5.1)

Thus, we guarantee that the family of parameterized functions, in which we search for the model, actually contains the regression function. The weights and the inputs of this teacher network are uniformly distributed in [−1, +1] and [−3, +3] respectively, in order to use the sigmoids in their nonlinear zone. A gaussian noise is added to the output, with a standard deviation equal to 20% of the unconditional standard deviation of the model output: s² = 0.05. Student networks with five inputs are investigated.

We show how all the above results can be successfully used to perform model selection. To this end, 300 examples are used to perform both training and model selection as described in the previous sections, and 500 examples are devoted to the tests. All results reported below were obtained by estimating the gradient of the cost functions by backpropagation, and minimizing the cost functions by the BFGS algorithm (Press et al., 1988). The values of {hii}i=1,...,N are
Figure 10: Performances of the solutions selected on the basis of Ep, including μ (right scale), and tested on a separate data set (teacher-student problem).
computed by singular value decomposition of matrix Z, as presented in appendix A. For increasing architecture sizes (from one to nine hidden neurons), the minimum of the cost function with the smallest value of Ep was selected, and the following values were computed: TMSE and μ on the training set and GMSE on the separate test set. These results are summarized in Figure 10. As on the sin(x)/x problem, Ep decreases monotonically and reaches approximately the value of the standard deviation of the noise. Based on the values of μ, the optimal solution appears to be the model having five hidden neurons. Indeed, this proves to be the model for which the generalization error GMSE is smallest. Remarkably, the weights of this network are almost identical to those of the teacher network (they should not be expected to be strictly identical, since noise is added to the output of the teacher network during generation of the data). Actually, the weights of this network are identical to those obtained when training is performed with initial weights equal to the teacher's weights: this demonstrates that the selection method finds the best possible model.

By contrast, the generalized leave-one-out approach appears to give very poor results: If leave-one-out is performed on the minima with the smallest value of TMSE, without checking Z's rank or the stability of the minima for the usual leave-one-out, the error Et decreases as hidden neurons are added and becomes significantly smaller than the standard deviation of the noise. Indeed, the global minima of the cost functions corresponding to architectures having five (i.e., the teacher network architecture) to nine hidden neurons have a rank deficiency and are therefore overfitted solutions with poor GMSE.
If leave-one-out is performed on the minima with the smallest value of TMSE among the minima without rank deficiency, without checking that the minima are stable for the usual leave-one-out, the error Et reaches a plateau close to the standard deviation of the noise. However, both Ep and GMSE prove that the performances of these models are worse than expected from Et. In fact, the corresponding minima appear to be unstable for the usual leave-one-out, which makes Et invalid.

If the procedure is restricted to the minima with the smallest value of TMSE among the minima without rank deficiency and stable for the usual leave-one-out, almost all minima, from five to nine hidden neurons, are rejected. Furthermore, the procedure is extremely lengthy.

This exemplifies the limitations of the conventional generalized leave-one-out. When the number of training examples is small given the complexity of the family of parameterized functions, the withdrawal of some training examples makes the minima of the cost function unstable for the usual leave-one-out. By contrast, performing leave-one-out on the basis of relation 2.5 prevents these stability problems. Furthermore, the computational overhead of this selection method is negligible, since we only have to compute the {hii}i=1,...,N for each minimum of the cost function.

5.2 Comparison to Other Selection Methods on a Benchmark Problem.
Anders and Korn (1999) introduce a two-input process, simulated with the following regression function:

E(Y_p \mid x) = -0.5 + 0.2 x_1^2 - 0.1 \exp(x_2).   (5.2)
The inputs x1 and x2 are drawn from a normal distribution. Gaussian noise is added to the output, with a standard deviation equal to 20% of the unconditional standard deviation of the model output: s² = 5 × 10⁻³. To perform statistically significant comparisons, 1000 training sets of 500 examples each were generated with different realizations of the noise (the inputs remaining unchanged). A separate set of 500 samples was used for computing the GMSE. The following performance index was used:

r = \frac{\mathrm{GMSE}^2 - s^2}{s^2} \times 100\%.   (5.3)
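The benchmark setup and the performance index above can be sketched in numpy. This is a sketch under the stated assumptions (noise at 20% of the unconditional output standard deviation); the function names are ours.

```python
import numpy as np

def simulate(n, rng):
    """Benchmark process of relation 5.2: inputs drawn from a normal
    distribution, gaussian noise added at 20% of the unconditional
    standard deviation of the model output."""
    x = rng.normal(size=(n, 2))
    clean = -0.5 + 0.2 * x[:, 0] ** 2 - 0.1 * np.exp(x[:, 1])
    noisy = clean + rng.normal(0.0, 0.2 * clean.std(), n)
    return x, noisy, clean

def perf_index(gmse, s2):
    """Performance index of relation 5.3: r = (GMSE^2 - s^2) / s^2 * 100%."""
    return (gmse ** 2 - s2) / s2 * 100.0
```

A perfect model (one that recovers the regression function exactly) has GMSE close to the noise standard deviation, so its r is close to 0%; r = 100% corresponds to a squared error twice the noise variance.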
Using several model selection techniques (hypothesis testing, information criteria, and 10-fold cross-validation, each of them being followed, if necessary, by network pruning), the best performance (averaged over the 1000 training sets) reported in Anders and Korn (1999) was r = 28% (standard deviation not indicated). The authors state that these (relatively) poor performances arise from the complexity of the true regression function, equation 5.2.
Table 1: Summary of Numerical Results on a Benchmark Problem.

                                          LOCL method
                       Anders and      Outliers        Outliers     Weight
                       Korn (1999)     not removed     removed      decay
500 training samples
  Average r            28%             3%              3%ᵃ          2%
  S.D.                 not indicated   2%              2%           2%
100 training samples
  Average r            —               126%            27%ᵇ         54%
  S.D.                 —               632%            28%          36%

Notes: ᵃ No outlier detected. ᵇ 3% outliers detected.
Under the same conditions, we performed model selection according to the present method (on the basis of Ep and μ, as described in sections 3 and 4) and the Bayesian approach to weight decay⁶ described in MacKay (1992). In both cases, the performance indices were very satisfactory. The results reported in this section are summarized in Table 1, where the present method is referred to as LOCL (local overfitting control via leverages). Hence, this academic problem can be accurately and consistently solved by our method as well as by weight decay.

As Rivals and Personnaz (2000) proposed, a reduction of the size of the training set from 500 to 100 samples is of interest. Unlike them, we simulated the 1000 new training sets with different values of the noise and of the inputs; because of the decrease of the training set size, this was necessary to obtain statistically significant results.⁷ Our method gave a value of r = 126%, whereas using weight decay,⁸ one obtains r = 54%. This leads to the following comments: In a problem with such a small number of training examples, knowing the true noise level (see note 6) is a tremendous advantage, which explains the superiority of our implementation of weight decay. The standard deviation associated with our r shows the large scattering of the results. However, advantage can be taken of the fact that our method is local, whereas alternative methods rely on global scores. We can show that the

⁶ Due to the difficulties encountered while implementing this approach, we made the assumption that the noise level (meta-parameter β in the cost function) was known and set it to its true value. Without this strong assumption, the performances of the models selected on the basis of the "evidence" would certainly have been worse.
⁷ Otherwise, the results depend very strongly on the location of the training examples in input space.
⁸ Keeping the inputs unchanged and using only Ep followed, if necessary, by network pruning, Rivals and Personnaz (2000) obtained r = 140% (S.D. not given).
Figure 11: The welding process. Two electrodes are pressed against the metal sheets, and a strong electrical current is flown through the assembly.
poor results reported above are due to only a few points of the test set. Just as in section 4.2, one can define a threshold that the confidence interval on the model output should not exceed. Choosing threshold 4.5, one obtains the following results: the confidence intervals of 3% (S.D. 2%) of the test examples were found to be above threshold; these examples were discarded; on the remaining 97% of the examples, an average model performance r = 27% was obtained. This is comparable to the performances obtained by Anders and Korn (1999) using a training set that was five times as large.

5.3 An Industrial Application. This method was used in a large-scale industrial application: the modeling of the spot welding process. The quantity to be predicted was the diameter of the weld as a function of physical parameters measured during the process. Because of the relatively small number of examples at the beginning of the project and the economic and safety issues involved, model selection was a critical point.

Spot welding is the most widely used welding process in the car industry; millions of such welds are made every day. Two steel sheets are welded together by passing a very large current (tens of kiloamperes) between two electrodes pressed against the metal surfaces, typically for a hundred milliseconds (see Figure 11). The heating thus produced melts a roughly cylindrical region of the metal sheets. After cooling, the diameter of the melted zone (typically 5 mm) characterizes the effectiveness of the process. A weld spot whose diameter is smaller than 4 mm is considered mechanically unreliable; therefore, the spot diameter is a crucial element in the safety of a vehicle. At present, no fast, nondestructive method exists for measuring the spot diameter, so there is no way of assessing the quality of the weld immediately after welding. Therefore, a typical industrial strategy consists of using an intensity that is much larger than actually necessary, which results in excessive heating, which in turn leads to the ejection of steel droplets from the welded zone (hence the "sparks" that can be observed during each welding by robots on automobile assembly chains), and in making a much larger number of welds than necessary, to be sure that a sufficient number of valid spots are produced. Both the excessive current and the excessive number of spots result in a fast degradation of the electrodes, which must be changed or redressed frequently. For all these reasons, the modeling of the process, leading to a reliable on-line prediction of the weld diameter, is an important industrial challenge.

Modeling the dynamics of the welding process from first principles is a very difficult task, because the computation time necessary for the integration of the partial differential equations of the knowledge-based model is many orders of magnitude larger than the duration of the process (which precludes real-time prediction of the spot diameter), and because many physical parameters appearing in the equations are not known reliably. These considerations led to considering black box modeling as an alternative. Since the process is nonlinear and has several input variables, neural networks were natural candidates. The main goal was to predict the spot diameter from measurements performed during the process, immediately after weld formation, for on-line quality control.

The main concerns for the modeling task were the choice of the model inputs and the limited number of examples available initially in the database. The quantities that are candidates for input selection are mechanical and electrical signals that can be measured during the welding process. Input selection was performed by forward stepwise selection of polynomial models, whereby the significance of adding a new input or removing a previously selected input is tested by statistical tests. The variables thus selected were subsequently used as inputs to neural networks.
The experts involved in the knowledge-based modeling of the process validated this set. Because no simple nondestructive weld diameter measurement exists, the database is built by performing a number of welds in well-defined conditions and subsequently tearing them off. The melted zone, remaining on one of the two sheets, is measured. This is a lengthy and costly process, so that the initial training set was made of 250 examples. Using the confidence interval estimates 2.6 and the training set extension strategy discussed in section 4.2, further experiments were performed, so that, finally, a much larger database became available, half of which was used for training and half for testing (since the selection method does not require any validation set). Model selection was performed using the full procedure discussed in section 4. Typical results are displayed in Figure 12, which shows the scatter plots for spot diameter prediction on the training set and on the completely independent test set (for a certain type of steel sheet), together with the confidence intervals. The leave-one-out score Ep is equal to 0.27, while the TMSE is 0.23. The full description of this application is beyond the scope of this paper; nonconfidential data can be found in Monari (1999).
Figure 12: Scatter plots with confidence intervals for diameter prediction. (Left) Training set. (Right) Test set.
6 Conclusion
A model selection method relying on local overfitting control via leverages, termed the LOCL method, has been presented. It is based on computing the leverage of each example on the candidate models. The leverage of a given example can be regarded as an indication of the percentage of the degrees of freedom of the model that is used to fit the example. This allows a precise monitoring of overfitting, since the method relies on local indicators in sample space (the leverages) instead of on global indicators such as a cross-validation score or the value of a penalty function. Although it is similar in spirit to the generalized leave-one-out procedure, it is very economical in computation time, and it avoids the stability problems inherent in leave-one-out. Furthermore, the values of the leverages (or of the confidence intervals computed therefrom) give an indication of areas, in input space, where new examples are needed. This method has been validated in a large-scale industrial application.

Appendix A: Computation of the Leverages
This appendix shows how the values of the all-important quantities {hii}i=1,...,N can be computed without matrix inversion, by making use of the singular value decomposition (see, e.g., Press et al., 1988) of matrix Z, which can always be performed, even if matrix Z is singular. Z can be written as

Z = U W V^{T},   (A.1)
where U is a ( N, q) column orthogonal matrix (i.e., U T U D I), W is a ( q, q) diagonal matrix with positive or zero elements (the singular values of Z), and V is a (q, q) orthogonal matrix (i.e. VT V D VV T D I). We thus obtain ( ZT Z ) ¡1 D ( VWU T UWV T ) ¡1 D ( VW2 V T ) ¡1 D VW ¡2 VT .
(A.2)
Therefore, the elements of this ( q, q) matrix can easily be computed as ( ZT Z ) lj¡1 D
q X VlkVjk kD 1
2 W kk
.
(A.3)
From equation A.3 and hii D ziT ( ZT Z) ¡1 zi , one gets 12 q 1 X @ hii D Zij VjkA . W kk jD1 kD 1 q X
0
(A.4)
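Equation A.4 translates directly into a few lines of linear algebra. The following NumPy sketch is illustrative only (the function name and the rank tolerance are not from the paper):

```python
import numpy as np

def leverages(Z):
    """Leverages h_ii = z_i^T (Z^T Z)^{-1} z_i via SVD (equation A.4).

    Uses Z = U W V^T, so no explicit matrix inversion is needed; zero
    singular values are skipped, which also handles a rank-deficient Z.
    """
    U, w, Vt = np.linalg.svd(Z, full_matrices=False)  # w: singular values W_kk
    nonzero = w > 1e-12 * w.max()
    # (Z V)_{ik} / W_kk, summed in quadrature over k, as in equation A.4
    ZV = Z @ Vt.T[:, nonzero]
    return np.sum((ZV / w[nonzero]) ** 2, axis=1)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 4))   # N = 50 examples, q = 4 parameters
h = leverages(Z)
```

Since Σ_i h_ii is the trace of the hat matrix Z(Z^T Z)^{-1}Z^T, the leverages of a full-rank model sum to q, the number of parameters, and each h_ii lies in [0, 1].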
Appendix B: Stability of a Model for the Usual Leave-One-Out Procedure
Consider a training procedure using a deterministic minimization algorithm of the cost function;⁹ assume that a minimum of the cost function has been found, with a vector of parameters θ_LS. Assume that example i is withdrawn from the training set and that a training is performed with θ_LS as initial values of the parameters, leading to a new parameter vector θ_LS^(−i). Further assume that example i is subsequently reintroduced into the training set and that a new training is performed with initial values of the parameters θ_LS^(−i). The minimum of the cost function corresponding to a parameter vector θ_LS is said to be stable for the usual leave-one-out procedure if and only if, for each i ∈ [1, …, N], starting from θ_LS, the procedure described above retrieves θ_LS. This definition is illustrated in Figure 13. In the unstable case shown in that figure, after the removal of example i (and convergence to θ_LS^(−i)), the replacement of the same example into the training set makes the learning procedure converge to another solution with parameters θ*_LS. In such a case, E_t is not an estimate of the generalization performance of the model with parameters θ_LS, since the prediction error R_i^(−i) actually corresponds to the model with parameters θ*_LS.

This definition is different from the stability considerations introduced by other authors in order to derive bounds on the estimate of the generalization performance (see Vapnik, 1982). For instance, bounding the difference

⁹ This excludes such algorithms as simulated annealing and probabilistic hill climbing, which may overcome local minima.
Figure 13: Minima that are stable (left) and unstable (right) for the usual leave-one-out procedure.
between the true error and the leave-one-out estimate thereof (with an arbitrary accuracy at a given level of significance) requires some stability of the training algorithm regardless of the available data set (Kearns and Ron, 1997). Therefore, this stability does not depend on the considered minimum of the cost function. Our definition of stability is less stringent than the alternative one. It should be remembered that, in contrast to the usual leave-one-out, the validity of our approach does not depend on any stability condition.

References

Anders, U., & Korn, O. (1999). Model selection in neural networks. Neural Networks, 12, 309–323.
Antoniadis, A., Berruyer, J., & Carmona, R. (1992). Régression non linéaire et applications. Paris: Economica.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350–2383.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Kearns, M., & Ron, D. (1997). Algorithmic stability and sanity-check bounds for leave-one-out cross validation. In Tenth Annual Conference on Computational Learning Theory (pp. 152–162). New York: Association for Computing Machinery Press.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Monari, G. (1999). Sélection de modèles non linéaires par leave-one-out: Étude théorique et application des réseaux de neurones au procédé de soudage par points. Thèse de l'Université Paris 6. Available on-line: http://www.neurones.espci.fr/Francais.Docs/publications.html.
Monari, G., & Dreyfus, G. (2000). Withdrawing an example from the training set: An analytic estimation of its effect on a nonlinear parameterized model. Neurocomputing Letters, 35, 195–201.
Moody, J. (1994). Prediction risk and neural network architecture selection. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From statistics to neural networks: Theory and pattern recognition applications. Berlin: Springer-Verlag.
Press, W. H., Teukolsky, S. A., Flannery, B. P., & Vetterling, W. T. (1988). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rivals, I., & Personnaz, L. (2000). A statistical procedure for determining the optimal number of hidden neurons of a neural model. Paper presented at the ICSC Symposium on Neural Computation, Berlin.
Seber, G. A. F., & Wild, C. J. (1989). Nonlinear regression. New York: Wiley.
Sjöberg, J., & Ljung, L. (1992). Overtraining, regularization and searching for minimum in neural networks (Tech. Rep. LiTH-ISY-I-1297). Linköping: Department of Electrical Engineering, Linköping University. Available on-line: http://www.control.isy.liu.se.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36, 111–147.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Zhou, G., & Si, J. (1998). A systematic and effective supervised learning mechanism based on Jacobian rank deficiency. Neural Computation, 10, 1031–1045.

Received October 16, 2000; accepted October 19, 2001.
ARTICLE
Communicated by Stuart Geman
A Monte Carlo EM Approach for Partially Observable Diffusion Processes: Theory and Applications to Neural Networks

Javier R. Movellan
[email protected] Machine Perception Laboratory, Institute for Neural Computation, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Paul Mineiro
[email protected] Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093, U.S.A.
R. J. Williams
[email protected] Department of Mathematics and Institute for Neural Computation, University of California, San Diego, La Jolla, CA 92093, U.S.A.
We present a Monte Carlo approach for training partially observable diffusion processes. We apply the approach to diffusion networks, a stochastic version of continuous recurrent neural networks. The approach is aimed at learning probability distributions of continuous paths, not just expected values. Interestingly, the relevant activation statistics used by the learning rule presented here are inner products in the Hilbert space of square integrable functions. These inner products can be computed using Hebbian operations and do not require backpropagation of error signals. Moreover, standard kernel methods could potentially be applied to compute such inner products. We propose that the main reason that recurrent neural networks have not worked well in engineering applications (e.g., speech recognition) is that they implicitly rely on a very simplistic likelihood model. The diffusion network approach proposed here is much richer and may open new avenues for applications of recurrent neural networks. We present some analysis and simulations to support this view. Very encouraging results were obtained on a visual speech recognition task in which neural networks outperformed hidden Markov models.

Neural Computation 14, 1507–1544 (2002) © 2002 Massachusetts Institute of Technology

1 Introduction

Since Hopfield's seminal work (Hopfield, 1984), continuous deterministic neural networks and discrete stochastic neural networks have been thoroughly studied by the neural network community (Pearlmutter, 1995; Ackley, Hinton, & Sejnowski, 1985). However, the continuous stochastic case has been conspicuously ignored. This is surprising considering the success of continuous stochastic models in other fields (Oksendal, 1998). In this article, we focus on the continuous stochastic case and present a Monte Carlo expectation-maximization (EM) approach for training continuous-time, continuous-state, stochastic recurrent neural network models. The goal is to learn probability distributions of continuous paths, not just equilibrium points. This is important for problems involving sequences, such as speech recognition and object tracking. The approach proposed here potentially opens new avenues for applications of recurrent neural networks, showing results comparable to, if not better than, those obtained with hidden Markov models. In addition, the maximum likelihood learning rules used to train these networks are based on inner products that are computable using local Hebbian statistics. This is an aspect of potential value for neurally plausible learning models and for potential generalizations of kernel methods (Aizerman, Braverman, & Rozoner, 1964; Burges, 1998).

Continuous-time, continuous-state recurrent neural networks (hereafter referred to simply as recurrent neural networks) are dynamical systems consisting of n point neurons coupled by synaptic connections. The strength of these connections is represented by an n × n real-valued matrix w, and the network dynamics are governed by the following differential equations,

dx_j(t)/dt = μ_j(x(t), λ), for t ∈ [0, T], j = 1, …, n,   (1.1)
where

μ_j(x(t), λ) = κ_j( −ρ_j x_j(t) + ξ_j + Σ_{i=1}^{n} φ(x_i(t)) w_ij ),   (1.2)

x_j is the soma potential of neuron j, 1/ρ_j > 0 is the transmembrane resistance,¹ 1/κ_j > 0 is the input capacitance, ξ_j is a bias current, w_ij is the conductance (synaptic weight) from unit i to unit j, and φ is a nonlinear activation function, typically a scaled version of the logistic function

φ(v) = η₁ + η₂ · 1/(1 + e^{−η₃ v}), for v ∈ ℝ,   (1.3)
¹ We allow 1/ρ_j or 1/κ_j to be +∞, corresponding to situations in which ρ_j = 0 or κ_j = 0, respectively.
where η = (η₁, η₂, η₃) ∈ ℝ³ are fixed scale and gain parameters. Here, λ ∈ ℝᵖ represents the w, ξ, κ, and ρ terms, whose values are typically varied in accordance with some learning rules.

1.1 Hidden Units. A variety of algorithms have been developed to train these networks and are commonly known as recurrent neural network algorithms (Pearlmutter, 1995). An important achievement of these algorithms is that they can train networks with hidden units. Hidden units allow these networks to develop time-delayed representations and feature conjunctions. For this reason, recurrent neural networks were expected to become a standard tool for problems involving continuous sequences (e.g., speech recognition) in the same way that backpropagation networks became a standard tool for problems involving static patterns. Recurrent network learning algorithms have proved useful in some fields. In particular, they have been useful for understanding the role of neurons in the brain. For example, when these networks are trained on simple sequences used in controlled experiments with animal subjects, the hidden units act as "memory" neurons similar to those found in neural recordings (Zipser, Kehoe, Littlewort, & Fuster, 1993). While these results have been useful to help understand the brain, recurrent neural networks have yielded disappointing results when applied to engineering problems such as speech recognition or object tracking. In such domains, probabilistic approaches, such as hidden Markov models and Kalman filters, have proved to be superior.

1.2 Diffusion Networks. We believe that the main reason that recurrent neural networks have provided disappointing results in some engineering applications (e.g., speech recognition) is that they implicitly rely on a very simplistic likelihood model that does not capture the kind of variability found in natural signals. We will elaborate on this point in section 8.1.
To overcome this deficiency, we propose adding noise to the standard recurrent neural network dynamics, as would be done in the stochastic filtering and system identification literature (Lewis, 1986; Ljung, 1999). Mathematically, this results in a diffusion process, and thus we call these models diffusion neural networks, or diffusion networks for short (Movellan & McClelland, 1993). While recurrent neural networks are defined by ordinary differential equations (ODEs), diffusion networks are described by stochastic differential equations (SDEs). Stochastic differential equations provide a rich language for expressing probabilistic temporal dynamics and have proved useful in formulating continuous-time inference problems, as, for example, in the continuous Kalman-Bucy filter (Kalman & Bucy, 1961; Oksendal, 1998). Diffusion networks can be interpreted as a low-power version of recurrent networks, in which the thermal noise component is nonnegligible. The temporal evolution of a diffusion network with vector parameter λ defines
an n-dimensional stochastic process X^λ that satisfies the following SDE:²

dX^λ(t) = μ(X^λ(t), λ) dt + σ dB(t), t ∈ [0, T],   (1.4)
X^λ(0) ∼ ν.   (1.5)
Here μ = (μ₁, …, μₙ)′ is called the drift vector. We use the same drift as recurrent neural networks, but the learning algorithm presented here is general and can be applied to other drift functions; B is a standard n-dimensional Brownian motion (see section 2), which provides the random driving noise for the dynamics; σ > 0 is a fixed positive constant called the dispersion, which governs the amplitude of the noise; T > 0 is the length of the time interval over which the model is used; and ν is the probability distribution of the initial state X^λ(0). We regard T > 0 and ν as fixed henceforth.

1.3 Relationship to Other Models. Figure 1 shows the relationship between diffusion networks and other approaches in the neural network and stochastic filtering literature. Diffusion networks belong to the category of partially observable Markov models (Campillo & Le Gland, 1989), a category that includes standard hidden Markov models and Kalman filters as special cases. Standard hidden Markov models are defined in discrete time, and their internal states are discrete. Diffusion networks can be viewed as hidden Markov models with continuous-valued hidden states and continuous-time dynamics. The continuous-time nature of the networks is convenient for data with dropouts or variable sample rates, since continuous-time models define all of the finite-dimensional distributions. The continuous-state representation (see Figure 2) is well suited for problems involving inference about continuous unobservable quantities, such as object tracking tasks, and modeling of cognitive processes (McClelland, 1993; Movellan & McClelland, 2001).
If the logistic activation function φ is replaced by a linear function, the weights between the observable units and from the observable to the hidden units are set to zero, the ρ parameters are set to zero, and the probability distribution of the initial states is constrained to be gaussian, diffusion networks have the same dynamics underlying the continuous-time Kalman-Bucy filter (Kalman & Bucy, 1961). If, on the other hand, the weight matrix w is symmetric, at stochastic equilibrium, diffusion networks behave like continuous-time, continuous-state Boltzmann machines (Ackley et al.,
² We use X^λ to make explicit the dependence of the solution process on the parameter λ. The following assumptions are sufficient for existence and uniqueness in distribution of solutions to equations 1.4 and 1.5: ν has bounded support, and μ(·, λ) is continuous and satisfies a linear growth condition |μ(u, λ)| ≤ K_λ(1 + |u|), for some K_λ > 0 and all u ∈ ℝⁿ, where |·| denotes the Euclidean norm (see, e.g., proposition 5.3.6 in Karatzas & Shreve, 1991). These assumptions are satisfied by the recurrent neural network drift function.
Figure 1: Relationship between diffusion networks and other approaches in the neural network and stochastic filtering literature.
Figure 2: In stochastic differential equations (left), the states are continuous and the dynamics are probabilistic: given a state at time t, there is a distribution of possible states at time t + dt. In ordinary differential equations (center), the states are continuous, and the dynamics are deterministic. In hidden Markov models (right), the hidden states are discrete and probabilistic. This is represented in the figure by partitioning a continuous state space into four discrete regions and assigning equal transition probability to states within the same region.
1985). Finally, if the dispersion constant σ is set to zero, the network becomes a standard deterministic recurrent neural network (Pearlmutter, 1995). In the past, algorithms have been proposed to train expected values and equilibrium points of diffusion networks (Movellan, 1994, 1998). Here, we present a powerful and general algorithm to learn distributions of trajectories. The algorithm can be used to train diffusion networks with hidden units, thus maintaining the versatility of the recurrent neural network approach. As we will illustrate, the algorithm is sufficiently fast to train networks with thousands of parameters using current computers and fares well when compared to other probabilistic approaches, such as hidden Markov models.
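Equations 1.2 through 1.5 can be simulated directly with the Euler-Maruyama scheme made precise in section 2.1. The sketch below is a minimal illustration, not code from the paper; the helper names (`drift`, `simulate`) and all parameter values are arbitrary:

```python
import numpy as np

def drift(x, w, xi, kappa, rho, eta=(0.0, 1.0, 1.0)):
    """Recurrent-network drift of equation 1.2, vectorized over units."""
    h1, h2, h3 = eta
    phi = h1 + h2 / (1.0 + np.exp(-h3 * x))      # logistic activation, eq. 1.3
    return kappa * (-rho * x + xi + w.T @ phi)   # sum_i phi(x_i) w_ij, eq. 1.2

def simulate(w, xi, kappa, rho, sigma, x0, T=1.0, dt=1e-3, rng=None):
    """One Euler-Maruyama sample path of the diffusion network SDE, eq. 1.4."""
    rng = np.random.default_rng() if rng is None else rng
    n_steps = int(round(T / dt))
    x = np.empty((n_steps + 1, len(x0)))
    x[0] = x0
    for k in range(n_steps):
        noise = sigma * np.sqrt(dt) * rng.standard_normal(len(x0))
        x[k + 1] = x[k] + drift(x[k], w, xi, kappa, rho) * dt + noise
    return x

# Two-unit network with arbitrary parameters; setting sigma = 0 recovers the
# deterministic recurrent network of equation 1.1.
w = np.array([[0.0, 1.0], [-1.0, 0.0]])
path = simulate(w, xi=0.1, kappa=1.0, rho=1.0, sigma=0.5,
                x0=np.zeros(2), rng=np.random.default_rng(0))
```

Repeated calls with different random seeds produce different paths from the same model, which is exactly the "distribution of trajectories" the learning algorithm targets.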
2 Mathematical Preliminaries

2.1 Brownian Motion and Stochastic Differential Equations. Brownian motion is a stochastic process originally designed to model the behavior of pollen grains subject to random collisions with molecules of water. A stochastic process B = {B(t), t ∈ [0, ∞)} is a standard one-dimensional Brownian motion under a probability measure P if (1) it starts at 0 with P-probability one; (2) for any t ∈ [0, ∞), Δt > 0, the increment B(t + Δt) − B(t) is a gaussian random variable under P with zero mean and variance Δt; and (3) for all l ∈ ℕ and 0 ≤ t₀ < t₁ < ··· < t_l < ∞, the increments B(t_k) − B(t_{k−1}), k = 1, …, l, are independent random variables under P. An n-dimensional Brownian motion consists of n independent one-dimensional Brownian motion processes. One can always choose a realization of Brownian motion that has continuous paths.³ However, these paths are nowhere differentiable with probability one (Karatzas & Shreve, 1991). Brownian motion can be realized as the limit in distribution as Δt → 0 of the processes obtained by the following iterative scheme:

B(0) = 0,   (2.1)
B(t_{k+1}) = B(t_k) + √Δt Z(t_k),   (2.2)
B(s) = B(t_k), for s ∈ [t_k, t_{k+1}),   (2.3)
where t_k = kΔt, k = 0, 1, …, and the Z(t_k) terms are independent gaussian random variables with zero mean and unit variance. It is common in the engineering literature to represent stochastic differential equations as ordinary differential equations with an additive white noise component,

dX(t)/dt = μ(X(t)) + σW(t),   (2.4)

where W represents white noise, a stochastic process that is required to have the following properties: (1) W is a zero-mean stationary gaussian process; (2) for all t, t′, the covariance between W(t′) and W(t) is a Dirac delta function of t − t′. While white noise does not exist as a proper stochastic process, equation 2.4 can be given mathematical meaning by thinking of it as defining a process of the form

X(t) = X(0) + ∫₀ᵗ μ(X(s)) ds + σ ∫₀ᵗ W(s) ds,   (2.5)

³ A Brownian motion for which P(for all t ≥ 0, lim_{s→t} X_s = X_t) = 1.
where ∫₀ᵗ W(s) ds is now a stochastic process required to have continuous paths and zero-mean, independent stationary increments. It turns out that Brownian motion is the only process with these properties. Thus, in stochastic calculus, equation 2.4 is seen just as a symbolic pointer to the following integral equation,

X(t) = X(0) + ∫₀ᵗ μ(X(s)) ds + σB(t),   (2.6)

where B is Brownian motion. Moreover, if it existed, the white noise W(t) would be the temporal derivative of Brownian motion, dB(t)/dt. Since Brownian paths are nowhere differentiable with probability one, in the mathematical literature the following symbolic form is preferred to equation 2.4, and it is the one adopted in this article:

dX(t) = μ(X(t)) dt + σ dB(t).   (2.7)

Intuitive understanding of the solution to this equation can be gained by thinking of it as the limit as Δt → 0 of the processes defined by the following iterative scheme,

X(t_{k+1}) = X(t_k) + μ(X(t_k)) Δt + σ √Δt Z(t_k),   (2.8)
X(s) = X(t_k), for s ∈ [t_k, t_{k+1}),   (2.9)

where t_k = kΔt, k = 0, 1, …. Under mild assumptions on μ and the initial distribution of X, the processes obtained by this scheme converge in distribution to the solution of equation 2.7 (Kloeden & Platen, 1992). Under the assumptions in this article, the solution of equation 2.7 is unique only in distribution, and thus it is called a distributional (also known as a weak or statistical) solution.⁴

2.2 Probability Measures and Densities. We think of a random experiment as a single run of a diffusion network in the time interval [0, T]. The outcome of an experiment is a continuous n-dimensional path x: [0, T] → ℝⁿ describing the state of the n units of a diffusion network throughout time. We let V denote the set of continuous functions defined on [0, T] taking values in ℝⁿ. This contains all possible outcomes. We are generally interested in measuring probabilities of sets of outcomes. We call these sets events, and we let F represent the set of events.⁵

⁴ For an explanation of the notion of weak solutions of stochastic differential equations, see section 5.3 of Karatzas and Shreve (1991).
⁵ In this article, the set of events F is the smallest sigma algebra containing the open sets of V. A set A of V is open if for every x ∈ A there exists a δ > 0 such that the set B(x, δ) = {v ∈ V: max_{t∈[0,T]} |x(t) − v(t)| < δ} is a subset of A.
A probability measure Q is a function Q: F → [0, 1] that assigns probabilities to events in accordance with the standard probability axioms (Billingsley, 1995). A random variable Y is a function Y: V → ℝ such that for each open set A in ℝ, the inverse image of A under Y is an event. We represent expected values using integral notation. For example, the expected value of the random variable Y with respect to the probability measure Q is represented as follows:

E_Q(Y) = ∫_V Y(x) dQ(x).   (2.10)

Probability densities of continuous paths are defined as follows. Let P and Q be probability measures on (V, F). If it exists, the density of P with respect to Q is a nonnegative random variable L that satisfies the following relationship,

E_P(Y) = ∫_V Y(x) dP(x) = ∫_V Y(x) L(x) dQ(x) = E_Q(YL),   (2.11)
for any random variable Y with finite expected value under P. The function L is called the Radon-Nikodym derivative or Radon-Nikodym density of P with respect to Q and is commonly represented as dP/dQ. Conditions for the existence of these derivatives can be found in any measure theory textbook (Billingsley, 1995). Intuitively, (dP/dQ)(x) represents how many times the path x is likely to occur under P relative to the number of times it is likely to occur under Q.

In this article, we concentrate on the probability distributions on the space V of continuous paths associated with diffusion networks, and thus we need to consider distributional solutions of SDEs. We let the n-dimensional random process X^λ represent a solution of equations 1.4 and 1.5. We regard the first d components of X^λ as observable and denote them by O^λ. The last n − d components of X^λ are denoted by H^λ and are regarded as unobservable or hidden. We define the observable and hidden outcome spaces V_o and V_h, with associated event spaces F_o and F_h, by replacing n by d and n − d, respectively, in the definitions of V and F provided previously. The observable and hidden components O^λ and H^λ of a solution X^λ = (O^λ, H^λ) of equations 1.4 and 1.5 take values in V_o and V_h, respectively. Note that V = V_o × V_h and F is generated by sets of the form A_o × A_h, where A_o ∈ F_o and A_h ∈ F_h. For each path x ∈ V, we write x = (x_o, x_h), where x_o ∈ V_o and x_h ∈ V_h. The process X^λ induces a unique probability distribution Q^λ on the measurable space (V, F). Intuitively, Q^λ(A) represents the probability that a diffusion network with parameter λ produces paths in the set A. Since our goal is to learn a probability distribution of observable paths, we need the
probability measure Q^λ_o associated with the observable components alone. If there are no hidden units, d = n and Q^λ_o = Q^λ. If there are hidden units, d < n, and we must marginalize Q^λ over the unobservable components:

Q^λ_o(A_o) = Q^λ(A_o × V_h), for all A_o ∈ F_o.   (2.12)

We will also need to work with the marginal probability measure of the hidden components:

Q^λ_h(A_h) = Q^λ(V_o × A_h), for all A_h ∈ F_h.   (2.13)

Intuitively, Q^λ_o(A_o) is the probability that a diffusion network with parameter λ generates observable paths in the set A_o, and Q^λ_h(A_h) is the probability that the hidden paths are in the set A_h. Finally, we set Q = {Q^λ: λ ∈ ℝᵖ} and Q_o = {Q^λ_o: λ ∈ ℝᵖ}, referring to the entire family of probability measures parameterized by λ and defined on (V, F) and (V_o, F_o), respectively.
3 Density of Observable Paths

Our goal is to select a value of λ on the basis of training data, such that Q^λ_o best approximates a desired distribution. To describe a maximum likelihood or a Bayesian estimation approach, we need to define probability densities L^λ_o of continuous observable paths. In discrete-time systems, like hidden Markov models, the Lebesgue measure⁶ is used as the standard reference with respect to which probability densities are defined. Unfortunately, for continuous-time systems, the Lebesgue measure is no longer valid. Instead, our reference measure R will be the probability measure induced by a diffusion network with dispersion σ and initial distribution ν but with no drift,⁷ so that R(A) represents the probability that such a network generates paths lying in the set A. An important theorem in stochastic calculus, known as Girsanov's theorem,⁸ tells us how to compute the relative density of processes with the same diffusion term and different drifts (see Oksendal, 1998). Using Girsanov's
⁶ The Lebesgue measure of an interval (a, b) is b − a, the length of that interval.
⁷ More formally, for each A ∈ F, R(A) = ∫_{ℝⁿ} R_u(A) dν(u), where for each u ∈ ℝⁿ the measure R_u on (V, F) is such that under it, the process B = {B(t, x) = (x(t) − x(0))/σ: t ∈ [0, T], x ∈ V} is a standard n-dimensional Brownian motion and R_u(x(0) = u) = 1.
⁸ The conditions on μ mentioned in the introduction are sufficient for Girsanov's theorem to hold.
theorem, it can be shown that

L^λ(x) = dQ^λ/dR (x) = exp{ (1/σ²) ∫₀ᵀ μ(x(t), λ) · dx(t) − (1/(2σ²)) ∫₀ᵀ |μ(x(t), λ)|² dt }   (3.1)

= exp{ Σ_{j=1}^{n} ( (1/σ²) ∫₀ᵀ μ_j(x(t), λ) dx_j(t) − (1/(2σ²)) ∫₀ᵀ (μ_j(x(t), λ))² dt ) },   (3.2)
for x ∈ V. The integral ∫₀ᵀ μ_j(x(t), λ) dx_j(t) is an Itô stochastic integral (see Oksendal, 1998). Intuitively, we can think of it as the (mean-square) limit, as Δt → 0, of the sum Σ_{k=0}^{l−1} μ_j(x(t_k), λ)(x_j(t_{k+1}) − x_j(t_k)), where 0 = t₀ < t₁ < ··· < t_l = T are the sampling times, t_{k+1} = t_k + Δt, and Δt > 0 is the sampling period. The term L^λ is a Radon-Nikodym derivative. Intuitively, it represents the likelihood of a diffusion network with parameter λ generating the path x relative to the likelihood for the reference diffusion network with no drift. For a fixed path x ∈ V, the term L^λ(x) can be treated as a likelihood function⁹ of λ.

If there are no hidden units, d = n, Q^λ = Q^λ_o, and thus we can take R_o = R and L^λ_o = L^λ. If there are hidden units, more work is required. For the construction here, we impose the condition that the initial probability measure ν is a product measure ν = ν_o × ν_h, where ν_o, ν_h are probability measures on ℝᵈ and ℝⁿ⁻ᵈ, respectively.¹⁰ It then follows from the independence of the components of Brownian motion that

R(A_o × A_h) = R_o(A_o) R_h(A_h),
(3.3)
for all A_o ∈ F_o, A_h ∈ F_h, where

R_o(A_o) = R(A_o × V_h),   (3.4)
R_h(A_h) = R(V_o × A_h).   (3.5)
⁹ Unfortunately, since R depends on σ, L^λ cannot be treated as a likelihood function of σ. For this reason, estimation of σ needs to be treated differently from estimation of λ for continuous-time problems. We leave the issue of continuous-time estimation of σ for future work.
¹⁰ Here, ℝᵈ and ℝⁿ⁻ᵈ are endowed with their Borel sigma algebras, which are generated by the open sets in these spaces.
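The Itô integrals in equation 3.2 can be approximated by the finite sums described in the text, which yields a discrete-time estimate of log L^λ(x). The sketch below is illustrative only; the drift and all parameter values are placeholders, not code from the paper:

```python
import numpy as np

def log_girsanov_density(x, mu, lam, sigma, dt):
    """Discrete-time approximation of log L^lambda(x) from equation 3.2.

    x  : array of shape (l+1, n), a path sampled every dt seconds.
    mu : drift function mu(x_k, lam) -> array of shape (n,).

    The Ito integral is approximated by the forward (left-endpoint) sum
    sum_k mu_j(x(t_k)) (x_j(t_{k+1}) - x_j(t_k)), as described in the text.
    """
    drifts = np.array([mu(xk, lam) for xk in x[:-1]])  # mu at left endpoints
    dx = np.diff(x, axis=0)                            # increments x_{k+1} - x_k
    ito = np.sum(drifts * dx)                          # ~ int mu . dx
    energy = np.sum(drifts ** 2) * dt                  # ~ int |mu|^2 dt
    return ito / sigma**2 - energy / (2 * sigma**2)

# Toy example: a linear (Ornstein-Uhlenbeck-like) drift with placeholder lam.
mu = lambda xk, lam: -lam * xk
path = np.zeros((101, 2))            # the constant zero path
ll = log_girsanov_density(path, mu, lam=1.0, sigma=1.0, dt=0.01)
```

For the constant zero path both integrals vanish, so the log density relative to the driftless reference is zero, as equation 3.2 predicts.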
To find an appropriate density for Q^λ_o, note that for A_o ∈ F_o,

Q^λ_o(A_o) = Q^λ(A_o × V_h)   (3.6)
= ∫_{A_o} ∫_{V_h} L^λ(x_o, x_h) dR_h(x_h) dR_o(x_o),   (3.7)

and therefore the density of Q^λ_o with respect to R_o is

L^λ_o(x_o) = dQ^λ_o/dR_o (x_o) = ∫_{V_h} L^λ(x_o, x_h) dR_h(x_h), for x_o ∈ V_o.   (3.8)

Similarly, the density of Q^λ_h with respect to R_h is as follows:

L^λ_h(x_h) = dQ^λ_h/dR_h (x_h) = ∫_{V_o} L^λ(x_o, x_h) dR_o(x_o), for x_h ∈ V_h.   (3.9)
4 Log-Likelihood Gradients

Let P_o represent a probability measure on (V_o, F_o) that we wish to approximate (e.g., a distribution of paths determined by the environment). Our goal is to find a probability measure from the family Q_o that best matches P_o. The hope is that this will provide an approximate model of P_o, the environment, that could be used for tasks such as sequence recognition, sequence generation, or stochastic filtering. We approach this modeling problem by defining a Kullback-Leibler distance (White, 1996) between P_o and Q^λ_o:

E_{P_o}( log (L_o / L^λ_o) ),   (4.1)

where E_{P_o} is an expected value with respect to P_o and L_o = dP_o/dR_o is the density of the desired probability measure.¹¹ We seek values of λ that minimize equation 4.1. In practice, we estimate such values by obtaining ñ fair sample paths¹² {x_o^i}, i = 1, …, ñ, from P_o and seeking values of λ that maximize the following function:

W(λ) = ( (1/ñ) Σ_{i=1}^{ñ} log L^λ_o(x_o^i) ) − Ψ(λ).   (4.2)
¹¹ We are assuming that the expectation in equation 4.1 is well defined and finite and, in particular, that L_o exists, which is also an implicit assumption in finite-dimensional density estimation approaches.
¹² Fair samples are samples obtained in an independent and identically distributed manner.
This function is a combination of the empirical log-likelihood and a regularizer Ψ: ℝᵖ → ℝ that encodes prior knowledge about desirable values of λ (Bishop, 1995). To simplify the notation, hereafter we present results for ñ = 1 and Ψ(λ) = 0 for all λ ∈ ℝᵖ. Generalizing the analysis to ñ > 1 and interesting regularizers is easy but obscures the presentation.

The gradient of the log density of the observable paths with respect to λ is of interest to apply optimization techniques such as gradient ascent and the EM algorithm. If there are no hidden units, equation 3.1 gives the density of the observable measure, and differentiation yields¹³

∂/∂λ_i log L^λ(x) = (1/σ²) Σ_{j=1}^{n} ( ∫₀ᵀ (∂μ_j(x(t), λ)/∂λ_i) dx_j(t) − ∫₀ᵀ (∂μ_j(x(t), λ)/∂λ_i) μ_j(x(t), λ) dt )   (4.3)

= (1/σ²) Σ_{j=1}^{n} ∫₀ᵀ (∂μ_j(x(t), λ)/∂λ_i) dI_j^λ(x, t), for x ∈ V,   (4.4)

where σ⁻¹I^λ is a (joint) innovation process (Poor, 1994):

I^λ(x, t) = x(t) − x(0) − ∫₀ᵗ μ(x(s), λ) ds.   (4.5)

Such a process is a standard n-dimensional Brownian motion under Q^λ.
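Under the forward-Euler discretization of section 2.1 (equations 2.8 and 2.9), the increments of the innovation process in equation 4.5 recover exactly the injected noise, which gives a quick numerical sanity check. This sketch is illustrative only; the drift and all parameter values are placeholders:

```python
import numpy as np

def innovation(x, mu, dt):
    """Discretized innovation process I(x, t_k) of equation 4.5."""
    drift_sum = np.cumsum(mu(x[:-1]) * dt)      # ~ int_0^t mu(x(s)) ds
    I = np.zeros_like(x)
    I[1:] = x[1:] - x[0] - drift_sum
    return I

# Simulate a one-dimensional Ornstein-Uhlenbeck-type path with the Euler
# scheme of equations 2.8-2.9, then recover the driving noise from I.
rng = np.random.default_rng(0)
sigma, dt, n_steps = 0.5, 1e-3, 20000
mu = lambda u: -u                               # placeholder drift mu(u) = -u
x = np.zeros(n_steps + 1)
for k in range(n_steps):
    x[k + 1] = x[k] + mu(x[k]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()

dI = np.diff(innovation(x, mu, dt))
# When the path is generated by its own drift, sigma^{-1} I is (discretized)
# Brownian motion: the increments dI are i.i.d. gaussians with variance
# sigma^2 dt, as the text asserts.
```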
5 Stochastic Teacher Forcing

If there are no hidden units, the likelihood and log-likelihood gradient of paths can be obtained directly via equations 3.1 and 4.4. If there are hidden units, we need to take expected values over hidden paths:

L^λ_o(x_o) = ∫_{V_h} L^λ(x_o, x_h) dR_h(x_h) = E_{R_h}(L^λ(x_o, ·)).   (5.1)

¹³ Conditions that are sufficient to justify the differentiation leading to equation 4.4 are that the first and second partial derivatives of μ(u, λ) with respect to λ exist and together with μ are continuous in (u, λ) and satisfy a linear growth condition of the form |μ(u, λ)| ≤ K_λ(1 + |u|), where K_λ can be chosen independent of λ whenever λ is restricted to a compact set in ℝᵖ (Levanony, Shwartz, & Zeitouni, 1990; Protter, 1990). These conditions are satisfied by the neural network drift (see equation 1.2).
In this article, we propose estimates of this likelihood obtained by adapting a technique known as Monte Carlo importance sampling (Fishman, 1996). Instead of averaging with respect to R_h, we average with respect to another distribution S_h and multiply the variables being averaged by a correction factor known as the importance function. This correction factor guarantees that the new Monte Carlo estimates will remain unbiased. However, by using an appropriate sampling distribution S_h, the estimates may be more efficient than if we just sample from R_h (they may require fewer samples to obtain a desired level of precision). Following equation 2.11, we have that

$$L_o^{\lambda}(x_o)=\int_{\Omega_h}L^{\lambda}(x_o,x_h)\,\frac{dR_h}{dS_h}(x_h)\,dS_h(x_h)=E_{S_h}\!\left(L^{\lambda}(x_o,\cdot)\,\frac{dR_h}{dS_h}(\cdot)\right),\tag{5.2}$$

where S_h is a fixed distribution on (Ω_h, F_h) for which the density dR_h/dS_h exists. This density thus acts as the desired importance function. We can obtain unbiased Monte Carlo estimates of the expected value in equation 5.2 by averaging over a set of hidden paths H = {h^1, …, h^m} sampled from S_h:

$$\hat L_o^{\lambda}(x_o)=\sum_{l=1}^{m}p^{\lambda}(x_o,h^l),\tag{5.3}$$
where

$$p^{\lambda}(x_o,h)=\begin{cases}\dfrac{1}{m}\,L^{\lambda}(x_o,h)\,\dfrac{dR_h}{dS_h}(h), & \text{for } h\in H,\\[4pt] 0, & \text{else.}\end{cases}\tag{5.4}$$
We use the gradient of the log of this density estimate for training the network:

$$\nabla_{\lambda}\log\hat L_o^{\lambda}(x_o)=\sum_{l=1}^{m}p^{\lambda}_{h|o}(h^l\,|\,x_o)\,\nabla_{\lambda}\log L^{\lambda}(x_o,h^l),\tag{5.5}$$

where

$$p^{\lambda}_{h|o}(h\,|\,x_o)=\frac{p^{\lambda}(x_o,h)}{p_o^{\lambda}(x_o)},\quad\text{for } h\in\Omega_h,\tag{5.6}$$

and

$$p_o^{\lambda}(x_o)=\hat L_o^{\lambda}(x_o)=\sum_{l=1}^{m}p^{\lambda}(x_o,h^l).\tag{5.7}$$
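Equation 5.5 says the gradient estimate is just a weighted average of per-path joint gradients under the normalized weights of equations 5.6 and 5.7. A minimal sketch with hypothetical names:

```python
import numpy as np

def grad_log_density_estimate(joint_grads, weights):
    """Equation 5.5: average the joint gradients grad_lambda log L(x_o, h^l)
    under the probability mass p_{h|o}(h^l | x_o) of equations 5.6-5.7.

    joint_grads : (m, p) array, one gradient per sampled hidden path
    weights     : length-m array of unnormalized weights p(x_o, h^l)
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()              # normalization by equation 5.7
    return w @ np.asarray(joint_grads)
```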
Note that $\sum_{h\in\Omega_h}p^{\lambda}_{h|o}(h\,|\,x_o)=1$, and thus we can think of p^λ_{h|o}(· | x_o) as a probability mass function on Ω_h. Moreover, ∇_λ log L̂_o^λ(x_o) is the average of the joint log-likelihood gradient ∇_λ log L^λ(x_o, ·) with respect to that probability mass function. While the results presented here are general and work for any sampling distribution for which dR_h/dS_h exists, it is important to find a distribution S_h from which we know how to sample, for which we know dR_h/dS_h, and for which the Monte Carlo estimates are relatively efficient (i.e., do not require a large number of samples to achieve a desired reliability level). An obvious choice is to sample from R_h itself, in which case dR_h/dS_h = 1. In fact, this is the approach we used in previous versions of this article (Mineiro, Movellan, & Williams, 1998), and it was also recently proposed in Solo (2000). The problem with sampling from R_h is that as learning progresses, the Monte Carlo estimates become less and less reliable, and thus there is a need for better sampling distributions that change as learning progresses.

We have obtained good results by sampling in a manner reminiscent of the teacher forcing method from deterministic neural networks (Hertz, Krogh, & Palmer, 1991). The idea is to obtain a sample of hidden paths from a network whose observable units have been forced to exhibit the desired observable path x_o. Consider a diffusion network with parameter vector λ_s ∈ R^p, not necessarily equal to λ. The role of this network will be to provide sample hidden paths. For each time t ∈ [0, T], we fix the output units of such a network to x_o(t) and let the hidden units run according to their natural dynamics,

$$dH(t)=\mu_h(x_o(t),H(t),\lambda_s)\,dt+\sigma\,dB_h(t),\qquad H(0)\sim\nu_h,\tag{5.8}$$

where μ_h is the hidden component of the drift. We repeat this procedure m times to obtain the sample of hidden paths H = {h^l}_{l=1}^m. One advantage of this sampling scheme is that we can use Girsanov's theorem to obtain the desired importance function,

$$\frac{dR_h}{dS_h}(x_h)=\exp\left\{-\frac{1}{\sigma^2}\int_0^T\mu_h(x_o(t),x_h(t),\lambda_s)\cdot dx_h(t)+\frac{1}{2\sigma^2}\int_0^T|\mu_h(x_o(t),x_h(t),\lambda_s)|^2\,dt\right\},\tag{5.9}$$
where S_h, the sampling distribution, is now the distribution of hidden paths induced by the network in equation 5.8 with fixed parameter λ_s. If we let
λ_s = λ, then the p^λ function defined in equation 5.4 simplifies as follows:

$$p^{\lambda}(x_o,h)=\frac{1}{m}\,L^{\lambda}(x_o,h)\,\frac{dR_h}{dS_h}(h)=\frac{1}{m}\exp\left\{\frac{1}{\sigma^2}\int_0^T\mu_o(x_o(t),h(t),\lambda)\cdot dx_o(t)-\frac{1}{2\sigma^2}\int_0^T|\mu_o(x_o(t),h(t),\lambda)|^2\,dt\right\},\quad\text{for } h\in H.\tag{5.10}$$
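A discrete-time sketch of the teacher-forced sampler of equation 5.8, using the first-order Euler scheme the authors adopt in section 9; the function name and argument layout are illustrative assumptions:

```python
import numpy as np

def sample_hidden_path(mu_h, x_o, h0, sigma, dt, rng):
    """Euler simulation of dH(t) = mu_h(x_o(t), H(t)) dt + sigma dB_h(t)
    with the observable units clamped to the desired path x_o.

    x_o : array (s+1, d), observable path the network is forced to exhibit
    h0  : initial hidden state, drawn from nu_h by the caller
    """
    h = np.empty((len(x_o), len(h0)))
    h[0] = h0
    for k in range(len(x_o) - 1):
        noise = rng.standard_normal(len(h0))
        h[k + 1] = h[k] + mu_h(x_o[k], h[k]) * dt + sigma * np.sqrt(dt) * noise
    return h
```

Repeating this m times yields the sample H = {h^1, …, h^m} whose importance weights are then given by equation 5.9 (or equation 5.10 when λ_s = λ).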
6 Monte Carlo EM Learning

Given a fixed sampling distribution S_h and a fixed set of hidden paths H = {h^1, …, h^m} sampled from S_h, our objective is to find values of λ that maximize the estimate of the likelihood L̂_o^λ(x_o). We can search for such values using standard iterative procedures, like gradient ascent or conjugate gradient. Another iterative approach of interest is the EM algorithm (Dempster, Laird, & Rubin, 1977). On each iteration of the EM algorithm, we start with a fixed parameter vector λ̄ and search for values of λ that optimize the following expression:^14

$$M(\bar\lambda,\lambda,x_o)=\sum_{l=1}^{m}p^{\bar\lambda}_{h|o}(h^l\,|\,x_o)\,\log p^{\lambda}(x_o,h^l).\tag{6.1}$$

Note that since $\sum_{l=1}^{m}p^{\bar\lambda}_{h|o}(h^l\,|\,x_o)=1$ and $p_o^{\lambda}(x_o)=p^{\lambda}(x_o,h^l)/p^{\lambda}_{h|o}(h^l\,|\,x_o)$, then
$$\log\hat L_o^{\lambda}(x_o)=\log p_o^{\lambda}(x_o)=\sum_{l=1}^{m}p^{\bar\lambda}_{h|o}(h^l\,|\,x_o)\,\log\frac{p^{\lambda}(x_o,h^l)}{p^{\lambda}_{h|o}(h^l\,|\,x_o)}\tag{6.2}$$

$$=M(\bar\lambda,\lambda,x_o)-\sum_{l=1}^{m}p^{\bar\lambda}_{h|o}(h^l\,|\,x_o)\,\log p^{\lambda}_{h|o}(h^l\,|\,x_o),\tag{6.3}$$
and

$$\log\hat L_o^{\lambda}(x_o)-\log\hat L_o^{\bar\lambda}(x_o)=M(\bar\lambda,\lambda,x_o)-M(\bar\lambda,\bar\lambda,x_o)+KL\!\left(p^{\bar\lambda}_{h|o}(\cdot\,|\,x_o),\,p^{\lambda}_{h|o}(\cdot\,|\,x_o)\right),\tag{6.4}$$
^14 If a regularizer is used, redefine p^λ(x_o, h) in equation 5.4 as L^λ(x_o, h) (dR_h/dS_h)(h) exp(−Ψ(λ)).
where KL stands for the Kullback-Leibler distance between the functions p^{λ̄}_{h|o}(· | x_o) and p^λ_{h|o}(· | x_o). Since KL is nonnegative, it follows that if M(λ̄, λ, x_o) > M(λ̄, λ̄, x_o), then L̂_o^λ(x_o) > L̂_o^{λ̄}(x_o). Since this adaptation of the EM algorithm maximizes an estimate of the log likelihood instead of the log likelihood itself, we refer to it as stochastic EM. On each iteration of the stochastic EM procedure, we find a value of λ that maximizes M(λ̄, λ, x_o), and we let λ̄ take that value. The procedure guarantees that the estimate of the likelihood of the observed path, L̂_o^λ(x_o), will increase or stay the same; if it stays the same, we have converged.

Our approach guarantees convergence only when the sampling distribution S_h and the sample of hidden paths H are fixed. In practice, we have found that even with relatively small sample sizes, it is beneficial to change S_h and H as learning progresses. For instance, in the approach proposed at the end of section 7, we change the sampling distribution and the sample paths after each iteration of the EM algorithm. While we have not carefully analyzed the properties of this approach, we believe the reason that it works well is that as learning progresses, it uses sampling distributions S_h that are better suited for the values of λ that EM moves into.

7 The Neural Network Case

Up to now, we have presented the learning algorithm in a general manner applicable to generic SDE models. In this section, we show how the algorithm applies to diffusion neural networks. Let w_sr be the component of λ representing the synaptic weight (conductance) from the sending unit s to the receiving unit r. For the neural network drift, we have

$$\frac{\partial\mu_j(x(t),\lambda)}{\partial w_{sr}}=\kappa_j\sum_{i=1}^{n}\frac{\partial}{\partial w_{sr}}\,\varphi(x_i(t))\,w_{ij}=\delta_{jr}\,\kappa_r\,\varphi(x_s(t)),\quad\text{for } x\in\Omega,\tag{7.1}$$
where δ is the Kronecker delta function (δ_jr = 1 if j = r, 0 else) and x_i stands for the ith component of x (i.e., the activation of unit i at time t in path x).

7.1 Networks Without Hidden Units. Combining equations 4.4 and 7.1, we get

$$\frac{\partial\log L^{\lambda}(x)}{\partial w_{sr}}=\frac{1}{\sigma^2}\sum_{j=1}^{n}\int_0^T\frac{\partial\mu_j(x(t),\lambda)}{\partial w_{sr}}\,dI_j^{\lambda}(x,t)\tag{7.2}$$

$$=\frac{1}{\sigma^2}\sum_{j=1}^{n}\int_0^T\delta_{jr}\,\kappa_r\,\varphi(x_s(t))\,dI_j^{\lambda}(x,t)\tag{7.3}$$

$$=\frac{\kappa_r}{\sigma^2}\int_0^T\varphi(x_s(t))\,dI_r^{\lambda}(x,t),\tag{7.4}$$
where σ^{-1} I_r^λ is the innovation process of the receiving unit, that is,

$$dI_r^{\lambda}(x,t)=dx_r(t)-\mu_r(x(t),\lambda)\,dt=dx_r(t)+\kappa_r\rho_r x_r(t)\,dt-\kappa_r\xi_r\,dt-\kappa_r\sum_{i=1}^{n}\varphi(x_i(t))\,w_{ir}\,dt.\tag{7.5}$$
Combining equations 7.4 and 7.5, we get

$$\frac{\partial\log L^{\lambda}(x)}{\partial w_{sr}}=\frac{\kappa_r^2}{\sigma^2}\left(b_{sr}(x)-\sum_{i=1}^{n}a_{si}(x)\,w_{ir}\right),\tag{7.6}$$

where

$$b_{sr}(x)=\frac{1}{\kappa_r}\int_0^T\varphi(x_s(t))\,dx_r(t)+\rho_r\int_0^T\varphi(x_s(t))\,x_r(t)\,dt-\xi_r\int_0^T\varphi(x_s(t))\,dt,\tag{7.7}$$

$$a_{si}(x)=\int_0^T\varphi(x_s(t))\,\varphi(x_i(t))\,dt.\tag{7.8}$$
Let ∇_w log L^λ(x) be an n × n matrix with cell i, j containing the derivative of log L^λ(x) with respect to w_ij for i, j = 1, …, n. It follows that

$$\nabla_w\log L^{\lambda}(x)=\frac{1}{\sigma^2}\left(b(x)-a(x)\,w\right)K^2,\tag{7.9}$$

where K is a diagonal matrix with diagonal elements κ_1, …, κ_n. Note that all the terms involved in the computation of the gradient of the log likelihood are Hebbian; they involve time integrals of pairwise activation products. Also note that each cell of the matrix a is an inner product in the Hilbert space of square integrable functions. This may open avenues for generalizing standard kernel methods (Aizerman et al., 1964; Burges, 1998) for learning distributions of sequences.

The matrix a(x) is positive semidefinite. To see why, let v ∈ R^n and y(t) = Σ_{j=1}^n v_j φ(x_j(t)) for all t ∈ [0, T], and note that

$$\int_0^T y(t)^2\,dt=v'\,a\,v\ge 0.\tag{7.10}$$

Thus, if a(x) is invertible, there is a unique maximum for the log-likelihood function. The maximum likelihood estimate of w follows:

$$\hat w=(a(x))^{-1}\,b(x).\tag{7.11}$$
A similar procedure can be followed to find the maximum likelihood estimates of the other parameters:

$$\hat\xi_r=\frac{(x_r(T)-x_r(0))\,\kappa_r^{-1}+\int_0^T\left(\rho_r x_r(t)-\sum_{j=1}^{n}w_{jr}\,\varphi(x_j(t))\right)dt}{T},\tag{7.12}$$

$$\hat\rho_r=\frac{\int_0^T\left(\xi_r+\sum_{j=1}^{n}\varphi(x_j(t))\,w_{jr}\right)x_r(t)\,dt-\kappa_r^{-1}\int_0^T x_r(t)\,dx_r(t)}{\int_0^T x_r(t)^2\,dt},\tag{7.13}$$

$$\hat\kappa_r=\frac{\int_0^T\bar\mu_r(x(t),\lambda)\,dx_r(t)}{\int_0^T\bar\mu_r(x(t),\lambda)^2\,dt},\quad\text{for } r=1,\ldots,n,\tag{7.14}$$

where μ̄_r(x(t), λ) = μ_r(x(t), λ)/κ_r, which is not a function of κ_r. Equations 7.11 through 7.14 each maximize a parameter or set of parameters assuming the other parameters are fixed.
7.2 Networks with Hidden Units. If there are hidden units, we use the stochastic EM approach presented in section 6. Given an observable path x_o ∈ Ω_o, we obtain a fair sample of hidden paths H_m = {h^1, …, h^m} from a sampling distribution S_h. The gradient of the log-likelihood estimate with respect to w is as follows:

$$\nabla_w\log\hat L_o^{\lambda}(x_o)=\frac{1}{\sigma^2}\left(\tilde b^{\lambda}(x_o)-\tilde a^{\lambda}(x_o)\,w\right)K^2,\tag{7.15}$$

where

$$\tilde b^{\lambda}(x_o)=\sum_{l=1}^{m}p^{\lambda}_{h|o}(h^l\,|\,x_o)\,b(x_o,h^l),\tag{7.16}$$

$$\tilde a^{\lambda}(x_o)=\sum_{l=1}^{m}p^{\lambda}_{h|o}(h^l\,|\,x_o)\,a(x_o,h^l),\tag{7.17}$$
and a, b are given by equations 7.7 and 7.8. Note that in this case, the ã and b̃ coefficients depend on w in a nonlinear manner in general, even if the activation function φ is linear. We can find values of w for which the gradient vanishes by using standard iterative procedures like gradient ascent or conjugate gradient. The situation simplifies when the EM algorithm is used. Let λ̄ and λ represent the parameter vectors of two n-dimensional diffusion networks that have the same values for the ξ, κ, and ρ terms. Let w̄ and w represent the connectivity matrices of these two networks. Following the argument presented in section 6, we seek values of w that maximize M(λ̄, λ, x_o). To do so, we first find the gradient with respect to w:

$$\nabla_w M(\bar\lambda,\lambda,x_o)=\frac{1}{\sigma^2}\left(\tilde b^{\bar\lambda}(x_o)-\tilde a^{\bar\lambda}(x_o)\,w\right)K^2.\tag{7.18}$$
The matrix ã^{λ̄}(x_o) is an average of positive semidefinite matrices, and thus it is also positive semidefinite. Thus, if ã^{λ̄} is invertible, we can directly maximize M(λ̄, λ, x_o) with respect to w by setting the gradient to zero and solving for w. The solution,

$$\hat w=\left(\tilde a^{\bar\lambda}(x_o)\right)^{-1}\tilde b^{\bar\lambda}(x_o),\tag{7.19}$$

becomes the new w̄, and equation 7.19 is iterated until convergence is achieved. Note that this procedure is iterative and guarantees convergence only to a local maximum of the estimate of the likelihood function. A similar procedure can be followed to train the ξ, κ, and ρ parameters.

Summary of the Stochastic EM Learning Algorithm for the Neural Network Case

1. Choose an initial network with parameter λ̄ ∈ R^p, including the values of w̄, ξ̄, κ̄, and ρ̄. The initial connectivity matrix w̄ is commonly the zero matrix. Hereafter, we concentrate on how to train the connectivity matrix. Training the other parameters in λ̄ is straightforward.

2. With the observation units forced to exhibit a desired sequence x_o, run the hidden part of the network m times, starting at time 0 and ending at time T, to obtain m different hidden paths: h^1, …, h^m.

3. Compute the weight, here represented as p(·), of each hidden path:

$$p(l)=\exp\left\{\frac{1}{\sigma^2}\sum_{j=1}^{d}\int_0^T\mu_j(x^l(t),\bar\lambda)\,dx_j^l(t)-\frac{1}{2\sigma^2}\sum_{j=1}^{d}\int_0^T\left(\mu_j(x^l(t),\bar\lambda)\right)^2 dt\right\},\tag{7.20}$$

for l = 1, …, m, where x^l = (x_o, h^l) is a joint sequence consisting of the desired sequence x_o for the observation units and the sampled sequence h^l for the hidden units, and the index j = 1, …, d goes over the output units. In practice, the integrals in equation 7.20 are approximated using discrete-time approximations, as described in section 9.

4. Compute the a and b matrices of each hidden sequence:

$$a_{ij}(l)=\int_0^T\varphi(x_i^l(t))\,\varphi(x_j^l(t))\,dt,\quad\text{for } i,j=1,\ldots,n,\tag{7.21}$$

$$b_{ij}(l)=\frac{1}{\kappa_j}\int_0^T\varphi(x_i^l(t))\,dx_j^l(t)+\rho_j\int_0^T\varphi(x_i^l(t))\,x_j^l(t)\,dt-\xi_j\int_0^T\varphi(x_i^l(t))\,dt,\quad\text{for } i,j=1,\ldots,n.\tag{7.22}$$
5. Compute the averaged matrices ã and b̃:

$$\tilde a=\frac{\sum_{l=1}^{m}p(l)\,a(l)}{\sum_{l=1}^{m}p(l)},\tag{7.23}$$

$$\tilde b=\frac{\sum_{l=1}^{m}p(l)\,b(l)}{\sum_{l=1}^{m}p(l)}.\tag{7.24}$$

6. Update the connectivity matrix:

$$\bar w=\tilde a^{-1}\,\tilde b.\tag{7.25}$$
7. Go back to step 2 using the new value of w̄.

8 Comparison to Previous Work

We use diffusion networks as a way to parameterize distributions of continuous time-varying signals. We proposed methods for finding local maxima of a likelihood estimate. This process is known as learning in the neural network literature, system identification in the engineering literature, and parameter estimation in the statistical literature. The main motivation for the diffusion network approach is to combine the versatility of recurrent neural networks (Pearlmutter, 1995) with the well-known advantages of stochastic models (Oksendal, 1998). Thus, our work is closely related to the literature on continuous-time recurrent neural networks and the literature on stochastic filtering.

8.1 Continuous-Time Recurrent Neural Networks. We use a recurrent neural network drift function and allow full interconnectivity between hidden and observable units, as is standard in neural network applications. In recurrent neural networks, the dynamics are deterministic (σ = 0), while in diffusion networks they are not (σ > 0). Learning algorithms for recurrent neural networks typically find values of the parameter vector λ that minimize a mean squared error of the following form (Pearlmutter, 1995),

$$W(\lambda)=\int_0^T\left(o(t,\lambda)-r(t)\right)^2 dt,\tag{8.1}$$

where r is a desired path we want the network to learn and o(·, λ) is the unique observable path produced by a recurrent neural network with parameter vector λ. Since the optimal properties of maximum-likelihood methods are well understood mathematically, it is of interest to find under what conditions the solutions found by minimizing equation 8.1 are maximum-likelihood estimates. This will give us a sense of the likelihood model implicitly used by standard learning algorithms for deterministic neural networks.
For a given deterministic neural network with parameter vector λ, we define a stochastic process O^λ by adding white noise to the unique observable path o(·, λ) produced by that network:

$$O^{\lambda}(t)=o(t,\lambda)+\sigma W(t).\tag{8.2}$$

To obtain a mathematical interpretation of equation 8.2, we introduce

$$Z^{\lambda}(t)=\int_0^t O^{\lambda}(s)\,ds,\tag{8.3}$$

which is thus governed by the following SDE,

$$dZ^{\lambda}(t)=o(t,\lambda)\,dt+\sigma\,dB(t),\tag{8.4}$$

where B is standard Brownian motion. Note that if Z^λ is known, O^λ is known and vice versa, so no information is lost by working with Z^λ instead of O^λ. To find the likelihood of a path o given a network with parameter vector λ, we first compute the integral trajectory z:

$$z(t)=\int_0^t o(s)\,ds.\tag{8.5}$$
Using Girsanov's theorem, the log likelihood of z, given that it is generated as a realization of the SDE model in equation 8.4, can be shown to be as follows:

$$\log L^{\lambda}(z)=\frac{1}{\sigma^2}\int_0^T o(t,\lambda)\,dz(t)-\frac{1}{2\sigma^2}\int_0^T o(t,\lambda)^2\,dt.\tag{8.6}$$

Considering that dz(t) = o(t) dt, it is easy to see that maximizing L^λ(z) with respect to λ is equivalent to minimizing W(λ) of equation 8.1. Thus, standard continuous-time neural network learning algorithms can be seen as performing maximum likelihood estimation with respect to the stochastic process defined in equation 8.2. In other words, the generative model underlying those algorithms consists of adding white noise to the unique observable trajectory generated by a network with deterministic dynamics. Note that this is an extremely poor generative model. In particular, this model cannot handle bifurcations and time warping, two common sources of variability in natural signals. Instead of adding white noise to a deterministic path, in diffusion networks we add white noise to the activation increments; that is, the activations are governed by an SDE of the form

$$dX^{\lambda}(t)=\mu(X(t),\lambda)\,dt+\sigma\,dB(t).\tag{8.7}$$
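The contrast between the two generative models (noise added to the output, equation 8.2, versus noise added to the dynamics, equation 8.7) can be seen in a small simulation of the single-neuron double-well example discussed with Figure 3. The drift μ(x) = x − x³ below is an illustrative stand-in for a two-well energy gradient, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def endpoint(sigma_dyn, steps=200, dt=0.05):
    """Euler simulation of dX = (X - X^3) dt + sigma_dyn dB, started at the
    origin, where the drift is zero. With sigma_dyn = 0 the path never moves;
    with sigma_dyn > 0 the state escapes toward one of the wells at +/-1."""
    x = 0.0
    for _ in range(steps):
        x += (x - x**3) * dt + sigma_dyn * np.sqrt(dt) * rng.standard_normal()
    return x

# Noise on the output of the deterministic net (equation 8.2) gives a
# unimodal gaussian around 0; noise in the dynamics (equation 8.7) bifurcates.
ends = np.array([endpoint(0.5) for _ in range(200)])
```

The empirical distribution of `ends` is bimodal, clustered near the two wells, while the deterministic trajectory stays pinned at the origin.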
Figure 3: An illustration of the different effects of adding noise to the output of a deterministic network versus adding it to the network dynamics. See the text for an explanation.
This results in a much richer and more realistic likelihood model. Due to the probabilistic nature of the state transitions, the approach is resistant to time warping, for the same reason hidden Markov models are. In addition, the approach can also handle bifurcations. Figure 3 illustrates this point with a simple example. The figure shows a single-neuron model in which the drift is proportional to an energy function with two wells. If we let the network start at the origin and there is no internal noise, the activation of the neuron will be constant, since the drift at the origin is zero. Adding white noise to this constant activation will result in a unimodal gaussian distribution centered at the origin. However, if we add noise to the activation dynamics, the neuron will bifurcate, sometimes moving toward the left well, sometimes toward the right well. This will result in a bimodal distribution of activations.

Diffusion networks were originally formulated in Movellan and McClelland (1993). Movellan (1994) presented an algorithm to train diffusion networks to approximate expected values of sequence distributions. Movellan (1998) presented algorithms for training equilibrium probability densities. Movellan and McClelland (2001) showed the relationship between diffusion networks and classical psychophysical models of information integration. Mineiro et al. (1998) presented an algorithm for sequence density estimation with diffusion networks. In that article, we used a gradient-descent approach instead of the EM approach presented here, and we sampled from
the reference distribution R_h instead of the distribution of hidden states with clamped observations. These two modifications greatly increased learning speed and made possible the use of diffusion networks on realistic problems involving thousands of parameters.

8.2 Stochastic Filtering. Our work is also closely related to the literature on stochastic filtering and systems identification. As mentioned in section 1.3, if the logistic activation function φ is replaced by a linear function, the weights between the observable units and from the observable to the hidden units are set to zero, the ρ terms are set to zero, and the probability distribution of the initial states is constrained to be gaussian, then diffusion networks have the same dynamics as the continuous-time Kalman-Bucy filter (Kalman & Bucy, 1961). While the usefulness of the Kalman-Bucy filter is widely recognized, its limitations (due to the linear and gaussian assumptions) have become clear. For this reason, many extensions of the Kalman filter have been proposed (Lewis, 1986; Fahrmeir, 1992; Meinhold & Singpurwalla, 1975; Sage & Melsa, 1971; Kitagawa, 1996). This article contributes to that literature by proposing a new approach for training partially observable SDE models. An important difference between our work and stochastic filtering approaches is that stochastic filtering is generally restricted to models of the following form:

$$dH(t)=\mu_h(H(t))\,dt+\sigma\,dB_h(t),\tag{8.8}$$

$$dO(t)=\mu_o(H(t))\,dt+\sigma\,dB_o(t).\tag{8.9}$$
In our case, this restriction would require setting to zero the coupling between observable units and the feedback coupling from observable units to hidden units. While this restriction simplifies the mathematics of the problem, having a general algorithm that does not require it is important for the following reasons: (1) such a restriction is not commonly used in the neural network literature; (2) when modeling biological neural networks, such a restriction is not realistic; (3) in many physical processes, the observations are coupled by inertia and dampening processes (e.g., bone and muscles with large dampening properties are involved in the production of observed motor sequences); and (4) in practice, the existence of connections between observable units results in a much richer set of distribution models.

Iterative solutions for estimation of drift parameters of partially observable SDE models are well known in the statistics and stochastic filtering literatures. However, most current approaches are very expensive computationally, do not allow training fully coupled systems, and have been used only for problems with very few parameters. Campillo and Le Gland (1989) present an approach that involves numerical solution of stochastic partial differential equations. The approach has been applied to models with only a handful of parameters. Shumway and Stoffer (1982) present an exact EM approach for discrete-time linear systems; however, the approach does not
generalize to nonlinear systems. Ljung (1999) presents a general approach for discrete-time systems with nonlinear drifts. However, the approach requires the inversion of very large matrices (on the order of the length of the training sequence times the number of states). Ghahramani and Roweis (1999) present an ingenious method to learn discrete-time dynamical models with arbitrary drifts. The approach approximates nonlinear drift functions using gaussian radial basis functions. While promising, the approach has been shown to work on only a single-parameter problem. Our work is closely related to Kitagawa (1996), the work by Blake and colleagues (North & Blake, 1998; Blake, North, & Isard, 1999), and Solo (2000). Kitagawa (1996) presents a discrete-time Monte Carlo method for general-purpose nonlinear filtering and smoothing. While Kitagawa did not experiment with maximum likelihood estimation of drift parameters, he mentioned the possibility of doing so. Blake et al. (1999) independently presented a Monte Carlo EM algorithm at about the same time we published our earlier work on diffusion networks (Mineiro et al., 1998). The main differences between their approach and ours are as follows: (1) they work in discrete time, while we work in continuous time; (2) they require that there be no coupling between observation units and no feedback coupling from observations to hidden units, whereas we do not have such requirements; and (3) they sample from an approximation to the distribution of hidden units given observable paths, as proposed by Kitagawa (1996). Instead, we sample from the distribution of hidden units with clamped observations and use corrections to obtain unbiased estimates of the likelihood gradients. Solo (2000) presents a simulation method for generating approximate likelihood functions for partially observable SDE models and for finding approximate maximum likelihood estimators.
His approach can be seen as a discrete-time version of the method we independently proposed in Mineiro et al. (1998). The approach we propose here generalizes those of Mineiro et al. (1998) and Solo (2000): we incorporate the use of importance sampling in continuous time and do not need to restrict the observation units to be uncoupled from each other or the hidden units to be uncoupled from the observation units.
9 Simulations

Sample source code for the simulations presented here can be found at J. R. M.'s Web site (mplab.ucsd.edu). In practice, lacking hardware implementations of diffusion networks, we model them on digital computers using discrete-time approximations. While more sophisticated techniques are available for simulating SDEs in digital computers (Kloeden & Platen, 1992; Karandikar, 1995), we opted to start with simple first-order Euler approximations.
Consider the term L̂_o^λ(x_o) in equation 5.3. It requires, among other things, computing integrals of the form

$$\log L^{\lambda}(x)=\frac{1}{\sigma^2}\int_0^T\mu(x(t),\lambda)\cdot dx(t)-\frac{1}{2\sigma^2}\int_0^T|\mu(x(t),\lambda)|^2\,dt,\quad\text{for } x\in\Omega.\tag{9.1}$$
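Such a stochastic integral has a direct numpy transcription once the path is sampled on a grid; the sketch below uses the first-order discretization adopted in this section, with an illustrative function name:

```python
import numpy as np

def discrete_log_likelihood(x, mu, sigma, dt):
    """First-order approximation of equation 9.1 on a sampling grid:
    (1/sigma^2) sum_k mu(x_k) . (x_{k+1} - x_k)
      - (1/(2 sigma^2)) sum_k |mu(x_k)|^2 dt.

    x : array (s+1, n), sampled path; mu : state -> drift vector
    """
    drifts = np.array([mu(xk) for xk in x[:-1]])
    dx = np.diff(x, axis=0)
    return (np.sum(drifts * dx) - 0.5 * np.sum(drifts**2) * dt) / sigma**2
```

Note that the drift is evaluated at the left endpoint of each interval, consistent with the Itô interpretation of the integral.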
In practice, we approximate such integrals using the following sum:

$$\frac{1}{\sigma^2}\sum_{k=0}^{s-1}\mu(x(t_k),\lambda)\cdot\left(x(t_{k+1})-x(t_k)\right)-\frac{1}{2\sigma^2}\sum_{k=0}^{s-1}|\mu(x(t_k),\lambda)|^2\,\Delta t,\tag{9.2}$$
where 0 = t_0 < t_1 < ⋯ < t_s = T are the sampling times, t_{k+1} = t_k + Δt, and Δt > 0 is the sampling period. The Monte Carlo approach proposed here requires generating sample hidden paths {h^l}_{l=1}^m from a network with the observable units clamped to the path x_o. We obtain these sample paths by simulating a diffusion network in discrete time as follows:

$$H(t_{k+1})=H(t_k)+\mu_h(x_o(t_k),H(t_k),\lambda)\,\Delta t+\sigma Z_k\sqrt{\Delta t},\qquad H(0)\sim\nu_h,\tag{9.3}$$

where Z_1, …, Z_s are independent (n − d)-dimensional gaussian random vectors with zero mean and covariance equal to the identity matrix.

9.1 Example 1: Learning to Bifurcate. This simulation illustrates the capacity of diffusion networks with nonlinear activation functions to learn bifurcations. This property is important for practical applications because it allows these networks to entertain multiple distinct hypotheses about the state of the world. Standard deterministic neural networks and linear stochastic networks like Kalman filters cannot learn bifurcations.

A diffusion network with an observable unit and a hidden unit was trained on the distribution of paths shown in Figure 4. The top of the figure shows there were four equally probable desired paths, organized as a double bifurcation. The network was trained for 60 iterations of the EM algorithm, with 10 sample hidden paths per iteration. The bottom row shows the distribution learned by a standard recurrent neural network. As expected, that network learned the average path: a constant zero output. The trajectories displayed in the figure were produced by estimating the variance of the desired outputs from the learned trajectory and adding gaussian noise of that estimated variance to the output learned by the network.

The second row of Figure 4 shows 48 sample paths produced by the diffusion network after training. While not perfect, the network learned to
Figure 4: (Top row) The double bifurcation defines a desired path distribution with four equally probable paths. (Second row) Observable unit sequences obtained after training a diffusion network. (Third row) Hidden unit sequences of the trained diffusion network. (Bottom row) Sequences obtained after training a standard recurrent neural network and adding an optimal amount of noise to the output of that network.
bifurcate twice and produced a reasonable approximation to the desired distribution. We were surprised that this problem was learnable with a single hidden unit, until we saw the ingenious solution learned by the network (see Table 1). Basically, the network learned to toss two binary random variables sequentially and compute their sum. The observable unit learned a relatively small time constant and, if disconnected from the hidden unit, it bifurcated once, with each branch taking approximately equal probability. The hidden unit had a faster time constant and learned to bifurcate unaffected by the observable unit. Thus, the hidden unit basically behaved as a Bernoulli random bias for the observable unit. The result was a double bifurcation of the observable unit, as desired. Figure 5 shows four example paths for the observable and hidden units.

9.2 Example 2: Learning a Beating Heart. In this section, we show the capacity of diffusion networks to learn natural oscillations. The task was
Figure 5: Four sample paths learned in the bifurcation problem. Solid lines represent observable unit paths, and dashed lines represent hidden unit paths.

Table 1: Parameters Learned for the Double Bifurcation Problem.

w_oo = 4.339    w_oh = −0.008    ξ_o = −0.005    κ_o = 0.429
w_ho = 2.303    w_hh = 0.5912    ξ_h = 0.000     κ_h = 1.175

Notes: All the weights and biases were initialized to zero and the κ terms to 1. The fixed parameters were set to the following values: h_1 = −1, h_2 = 2, h_3 = 7, σ = 0.2.
learning the expansion and contraction sequence of a beating heart. Once a distribution of normal sequences is learned, the network can be used for tracking sequences of moving heart images (Jacob, Noble, & Blake, 1998) or to detect the presence of irregularities. Jacob et al. (1998) tracked the contour of a beating heart from a sequence of ultrasound images and applied principal component analysis (PCA) to the resulting sequence of contours. Figure 6 shows the first principal component coefficient as a function of time for a typical sequence of contours. North and Blake (1998) compared two discrete-time models of the heart expansion sequence: (1) a linear model with time delays and no hidden units and (2) a linear model with time delays and hidden units. The first model was defined
by the following stochastic difference equation,

$$O(t+1)=O(t)+\lambda_1+\lambda_2 O(t)+\lambda_3 O(t-1)+s_1 Z_1(t),\tag{9.4}$$

where O represents the observed heart expansion, Z_1(1), Z_1(2), … is a sequence of independent and identically distributed gaussian random variables with zero mean and unit variance, and λ_1, λ_2, λ_3 ∈ R, s_1 > 0 are adaptive parameters. The second model had the following form,

$$H(t+1)=H(t)+\lambda_1+\lambda_2 H(t)+\lambda_3 H(t-1)+s_1 Z_1(t),\tag{9.5}$$

$$O(t)=H(t)+s_2 Z_2(t),\tag{9.6}$$
where Z_2(1), Z_2(2), … is a sequence of zero-mean, unit-variance, independent gaussian random variables, and s_2 > 0 is also adaptive. The two models were trained on the heart expansion sequence displayed in Figure 6. After training, the two models were run with s = 0 to see the kind of process they had learned. The two linear models learned to oscillate at the correct frequency, but their motion was too damped (see Figure 6). It should be noted that both models are perfectly capable of learning undampened oscillations; they just could not learn this particular kind of oscillation.

Unfortunately, the original heart expansion data are not available. Instead, we recovered the data by digitizing a figure from the original article and automatically sampling it at 872 points. We tried the two linear systems on these data and obtained results identical to those reported in Jacob et al. (1998), so we feel the digitization process worked well. We then trained a diffusion network with one observation unit and one hidden unit. We did not include time delays and instead allowed the network to develop its own delayed representations by fully coupling the observation and hidden units. The network was trained using 60 iterations of the stochastic EM algorithm, with 10 hidden samples per iteration. Table 2 shows the parameters learned by the network. Figure 6 shows a typical sample path produced by the trained network. Note that the network did not exhibit the dampening problem found with the linear models. We also tried a diffusion network with one observation unit and one hidden unit, but with linear activation functions instead of logistic. The result was again a very dampened process, like the one depicted in Figure 6. It appears that the saturating nonlinearity of diffusion networks was beneficial for learning this task.

9.3 Example 3: Learning to Read Lips.
In this section, we report on the use of diffusion networks with thousands of parameters for a sequence classification task involving a body of realistic data. We compare a diffusion network approach with the best hidden Markov model approaches published in the literature for this task. The question at hand is whether diffusion networks may be a viable alternative to hidden Markov models for sequence recognition problems. The main difference between diffusion networks and
A Monte Carlo EM Approach
1535
Figure 6: (Top left) The evolution of the first PCA coefficient for a beating heart. (Top right) The path learned by a second-order linear model with no hidden units. (Bottom left) The path learned by a second-order linear model with hidden units. (Bottom right) Typical path learned by a diffusion network.
Table 2: Parameters Learned for the Heart Beat Problem.

w_oo = 1.280    w_ho = −1.218
w_oh = 0.673    w_hh = 0.088
ξ_o = −0.143    ξ_h = 0.001
κ_o = 0.709     κ_h = 1.087

Notes: All the weights and biases were initialized to zero and the κ terms to 1. The fixed parameters were set to the following values: h1 = −1, h2 = 2, h3 = 7, σ = 0.2.
hidden Markov models is the nature of the hidden states: diffusion networks use continuous-state representations, while hidden Markov models use discrete-state representations.[15] It is possible that continuous-state representations may be beneficial for modeling some natural sequences.

[15] While HMMs can use continuous observations, the hidden states are always discrete.
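This distinction can be made concrete with a toy sketch; every parameter, matrix, and function name below is illustrative and not taken from the article. A hidden Markov model's hidden state jumps among finitely many values according to a transition matrix, while a diffusion-style hidden state follows a continuous noisy path:

```python
import random

def sample_hmm_states(T, P, s0=0, seed=0):
    """Discrete hidden states: jump among finitely many values per matrix P."""
    rng = random.Random(seed)
    s, states = s0, [s0]
    for _ in range(T):
        u, acc = rng.random(), 0.0
        for j, p in enumerate(P[s]):
            acc += p
            if u < acc:
                s = j
                break
        states.append(s)
    return states

def sample_diffusion_state(T, dt=0.1, sigma=0.3, seed=0):
    """Continuous hidden state: a mean-reverting diffusion, Euler steps."""
    rng = random.Random(seed)
    x, path = 0.0, [0.0]
    for _ in range(T):
        x += -x * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path
```

The discrete chain visits only states {0, 1, ...}; the diffusion path takes arbitrary real values and changes only by small increments per step.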
Our approach to sequence recognition using diffusion networks is similar to the approach used in the hidden Markov model literature. First, several diffusion networks are independently trained with samples of sequences from each of the categories at hand. For example, if we want to discriminate between c categories of image sequences, we would first train c different diffusion networks. The first network would be trained with examples of category 1, the second network with examples of category 2, and the last network with examples of category c. This training process results in c values of the parameter λ, each of which has been optimized to represent a different category. We represent these values as λ*_1, ..., λ*_c. Once the networks are trained, we can classify a new observed sequence x_o as follows: we compute log L̂_o^{λ*_i}(x_o) for i = 1, ..., c. These log likelihoods are combined with the log prior probability of each category, and the most probable category of the sequence is chosen.

9.3.1 Training Database. We used Tulips1 (Movellan, 1995), a database consisting of 96 movies of nine male and three female undergraduate students from the Cognitive Science Department at the University of California, San Diego. For each student, two sample utterances were taken for each of the digits "one" through "four" (see Figure 7). The database is challenging due to variability in illumination, gender, ethnicity of the subjects, and position and orientation of the lips. The database is available at J. R. M.'s Web site.

9.3.2 Visual Processing. We used a 2 × 2 factorial experimental design to explore the performance of two different image processing techniques (contours and contours plus intensity) in combination with two different recognition engines (hidden Markov models and diffusion networks). The image processing was performed by Luettin, Thacker, and Beet (1996a, 1996b).
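The decision procedure just described (score a new sequence under each category's trained network, add the log prior, and pick the most probable category) can be sketched as follows; the per-category log-likelihood estimators are hypothetical stand-ins for trained diffusion networks:

```python
def classify(x_o, log_likelihoods, log_priors):
    """Return the index of the most probable category for sequence x_o.

    log_likelihoods: c callables; the i-th estimates the log likelihood
    of x_o under the network trained on category i (parameters lambda*_i).
    log_priors: c log prior probabilities, one per category.
    """
    scores = [ll(x_o) + lp for ll, lp in zip(log_likelihoods, log_priors)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal priors, the prior term is the same constant for every category, and the rule reduces to the plain arg-max over estimated log likelihoods used later in equation 9.8.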
They employ point density models, where each lip contour is represented by a set of points; in this case, both the inner and outer lip contours are represented, corresponding to Luettin's double contour model (see Figure 7). The dimensionality of the representation of the contours was reduced using principal component analysis. For the work presented here, 10 principal components were used to approximate the contour, along with a scale parameter that measured the pixel distance between the mouth corners; associated with each of these 11 components was a corresponding delta component. The value of the delta component for the frame sampled at time t_k equals the value of the original component at that time minus the value of the original component at the previous sampling time t_{k−1} (these delta components are defined to be zero at t_0). In this manner, 22 components were used to represent lip contour information for each still frame. These 22 components were represented using diffusion networks with 22 observation units, one per component.
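The delta components described above are straightforward to compute. A minimal sketch follows; the function name and the list-of-lists frame format are illustrative choices, not part of the original pipeline:

```python
def append_delta_components(frames):
    """Append delta components to each frame's feature vector.

    frames: list of per-frame component vectors (e.g., 11 contour values).
    The delta for frame t_k is component(t_k) - component(t_{k-1}),
    defined to be zero for the first frame (t_0).
    """
    out, prev = [], None
    for f in frames:
        delta = [0.0] * len(f) if prev is None else [a - b for a, b in zip(f, prev)]
        out.append(list(f) + delta)  # original components, then deltas
        prev = f
    return out
```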
Figure 7: Example images from the Tulips1 database.
We also tested the performance of a representation that used intensity information in addition to contour shape information. We obtained Luettin et al.'s (1996b) representations in which gray-level values are sampled along directions perpendicular to the lip contour. These local gray-level values are then concatenated to form a single global intensity vector that is compressed using the first 20 principal components. There were 20 associated delta components, for a total of 40 components of intensity information per still frame. These 40 components were concatenated with the 22 contour components, for a total of 62 components per still frame. These 62 components were represented using diffusion networks with 62 observation units, one per component. Thus, the total number of weight parameters was (62 + (n − d))², where n − d is the number of hidden units. We tested networks with up to 100 hidden units—26,244 parameters.

9.3.3 Training. We independently trained four diffusion networks to approximate the distributions of lip-contour trajectories of each of the four words to be recognized; the first network was trained with examples of the word one and the last network with examples of the word four. Each network had the same number of nodes, and the drift of each network was given by equation 1.2 with κ = 1, ρ = 0, and a hyperbolic tangent activation function—h1 = −1, h2 = 2, h3 = 1 in equation 1.3. The connectivity matrix w
and the bias parameters ξ were adaptive. The initial state of the hidden units was set to (1, ..., 1)′ with probability 1, and σ was set to 1 for all networks. The diffusion network dynamics were simulated using the forward Euler technique described in previous sections. In our simulations, we set Δt = 1/30 seconds, the time between video frame samples. Each diffusion network was trained with examples of one of the four digits using the following cost function (see equation 4.2),

W(λ) = Σ_i log L̂_o^λ(x_o^i) − Ψ(λ),  (9.7)
where each x_o^i is a movie from the Tulips1 database—a sample from the desired empirical distribution P_o, and Ψ(λ) = (1/2) α|λ|² for α > 0. Several values of α were tried, and performance is reported with the optimal values. Here, Ψ(λ) acts as a gaussian prior on the network parameters. Training was done using 20 hidden sample paths per observed path. These paths were sampled using the "teacher forcing" approach described in equations 5.8 through 5.10.

9.3.4 Results. The bank of diffusion networks was evaluated in terms of generalization to new speakers. Since the database is small, generalization performance was estimated using a jackknife procedure (Efron, 1982). The four models (one for each digit) were trained with labeled data from 11 subjects, leaving a subject out for generalization testing. Percentage correct generalization was then tested using the decision rule

D(x_o) = arg max_{i ∈ {1,2,3,4}} log L̂_o^{λ*_i}(x_o),  (9.8)
where log L̂_o^{λ*_i} is the estimate of the log likelihood of the test sequence evaluated at the optimal parameters λ*_i found by training on examples of digit i. This rule corresponds to assuming equal priors for each of the four categories under consideration. The entire procedure was repeated 12 times, each time leaving a different subject out for testing, for a total of 96 generalization trials (4 digits × 12 subjects × 2 observations per subject). This procedure mimics that used by Luettin et al. (1996a, 1996b; Luettin, 1997) to test hidden Markov model architectures. We tested performance using a variety of architectures, some including no hidden units and some with up to 100 hidden units. Best generalization performance was obtained using four hidden units. These generalization results are shown in Table 3. The hidden Markov model results are those reported in Luettin et al. (1996a, 1996b) and Luettin (1997). The only difference between their approach and ours is the recognition engine, which is a bank of hidden Markov models in their case and a bank of diffusion networks in our case. The image representations mentioned above were optimized
Table 3: Generalization Performance on the Tulips1 Database.

Approach                                                   Correct Generalization
Best HMM, shape information only                           82.3%
Best diffusion network, shape information only             85.4%
Untrained human subjects                                   89.9%
Best HMM, shape and intensity information                  90.6%
Best diffusion network, shape and intensity information    91.7%
Trained human subjects                                     95.5%
Notes: Shown in order are the performance of the best-performing HMM from Luettin et al. (1996a, 1996b) and Luettin (1997), which uses only shape information; the best diffusion network obtained using only shape information; the performance of untrained human subjects (Movellan, 1995); the HMM from Luettin's thesis (Luettin, 1997), which uses both shape and intensity information; the best diffusion network obtained using both shape and intensity information; and the performance of trained human lip readers (Movellan, 1995).
by Luettin et al. to work with hidden Markov models. They also tried a variety of hidden Markov model architectures and reported the best results obtained with them. In all cases, the best diffusion networks outperformed the best hidden Markov models reported in the literature using exactly the same visual preprocessing. The results show that diffusion networks may outperform hidden Markov model approaches in sequence recognition tasks. While these results are very promising, caution should be exercised since the database is relatively small. More work is needed with larger databases.

10 Conclusion

We think the main reason that recurrent neural networks have not worked well in practical applications is that they rely on a very simplistic likelihood model: white noise added to a deterministic sequence. This model cannot cope well with known issues in natural sequences like bifurcations and dynamic time warping. We proposed adding noise to the activation dynamics of recurrent neural networks. The resulting models, which we called diffusion networks, can handle bifurcations and dynamic time warping. We presented a general framework for learning path probability densities using continuous stochastic models and then applied the approach to train diffusion neural networks. The approach allows training networks with hidden units, nonlinear activation functions, and arbitrary connectivity. Interestingly, the gradient of the likelihood function can be computed using Hebbian coactivation statistics and does not require backpropagation of error signals. Our work was inspired by the rich literature on continuous stochastic filtering and recurrent neural networks. The idea was to combine the versatility of recurrent neural networks and the well-known advantages of stochastic modeling approaches. The continuous-time nature of the networks is convenient for data with dropouts or variable sample rates, since the models we use define all the finite-dimensional distributions. The continuous-state representation is well suited to problems involving continuous unobservable quantities, as in visual tracking tasks. In particular, the diffusion approach may be advantageous for problems in which continuity and sparseness constraints are useful. Diffusion networks naturally enforce a sparse state transition matrix (there is an infinite number of states, and given any state, there is a very small volume of states to move into with nonnegligible probability). It is well known that enforcing sparseness in state transition matrices is beneficial for many sequence recognition problems (Brand, 1998). Diffusion networks naturally enforce continuity constraints in the observable paths, and thus they may not have the well-known problems encountered when hidden Markov models are used as generative models of sequences (Rabiner & Juang, 1993). We presented simulation results on realistic sequence modeling and sequence recognition tasks with very encouraging results. While the results were highly encouraging, the databases used were relatively small, and thus the current results should be considered only as exploratory. It should also be noted that there are situations in which hidden Markov models will be preferable to diffusions. For example, it is relatively easy to compose hierarchies of simple trainable hidden Markov models, capturing acoustic, syntactic, and semantic constraints. At this point, we have not explored how to compose trainable hierarchies of diffusion networks. Another disadvantage of diffusion networks relative to conventional hidden Markov models is training speed, which is significantly slower for diffusion networks than for hidden Markov models.
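The idea of "noise added to the activation dynamics" can be sketched with a forward Euler (Euler–Maruyama) integration step. The leaky-tanh drift below is an illustrative stand-in for the drift of equation 1.2, and the two-unit network and its weights are arbitrary choices, not trained values:

```python
import math
import random

def euler_maruyama_path(w, xi, x0, dt=1.0 / 30, sigma=0.2, steps=100, seed=0):
    """Simulate dX = b(X) dt + sigma dB with forward Euler steps.

    Illustrative leaky-tanh drift: b_i(x) = -x_i + tanh(xi_i + sum_j w_ij x_j).
    """
    rng = random.Random(seed)
    x = list(x0)
    path = [list(x)]
    for _ in range(steps):
        drift = [-x[i] + math.tanh(xi[i] + sum(w[i][j] * x[j] for j in range(len(x))))
                 for i in range(len(x))]
        # Brownian increment over dt has standard deviation sqrt(dt).
        x = [x[i] + drift[i] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
             for i in range(len(x))]
        path.append(list(x))
    return path

# One observation unit and one hidden unit, fully coupled (arbitrary weights).
w = [[0.5, -1.2], [0.7, 0.1]]
xi = [-0.1, 0.0]
path = euler_maruyama_path(w, xi, x0=[1.0, 1.0])
```

With sigma = 0 this reduces to a deterministic recurrent network; with sigma > 0 each run produces a different continuous sample path, which is what the likelihood model is defined over.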
However, once a network was trained, the computation of the density functions needed in recognition was fast and could be done in real time. Significant work remains to be done. In this article, we assume that the data are a continuous stochastic process. However, in many applications, the data take the form of discrete-time samples from a continuous-time process. In this article, we approached this issue by using simple discrete-time Euler approximations to stochastic integrals. In the future, we plan to investigate alternative approximations that allow for path-wise convergence, such as that in Karandikar (1995). The theoretical properties of the stochastic EM procedure when combined with discrete-time approximations of stochastic integrals remain to be investigated. In this article, we did not consider dispersions that are a function of the state, and we have not optimized the dispersion and the initial distribution of hidden states. Optimization of the initial distribution of hidden states is easy but unnecessarily obscures the presentation of the main issues in the article. Optimization of the dispersion is easy in discrete-time systems but presents mathematical challenges in continuous-time systems; thus, we have deferred the issue for future work. Theoretical issues concerning the approximating power of diffusion
networks need to be explored. In deterministic neural networks, it is known that with certain choices of activation function and sufficiently many hidden units, neural networks can approximate a large set of functions with arbitrary accuracy (Hornik, Stinchcombe, & White, 1989). An analogous result for diffusion networks stating the class of distributions that can be approximated arbitrarily closely would be useful. It is also important to compare the properties of the algorithm presented here with alternative learning algorithms for partially observable stochastic processes that could also be applied to diffusion networks (Campillo & Le Gland, 1989; Blake et al., 1999). Another aspect of theoretical interest is the potential connection between diffusion networks and recent kernel-based approaches to learning. In particular, notice that the learning rule developed in this article depends on a matrix of inner products in the Hilbert space of square-integrable functions, as defined in equation 7.8. This may allow application of standard kernel methods (Aizerman et al., 1964; Burges, 1998) to the problem of learning distributions of sequences. We are currently exploring applications of diffusion networks to real-time stochastic filtering problems (face tracking) and sequence-generation problems (face animation). Our work shows that diffusion networks may be a feasible alternative to hidden Markov models for problems in which state continuity and sparseness in the state transition distributions are advantageous. The results obtained for the visual speech recognition task are encouraging and reinforce the possibility that diffusion networks may become a versatile general-purpose tool for a wide variety of continuous-signal processing tasks.

Acknowledgments

We thank Anthony Gamst, Ian Fasel, Tim Kalman Marks, and David Groppe for helpful discussions and Juergen Luettin for generously providing lip contour data and technical assistance. The research of R. J. W.
was supported in part by NSF grant DMS 9703891. J. R. M. was supported in part by the NSF under grant 0086107.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Aizerman, M. A., Braverman, E. M., & Rozoner, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 15, 821–837.
Billingsley, P. (1995). Probability and measure. New York: Wiley.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blake, A., North, B., & Isard, M. (1999). Learning multi-class dynamics. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 389–395). Cambridge, MA: MIT Press.
Brand, M. (1998). Pattern discovery via entropy minimization. In D. Heckerman & J. Whittaker (Eds.), Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics. San Mateo, CA: Morgan Kaufmann.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.
Campillo, F., & Le Gland, F. (1989). MLE for partially observed diffusions—direct maximization vs. the EM algorithm. Stochastic Processes and Their Applications, 33(2), 245–274.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc., 39, 1–38.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia: SIAM.
Fahrmeir, L. (1992). Posterior mode estimation by extended Kalman filter for multivariate dynamic generalized linear models. Journal of the American Statistical Association, 87, 501–509.
Fishman, G. S. (1996). Monte Carlo sampling: Concepts, algorithms, and applications. New York: Springer-Verlag.
Ghahramani, Z., & Roweis, S. T. (1999). Learning nonlinear dynamical systems using an EM algorithm. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Science, 81, 3088–3092.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Jacob, G., Noble, A., & Blake, A. (1998). Robust contour tracking of echocardiographic sequences. In Proc. 6th Int. Conf. on Computer Vision (pp. 408–413). Bombay: IEEE Computer Society Press.
Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction theory. Transactions ASME J. of Basic Eng., 83, 95–108.
Karandikar, R. L. (1995). On pathwise stochastic integration. Stochastic Processes and Their Applications, 57, 11–18.
Karatzas, I., & Shreve, S. E. (1991). Brownian motion and stochastic calculus. New York: Springer-Verlag.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1), 1–25.
Kloeden, P. E., & Platen, E. (1992). Numerical solutions to stochastic differential equations. Berlin: Springer-Verlag.
Levanony, D., Shwartz, A., & Zeitouni, O. (1990). Continuous-time recursive estimation. In E. Arikan (Ed.), Communication, control, and signal processing. Amsterdam: Elsevier.
Lewis, F. L. (1986). Optimal estimation—with an introduction to stochastic control theory. New York: Wiley.
Ljung, L. (1999). System identification: Theory for the user. Upper Saddle River, NJ: Prentice-Hall.
Luettin, J. (1997). Visual speech and speaker recognition. Unpublished doctoral dissertation, University of Sheffield.
Luettin, J., Thacker, N., & Beet, S. (1996a). Statistical lip modelling for visual speech recognition. In Proceedings of the VIII European Signal Processing Conference. Trieste, Italy.
Luettin, J., Thacker, N. A., & Beet, S. W. (1996b). Speechreading using shape and intensity information. In Proceedings of the International Conference on Spoken Language Processing. Philadelphia: IEEE.
McClelland, J. L. (1993). Toward a theory of information processing in graded, random, and interactive networks. In D. E. Meyer & S. Kornblum (Eds.), Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience (pp. 655–688). Cambridge, MA: MIT Press.
Meinhold, R. J., & Singpurwalla, N. D. (1975). Robustification of Kalman filter models. IEEE Transactions on Automatic Control, 84, 479–486.
Mineiro, P., Movellan, J. R., & Williams, R. J. (1998). Learning path distributions using nonequilibrium diffusion networks. In M. Kearns (Ed.), Advances in neural information processing systems, 10 (pp. 597–599). Cambridge, MA: MIT Press.
Movellan, J. (1994). A local algorithm to learn trajectories with stochastic neural networks. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 83–87). San Mateo, CA: Morgan Kaufmann.
Movellan, J. (1995). Visual speech recognition with stochastic neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Movellan, J. R. (1998). A learning theorem for networks at detailed stochastic equilibrium. Neural Computation, 10(5), 1157–1178.
Movellan, J., & McClelland, J. L. (1993). Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17, 463–496.
Movellan, J. R., & McClelland, J. L. (2001). The Morton-Massaro law of information integration: Implications for models of perception. Psychological Review, 108(1), 113–148.
North, B., & Blake, A. (1998). Learning dynamical models by expectation maximization. In Proc. 6th Int. Conf. on Computer Vision (pp. 911–916). Bombay: IEEE Computer Society Press.
Oksendal, B. (1998). Stochastic differential equations: An introduction with applications (5th ed.). Berlin: Springer-Verlag.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212–1228.
Poor, H. V. (1994). An introduction to signal detection and estimation. Berlin: Springer-Verlag.
Protter, P. (1990). Stochastic integration and differential equations. Berlin: Springer-Verlag.
Rabiner, L. R., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ: Prentice Hall.
Sage, A. P., & Melsa, J. L. (1971). Estimation theory with application to communication and control. New York: McGraw-Hill.
Shumway, R., & Stoffer, D. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3, 253–265.
Solo, V. (2000). Unobserved Monte-Carlo method for identification of partially observed nonlinear state space systems, Part II: Counting process observations. In Proc. IEEE Conference on Decision and Control. Sydney, Australia.
White, H. (1996). Estimation, inference and specification analysis. Cambridge: Cambridge University Press.
Zipser, D., Kehoe, B., Littlewort, G., & Fuster, J. (1993). A spiking network model of short-term active memory. Journal of Neuroscience, 13(8), 3406–3420.

Received July 2, 1999; accepted November 15, 2001.
NOTE
Communicated by Klaus Obermayer
Are Visual Cortex Maps Optimized for Coverage?

Miguel A. Carreira-Perpiñán
[email protected]

Geoffrey J. Goodhill
[email protected]

Department of Neuroscience, Georgetown University Medical Center, Washington, D.C. 20007, U.S.A.
The elegant regularity of maps of variables such as ocular dominance, orientation, and spatial frequency in primary visual cortex has prompted many people to suggest that their structure could be explained by an optimization principle. Up to now, the standard way to test this hypothesis has been to generate artificial maps by optimizing a hypothesized objective function and then to compare these artificial maps with real maps using a variety of quantitative criteria. If the artificial maps are similar to the real maps, this provides some evidence that the real cortex may be optimizing a similar function to the one hypothesized. Recently, a more direct method has been proposed for testing whether real maps represent local optima of an objective function (Swindale, Shoham, Grinvald, Bonhoeffer, & Hübener, 2000). In this approach, the value of the hypothesized function is calculated for a real map, and then the real map is perturbed in certain ways and the function recalculated. If each of these perturbations leads to a worsening of the function, it is tempting to conclude that the real map is quite likely to represent a local optimum of that function. In this article, we argue that such perturbation results provide only weak evidence in favor of the optimization hypothesis.
1 Introduction

Neurons in visual cortex respond to several kinds of visual stimuli, the best studied of which include position in visual field, eye of origin, and orientation, direction, and spatial frequency of a grating. The pattern of preferred stimulus values over the whole visual cortex for each kind of stimulus is called a (visual) cortical map. Thus, maps of visual field position, ocular dominance, orientation, and so forth coexist on the same neural substrate. Given that these maps show a highly organised spatial structure, the question arises of what underlying principles explain these maps. Two such principles are coverage uniformity, or completeness, and continuity, or similarity (Hubel & Wiesel, 1977). Coverage uniformity means that each combination

Neural Computation 14, 1545–1560 (2002) © 2002 Massachusetts Institute of Technology
of stimuli values (e.g., any orientation in any visual field location of either eye) has equal representation in the cortex; completeness means that any combination of stimuli values is represented somewhere in cortex. Thus, coverage uniformity implies completeness (disregarding the trivial case of a cortex uniformly nonresponsive to stimuli), but not vice versa, since it is possible to have over- and underrepresented stimuli values (in addition, it is not practically possible to represent all values of a continuous higher-dimensional stimulus space with a continuous two-dimensional cortex). A useful middle ground is to consider that the set of stimulus values represented by the cortex be roughly uniformly scattered in stimulus space. A common qualitative definition of continuity is that neurons that are physically close in cortex tend to have similar stimulus preferences; this can be motivated in terms of economy of cortical wiring (Durbin & Mitchison, 1990). Coverage and continuity compete with each other. If, say, retinotopy and preferred orientation vary slowly from neuron to neuron, sizable visual field regions will lack some orientations. If neurons' preferred stimuli values are scattered like a salt and pepper mixture, continuity is lost. The striped structure of several of the maps can be seen as a compromise between these two extremes. An early model based on these principles is the ice cube model of Hubel and Wiesel (1977), where stripes of ocular dominance run orthogonally to stripes of orientation and all combinations of eye and orientation preference are represented within a cortical region smaller than a cortical point image (the collection of neurons whose receptive fields contain a given visual field location). The competition can be explained in a dimension-reduction framework, where a two-dimensional cortical sheet twists in a higher-dimensional stimulus space to cover it as uniformly as possible while minimizing some measure of continuity.
Optimization models based on such principles produce maps with a quantitatively good match to the observed phenomenology of cortical maps, including the striped structure of ocular dominance and orientation columns with appropriate periodicity and interrelations (Erwin, Obermayer, & Schulten, 1995; Swindale, 1996). However, a more direct approach to test the validity of such optimization models would be to calculate the value of the objective function for a real cortical map and then determine by perturbation whether this represents a local optimum. Such an approach has recently been proposed by Swindale, Shoham, Grinvald, Bonhoeffer, and Hübener (2000). Although the results they presented are consistent with the hypothesis that real maps are optimized for a particular function measuring coverage, here we argue that these results offer only weak evidence in favor of the hypothesis.

2 The Coverage Measure

Consider a resolution-dependent representation of a cortical map defined as a two-dimensional array of vector values of the stimulus variables of interest. Each position (i, j) in the array represents an ideal cortical cell;
call C the set of all such cortical positions. There is a vector of stimulus values μ_ij associated with each cortical position (i, j); stimulus variables considered by Swindale et al. (2000) are the retinotopic position (or receptive field center in the visual field) (x, y) in degrees, the preferred orientation θ ∈ [0°, 180°), the ocular dominance n (−1: left eye, +1: right eye), and the spatial frequency m ∈ {−1, 1}. Therefore, μ_ij := (n_ij, m_ij, θ_ij, x_ij, y_ij) for (i, j) ∈ C can be considered a generalized receptive field center; a receptive field would then be defined by a function sitting on the receptive field center and monotonically decreasing away from it (see below). The collection M := {μ_ij}_{(i,j) ∈ C} of such receptive field centers, together with the two-dimensional ordering of cortical positions in C, defines the cortical map. A mathematically convenient way of representing the trade-off between the goals of attaining uniform coverage and respecting the constraints of cortical wiring is to assume that cortical maps maximize a function

F(M) := C(M) + λR(M),  (2.1)
where C is a measure of the uniformity of coverage, R is a measure of the continuity, and λ > 0 specifies the relative weight of R with respect to C. We assume that maximizing either C or R separately does not lead to a maximum of F and therefore that maxima of F imply compromise values of C and R. The exact form of the combination of C and R of equation 2.1 (a weighted sum) need not be biologically correct, but for the purposes of embodying the competition between C and R, it is sufficient. Swindale (1991) introduced the following mathematical definition of coverage. Given an arbitrary stimulus v, the total amount of cortical activity that it produces is defined as

A(v) := Σ_{(i,j) ∈ C} f(v − μ_ij),  (2.2)
where f is the (generalized) receptive field of cortical location (i, j), assumed translationally invariant (so it depends on only the difference of stimulus v and generalized receptive field center μ_ij); f is taken as a product of functions: gaussian for orientation and retinotopic position (with widths derived from biological estimates of tuning curves)[1] and delta for ocular dominance and spatial frequency. A is calculated for a regular grid in stimulus space, which is assumed to be a representative set of stimulus values. The measure

[1] Strictly, the receptive field size depends on the location of the stimulus in the visual field and adapts to the surround (e.g., with contrast). However, extending the coverage definition to account for this is difficult. Therefore, in common with Swindale et al. (2000), we will consider fixed receptive field sizes.
1548
Miguel A. Carreira-Perpiñán and Geoffrey J. Goodhill
of coverage uniformity is finally obtained as

c₀ ≝ stdev{A} / mean{A},    (2.3)
that is, the magnitude of the normalized dispersion of the total activity A over the stimulus space. Intuitively, c₀ will be large when A takes different values for different stimuli and zero if A has the same value independent of the stimulus. Thus, it is a measure of lack of coverage uniformity, and we could define C ≝ −c₀. Equation 2.2 can be seen as a generalization of the fitness term of the elastic net objective function (Durbin, Szeliski, & Yuille, 1989). R is the combined effect of several factors, none of which is fully understood, and so it is hard to write down a functional form for it confidently.

3 Determining Map Optimality via Perturbations

If suitable functions C and R are defined, the mathematical procedure to determine whether a given map M = {μ_ij} for (i, j) ∈ C is a (local) maximum of F is to check that the gradient of F at M is zero and the Hessian of F at M is negative definite (or negative semidefinite). However, there are two problems with this. First, C is obtained in an approximate way² using a sample of the stimulus distribution, and so the numerical accuracy of the gradient and Hessian will be affected by a discretization error, particularly if the sample is coarse and symmetric. But second and crucially, even if we commit ourselves to a given (approximated) mathematical definition of C such as −c₀, we still do not have a suitable definition of R. Given the difficulties in the definition of R and the mathematical treatment of C, the goal of Swindale et al. (2000) was less ambitious: to check whether the maps are at local optima of C by examining the effect on C of a fairly small set of perturbations of the maps that hopefully would not affect R, however the latter is defined. That is, they argued that although we do not know what R is exactly, we may be able to determine which perturbations of a map should leave R unaffected.
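As a concrete illustration of the coverage measure of equations 2.2 and 2.3, here is a minimal sketch for a toy one-dimensional map; the numbers of centers and stimuli and the gaussian kernel width are assumptions made purely for illustration:

```python
import numpy as np

# Toy 1-D map: receptive-field centers mu_ij and a regular grid of stimuli.
centers = np.linspace(0.0, 1.0, 10)     # generalized receptive-field centers
stimuli = np.linspace(0.0, 1.0, 101)    # representative sample of stimulus space
sigma = 0.1                             # assumed gaussian receptive-field width

def total_activity(v):
    """Equation 2.2: A(v) = sum over (i,j) of f(v - mu_ij), with gaussian f."""
    return np.sum(np.exp(-((v - centers) ** 2) / (2 * sigma ** 2)))

A = np.array([total_activity(v) for v in stimuli])
c0 = A.std() / A.mean()                 # equation 2.3: lack of coverage uniformity
C = -c0                                 # the coverage measure C = -c0
```

Note that c₀ is zero only if every stimulus elicits the same total activity; edge effects alone make it positive here.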
Specifically, they suggested rigid-motion perturbations (horizontal translations, 180 degree rotations, and horizontal and vertical flips) applied separately to each individual map (discussed further in section 3.3). If such perturbations unambiguously worsened coverage uniformity for biologically observed maps of developed animals, it would be tempting to conclude that such maps are local maxima of both C and F. To test this idea, Swindale et al. (2000) used empirical maps of ocular dominance, orientation, and spatial frequency obtained simultaneously in area 17 of the cat using standard optical imaging methods for young animals. After some preprocessing (including smoothing, necessary to remove noise), rectangular regions of about 5 × 2.5 mm (approximately 140 × 70 pixels) were obtained in which each pixel has associated values of ocular dominance in {−1, 1}, orientation in degrees in [0, 180), and spatial frequency in {−1, 1}. Since optical imaging provides no information about topography, they chose to make the retinotopic map linear, that is, perfectly topographic,³ which makes coverage uniform by definition along the retinotopic variables x, y. Swindale et al. (2000) then computed the variation of c₀ for a range of rigid-motion perturbations and found that coverage became less uniform for most of the perturbations described, often as an increasing function of the size of the perturbation (notably for horizontal shifts). The lack of negative results in the perturbation simulations led Swindale et al. (2000) to argue that such maps are local maxima of both C and F; or, as Das (2000) put it in an associated article, "Real experimentally obtained maps from V1 are indeed optimally arranged with respect to each other such that any departure from the real maps worsens the coverage." However, there are several reasons to question whether this really follows from the results presented by Swindale et al.

² Although we have a mathematical definition of C = −c₀ in terms of M (via the intermediate definition of A(v; M)), it is not possible to obtain C as an explicit function of M for a given distribution of the stimulus v (e.g., uniform), since one cannot analytically determine the distribution of A. Thus, C(M) must be approximated with a sample of the stimulus distribution. Swindale et al. (2000) computed the total cortical activity A for all combinations of n ∈ {−1, 1}, m ∈ {−1, 1}, θ ∈ {0°, 30°, 60°, 90°, 120°, 150°}, x ∈ {1, 5, 9, …, i_max}, and y ∈ {1, 5, 9, …, j_max}, where i_max and j_max are the size in pixels of the rectangular map, amounting to about 14,000 stimuli. This is a coarse and symmetric sample of θ, x, and y. Drawing a finer random sample from a uniform distribution in stimulus space could avoid potential artifactual estimates and could also be used to control whether different samples lead to essentially the same value of A.

3.1 Incompleteness of the Perturbation Set. The set of perturbations used by Swindale et al.
(2000) does not include all possible perturbations that would leave a given continuity function R unchanged. Assume that all stimulus variables are continuous, and call D the number of (independent) such variables, that is, the number of scalar variables in M = {μ_ij} for (i, j) ∈ C. This is a large number: for a rectangular map of 140 × 70 pixels with 5 stimulus variables, D is 49,000, and it could even be infinite if one considers a nonparametric representation of the map (in which case we would have a variational problem). Then an elementary perturbation of a map M that does not alter R will result in a perturbed map lying at an ε-distance from M on the manifold R(M) ≝ {N: R(N) = R(M)} (see Figure 1). Such a manifold will have dimension D − 1 (or less, if some variables are dependent). Therefore, if D > 2, the number of different directions in which to perturb the map is infinite. In other words, proving that M is a maximum of C for fixed R requires proving
³ The effect on the coverage estimates of assuming that the retinotopic map is strictly topographic is likely to be considerable. Besides, this assumption may not be true in reality (e.g., see Das & Gilbert, 1997).
Figure 1: Illustration of an elementary neighborhood of M contained in the manifold R(M) ≝ {N: R(N) = R(M)}. In this example, the map space is three-dimensional, and the manifold is two-dimensional. The thick lines indicate manifolds along which some specific classes of perturbations leave R constant. These are just a subset of all perturbations that leave R constant (the dark-shaded neighborhood). B_ε(M) is a ball of radius ε centered at M.
that every perturbed map in an elementary neighborhood of M contained in the mentioned manifold has a lower value of C. Such a neighborhood can be specifically defined as the intersection of a ball B_ε(M) of small radius ε > 0 and the manifold R(M), as shown in Figure 1, and contains an infinite number of maps. Therefore, a procedure based on trying a finite number of different perturbations can never prove the statement, although it can disprove it by finding one such perturbation that increases C (assuming it is possible to implement an elementary perturbation numerically). No matter how many perturbations we try that decrease C, we can never be sure that the rest of them will as well. Even if a candidate map passes a seemingly convincingly large number of perturbations, there are still many more perturbations to test, and there are many other candidates that would pass those perturbations too. Therefore, a "mathematically complete exploration of a full range of distortions" (Das, 2000) is an unattainable goal. The elementary perturbations we have mentioned include not just (vanishingly small) rigid motions in all possible directions, but also different classes of perturbations, such as "rubber-sheet" distortions of the map. Both Swindale et al. (2000) and Das (2000) admit the possibility that rubber-sheet distortions exist that improve coverage uniformity while leaving R unchanged. However, Swindale et al. argue that since R is not properly defined, a given distortion that leaves unchanged a function R₁ defined in some way would likely change a function R₂ defined in a different way. Besides the fact that the same argument could equally be applied to the definition of coverage uniformity (a distortion decreasing coverage uniformity
as c₀ does might increase coverage uniformity under a different definition), our argument remains, since for any given definition of R, there potentially exist distortions that could increase C; that is, the neighborhood defined earlier depends on the chosen definition of R, but it always exists. These issues are illustrated more concretely in Figure 2. This presents an example of a very simplified version of the mapping problem investigated by Swindale et al. (2000), akin to a one-dimensional ocular dominance problem (Goodhill & Willshaw, 1990). A particular map is hypothesized to be optimal and is then perturbed. We show that rigid shifts of this map decrease C = −c₀ as a monotonically decreasing function of the size of the shift. However, we also show that rubber-sheet perturbations of this map that leave R unchanged can increase C.

3.1.1 Checking for Stationary Points. Could one at least determine whether the maps are at a stationary point of C for fixed R, that is, whether the gradient of C at the map M is zero in the direction tangent to the manifold R(M)? A numerical approximation to the gradient of a function f of D variables can be computed by finite differences by using small perturbations along D linearly independent directions,⁴ for example, along the coordinate axes (with unit vectors e₁, …, e_D), by computing

∂f/∂x_d ≈ (f(x + ε e_d) − f(x)) / ε.
This would require f to be computed at x + ε e_d for d = 1, …, D, that is, D component-wise small perturbations (for comparison, the shifts of Swindale et al., 2000, would amount to only six perturbation dimensions in a space of D = 49,000). Whether ∇f(x) = 0 could then be determined, at least up to some numerical threshold (which may not be a straightforward matter), and thus whether x is a stationary point of f. However, ∇f(x) = 0 is true for saddle points as well as optima, and high-dimensional multivariate functions that have many optima typically have far more saddle points (see the appendix).⁵ Figure 3 shows an example of a function of two variables with a stationary point at the origin that is neither a maximum nor a minimum, yet perturbing the point at the origin along many directions will result in a lower value of the function. Thus, knowing that the gradient is zero and finding that the high-dimensional function decreases along a few directions in no way guarantees that it is at a maximum.
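A minimal sketch of this forward-difference scheme (the test function and step size below are illustrative assumptions, not part of the original analysis):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Forward-difference approximation: df/dx_d ~ (f(x + eps*e_d) - f(x)) / eps.
    Requires D = len(x) extra function evaluations, one per coordinate axis."""
    fx = f(x)
    grad = np.empty_like(x, dtype=float)
    for d in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[d] = eps
        grad[d] = (f(x + e) - fx) / eps
    return grad

# Example: f(x) = sum(x^2) has exact gradient 2x.
g = numerical_gradient(lambda x: np.sum(x ** 2), np.array([1.0, -2.0]))
```

For D = 49,000 this already means 49,000 function evaluations just to estimate the gradient once.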
⁴ For this and other statements in this section, see a text on optimization, e.g., Nocedal and Wright (1999).
⁵ That the coverage and continuity functions must have many local optima follows from symmetry considerations and from the fact that, assuming the maps are indeed optimal, while the maps of any two normal animals (of the same species) are qualitatively similar to each other, no two animals have the same map.
To determine whether the stationary point is a maximum (say), one would need to compute a numerical approximation of the Hessian (the matrix of second-order derivatives) and check that it is negative definite (or positive definite, for a minimum), that is, that all its eigenvalues are strictly negative. This is now a much harder numerical problem than determining whether ∇f(x) = 0: estimating the Hessian requires O(D²) perturbations rather than O(D) as for the gradient, and even if it could be estimated, the real problem is then determining whether its eigenvalues are negative (computationally an O(D³) problem). This is very difficult because it is well known that the Hessian of a function of many variables is likely to be ill conditioned: the ratio of the smallest to largest eigenvalue (in absolute value)
Figure 2: A nonoptimal map that shows a systematic worsening of coverage uniformity on shifts but a systematic improvement on rubber-sheet perturbations. In this thought experiment, reminiscent of an elastic net (Durbin & Willshaw, 1987), the array of empty circles represents a uniform sample in a two-dimensional stimulus space (e.g., the horizontal axis could be a retinotopic variable and the vertical axis the ocular dominance, as in Goodhill & Willshaw, 1990). The string of filled circles (receptive field centers) represents a one-dimensional map that tries to cover the stimuli as much and as uniformly as possible (measured by C ≝ −c₀ as in section 2) while respecting the map continuity as much as possible (here defined as the sum R of the lengths of the individual segments). To compute c₀, a gaussian kernel f with a standard deviation equal to twice the radius of the shaded disks was used in equation 2.2. Map B is the result of rigidly shifting map A to the right: it has the same length R as map A but a lower value of C. In general, rigidly shifting map A horizontally systematically decreases C and results in an inverted-U curve for C, wrongly suggesting that map A is an optimum. The same happens for vertical shifts, as in map C (although, for this particular example, map A is slightly off the maximum of C). However, map A can be stretched and compressed symmetrically (a rubber-sheet transformation) to reach map D, keeping its length R constant. Thus, map A is not optimal, and there is a continuous path inside the manifold of constant R that monotonically increases C until map D is reached. Whether map A is a saddle point of F(M) = C(M) + λR(M) or lies on an inclined ridge depends on the actual values of λ and the parameters of C and R. In all graphs in the right column, the vertical scale is the same; the steepness of the curves may be increased or decreased by changing the gaussian kernel width, and the curves were computed with a Matlab program from the actual data and definitions of C and R. Note that if one considers periodic boundary conditions in the horizontal axis, by symmetry c₀ becomes exactly 0 for both maps A and D, even though intuitively map D is better than map A. This is a shortcoming of the definition of c₀; the fitness term of, for instance, the elastic net objective function does differentiate between maps A and D.
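A numerical toy version of this thought experiment, with invented sizes and kernel width, reproduces the shift effect: a rigid shift of a nonoptimal string map still lowers C = −c₀.

```python
import numpy as np

# Hypothetical toy version of the Figure 2 setup: stimuli on a 2-D grid,
# a 1-D string of receptive-field centers trying to cover them.
stimuli = np.array([(x, y) for x in range(8) for y in (0.0, 1.0)])
map_A = np.array([(x + 0.5, 0.5) for x in range(7)])   # string of centers
sigma = 1.0                                            # assumed kernel width

def c0(centers):
    """Lack of coverage uniformity, following equations 2.2 and 2.3."""
    A = np.array([np.sum(np.exp(-np.sum((v - centers) ** 2, axis=1)
                                / (2 * sigma ** 2))) for v in stimuli])
    return A.std() / A.mean()

map_B = map_A + np.array([1.0, 0.0])   # rigid shift to the right, same length R
# The shift worsens coverage uniformity (C = -c0 decreases), as for map B in
# the figure, even though map A itself need not be optimal.
assert -c0(map_B) < -c0(map_A)
```

As in the figure, the inverted-U behavior of C under shifts says nothing about whether the unshifted map is actually optimal.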
Figure 3: A function with a stationary point at the origin (M = 0) that is a saddle point: (left) surface plot; (right) contour plot. The equation of the function in polar coordinates is f(r, θ) = r²(sin^{2m}(Nθ − α) − 1/2) with m = 10, N = 3, and α = π/2. In the contour plot, the shaded areas correspond to f > 0 and the white areas to f < 0. Any straight line in the (r, θ) plane that passes through the origin is associated with either a U curve, along which the function increases away from the origin (if inside the shaded areas); an inverted-U curve, along which the function decreases away from the origin (if inside the white areas); or a horizontal line, along which the function is constant (if on the boundary). For the particular function shown, inverted-U curves are much more abundant than U curves, and so perturbations of the point M = 0 inside a small ball B_ε(M) typically result in a lower value of f.
is often very close to 0 and geometrically corresponds to a direction along which f is nearly flat. This is a perennial problem in multivariate optimization (e.g., backpropagation training of a multilayer perceptron), where it can be difficult to tell whether the optimization algorithm converged to an optimum or got stuck at a saddle point or even at a nonstationary point. Bentler and Tanaka (1983) report a remarkable example from the factor analysis literature (in a space of merely 36 dimensions). But the problem we really have is even harder, because we want to see whether a function C has an optimum along the manifold R(M), not along the whole space. This means that we cannot even obtain D − 1 linearly independent directions along which to perturb⁶ the map (D − 1 being the dimension of R(M)), as shown in section 3.2, and therefore we cannot even compute the gradient tangential to R(M). Consequently, we are not merely unable to determine whether the map is a maximum inside R(M); we cannot even determine whether the map is at a stationary point inside R(M).
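To make the O(D²)/O(D³) bookkeeping concrete, here is a minimal sketch on a toy two-variable function (the function, step size, and helper names are illustrative assumptions):

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """O(D^2) central-difference estimate of the Hessian of f at x."""
    D = len(x)
    H = np.empty((D, D))
    for a in range(D):
        for b in range(D):
            ea = np.zeros(D); ea[a] = eps
            eb = np.zeros(D); eb[b] = eps
            # central-difference estimate of d^2 f / dx_a dx_b
            H[a, b] = (f(x + ea + eb) - f(x + ea - eb)
                       - f(x - ea + eb) + f(x - ea - eb)) / (4 * eps ** 2)
    return H

f = lambda x: x[0] ** 2 - x[1] ** 2            # a saddle at the origin
H = numerical_hessian(f, np.zeros(2))
eigenvalues = np.linalg.eigvalsh(H)            # the O(D^3) step
is_maximum = bool(np.all(eigenvalues < 0))     # negative definite <=> maximum
# is_maximum is False here: the origin is a saddle, even though f decreases
# along many perturbation directions (e.g., along the x_2 axis).
```

Even in this two-dimensional case, classifying the stationary point requires the full eigenvalue check; sampling a few directions would not settle it.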
⁶ Assuming true small perturbations, not the ones that Swindale et al. (2000) used, as discussed in section 3.3.
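Returning to the function in the Figure 3 caption, the fraction of directions through the origin along which it decreases can be probed numerically (the sampling radius and grid below are illustrative assumptions):

```python
import numpy as np

# The saddle function of Figure 3: f(r, theta) = r^2 (sin^{2m}(N theta - alpha) - 1/2),
# with m = 10, N = 3, alpha = pi/2 taken from the caption.
m, N, alpha = 10, 3, np.pi / 2

def f(r, theta):
    return r ** 2 * (np.sin(N * theta - alpha) ** (2 * m) - 0.5)

# Fraction of directions through the origin along which f decreases:
thetas = np.linspace(0.0, 2 * np.pi, 100000, endpoint=False)
frac_decreasing = float(np.mean(f(0.1, thetas) < 0.0))
# About 5/6 of all directions give an inverted-U curve, so sampling a few
# random perturbations would wrongly suggest that the origin is a maximum.
```

This quantifies the caption's claim that inverted-U curves are much more abundant than U curves for this function.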
3.2 Such Perturbations May Indeed Alter R. Swindale et al. (2000) argue that the perturbations they tried should not affect the continuity measure, however this may actually be defined. The justification is that presumably such a measure would depend on the Euclidean distances between stimulus values, and these distances are preserved by the class of rigid motions. However, Swindale et al. (2000) applied rigid motions to the individual maps separately,⁷ and this does alter the geometric relationships between the individual maps that are observed in biological maps. For example, the stripes of ocular dominance and orientation maps are known to intersect at approximately right angles, and the singularities of the orientation map are known to lie generally at the centers of the ocular dominance stripes (Bartfeld & Grinvald, 1992; Obermayer & Blasdel, 1993; Hübener, Shoham, Grinvald, & Bonhoeffer, 1997); and (although still awaiting experimental replication) the orientation discontinuities seem to be matched with retinotopic discontinuities (Das & Gilbert, 1997). If the individual maps are independently rotated or translated, these relationships are altered (or completely broken, if the perturbations are large). Such alterations of the individual map interrelations are likely to have an effect on the cortical wiring constraints and therefore on the value of R. Consequently, the fact that coverage uniformity generally decreased becomes hard to interpret, since it could be accompanied by an increase or a decrease in R.

3.3 Such Perturbations Are Not Local. So far we have used the term perturbation in its usual sense of a small or elementary perturbation, whose amount is vanishingly small. For example, a small translation of the map M in the direction of a vector N could be defined as M′ = M + εN, or μ′_ij = μ_ij + ε ν_ij for all (i, j) ∈ C, where ε > 0 is very small; similarly, if every μ_ij is perturbed by a different small amount, then we would have a rubber-sheet perturbation.
However, the perturbations that Swindale et al. (2000) used are not small. To see this, note that their perturbations are actually permutations. Consider, for example, perturbing the orientation map while keeping the other maps fixed, and assume for notational convenience that the cortex origin is at the center of the rectangular region that Swindale et al. examined:

Horizontal shift of k pixels: θ_ij takes the value of θ_{i+k,j}.
Horizontal flip: θ_ij is swapped with θ_{−i,j}.
180 degree rotation: θ_ij is swapped with θ_{−i,−j}.
⁷ If the individual maps were transformed jointly by a rigid motion, we would gain no information (assuming that the cortex is homogeneous and isotropic).
Figure 4: Shifts of a one-dimensional bit-mapped cortex with a discontinuity.
Thus, the new value of θ_ij will often be very different from the original one. It could be argued that for shifts of a small amount (e.g., k = 1 pixel or less), the value of θ_{i+k,j} will be very similar to that of θ_ij, but this rests on an assumption of continuity of orientation that does not hold generally (e.g., at singularities or fractures), and the same would happen with the other maps. This argument holds no matter how finely one discretizes the cortex, since the discontinuities do not go away as the pixel size goes to zero. The argument would also apply to small-angle individual map rotations, though Swindale et al. (2000) considered only 180 degree rotations. Figure 4 illustrates the idea. It shows a one-dimensional bit-mapped cortex M₀ in which each pixel i contains an orientation value θ_i that varies continuously except at a single point; it then shows maps M₁, M₂, …, shifted by increasing amounts of 1, 2, … pixels, and the respective perturbations M₀ − M₁, M₀ − M₂, … of the original map. It can be seen that the effect of both the continuous region and the discontinuity is cumulative with the shift size, consistent with the U curves reported by Swindale et al.; and if the pixel size is very small, the 1 values in M₀ − M₁ will be very small, but the −7 will remain, and similarly for the other shifts, so that the perturbation remains large. In summary, the "perturbations" of Swindale et al. (2000) are not small perturbations, but permutations that often result in very large perturbations, and are therefore nonlocal. That is, the perturbed map is not in the immediate vicinity of the original one but in a faraway region of map space, and therefore a comparison of the coverage values of the two maps is hard to interpret. Put another way, it makes sense to add 1 degree to a given θ_ij and see how that affects C, but not to add unknown, potentially arbitrarily large amounts to all θ_ij's.
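The pixel-size argument can be made concrete with a hypothetical one-dimensional map (the ramp, the 70 degree jump, and the grid sizes are invented for illustration; Figure 4 itself uses different values):

```python
import numpy as np

# Hypothetical 1-D "bit-mapped cortex": orientation varies smoothly except for
# one discontinuity, as in Figure 4. A one-pixel shift is a permutation, and
# the perturbation it induces does not vanish as the pixel size goes to zero.
def make_map(n_pixels):
    theta = np.linspace(0.0, 90.0, n_pixels)   # smooth ramp (degrees)
    theta[n_pixels // 2:] += 70.0              # a 70 degree jump at the midpoint
    return theta

for n in (100, 1000, 10000):
    theta = make_map(n)
    shifted = np.roll(theta, -1)               # shift by one pixel
    pert = np.abs(shifted - theta)[:-1]        # drop the wrap-around pixel
    # The smooth part of the perturbation shrinks with the pixel size,
    # but the contribution of the discontinuity stays fixed near 70 degrees:
    print(n, pert.max())
```

Refining the grid shrinks the typical perturbation entry toward zero, yet the maximum stays pinned at the jump size, so the shifted map never approaches the original in map space.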
Besides, if the original (observed) map was indeed at a local optimum of C, given the multiplicity of local optima in map space
mentioned earlier, it is to be expected that the perturbed map would then be near a completely different local optimum.

4 Conclusion

The general principle that cortical maps are wired in a way that achieves uniform coverage while also minimizing cortical wiring is important for understanding cortical map structure. The abstract implementation of such a principle in cortical map models based on dimensionality reduction replicates most of the characteristics of such maps. The evidence that Swindale et al. (2000) presented is certainly consistent with the optimization hypothesis, but it does not add significant support for it. Being based on trial and error over a subset of the possible perturbations, it might disprove the hypothesis that empirical maps maximize (a certain measure of) coverage uniformity, but it cannot prove it. The lack of an appropriate definition of economy of cortical wiring prevents us from finding perturbations that leave it unchanged and ultimately prevents a quantitative assessment of the general principle stated above. What could be more easily tested is whether empirical maps are stationary points of the coverage uniformity by itself (irrespective of any connectivity constraint), but it would be surprising if this were so. Since most of the perturbations that Swindale et al. (2000) tried worsened coverage uniformity (the more so the larger the perturbation, for the horizontal shifts), it may be argued that this cannot be due to chance. In section 3, we showed that in high dimensions this intuition about chance is misleading and that continuity is also being altered by those manipulations. In addition, the example shown in Figure 2 demonstrates that it is also possible to observe a systematic decrease in coverage uniformity for shifts when the map is not optimal. Visual cortex has an orderly columnar structure. That perturbations such as shifts should disturb that structure, and probably worsen coverage uniformity, should not come as a surprise.
That any perturbation should worsen coverage uniformity is a far stronger statement that would be very difficult to confirm empirically. The work of Swindale et al. (2000) provides useful evidence that coverage is fairly uniform across visual cortex, but it does not prove that coverage is as uniform as possible. Optimality is a very important principle in biology, and there are many examples where biological systems have been proven to achieve the best performance possible given the relevant physical constraints (e.g., Bialek, 1987). The approach in such cases is generally to calculate from first principles what optimal performance would be and then show that biology achieves this performance. This is analogous to the standard methodology in the visual cortical map field of hypothesizing an objective function (though on rather less certain grounds than direct physical constraints), calculating the maps that follow from this function, and comparing them with real maps. An alternative method, practical when the problem is discrete and the number of possible states is relatively small, is to calculate the value
Figure 5: (Left) Setup for the 3D case of the proof of the appendix. The nearest neighbors of the central point (at the origin) of first order lie at the diagonals of length 1, of second order at the diagonals of length √2, and of third order at the diagonals of length √3; they correspond to the 6 faces, 12 edges, and 8 vertices, respectively, of a cube of side 2 centered on the origin (dotted line). The midpoints of the diagonals are marked with small dots. (Right) Contour plot of a function of two variables with many maxima. The maxima are marked with ∗, minima with ∘, and saddle points with ×. Observe the abundance of saddle points compared to that of maxima or minima. This is because when going from one maximum to another maximum, or from one minimum to another minimum, we cross a saddle point. In higher dimensions, saddle points become even more abundant, possibly exponentially so.
of an objective function for all states and show that the biological state represents the optimum of this function (e.g., Cherniak, 1995). However, in the case discussed by Swindale et al. (2000), the number of states is infinite, and we argue that a numerical perturbation approach cannot yield significant insight into optimality.

Appendix: Abundance of Saddle Points and Optima in High Dimensions

Consider a linear superposition of localized spherical functions (each function falls away quickly, e.g., at a distance 1/2), each centered at the knots of a D-dimensional cubic array. That is, we place one such function at every position (x₁, x₂, …, x_D) where each x_d is an integer, for d = 1, …, D. Call every such position a whole-knot. By symmetry, we will have a maximum at every whole-knot and, over a large region, the same number of minima. At the midpoint of the segment joining any two maxima (minima), there must be either a saddle point or a minimum (maximum). These midpoints are located at positions (x₁, x₂, …, x_D) where at least one x_d is an integer plus 1/2, and they correspond to the midpoints of diagonals of lengths 1, √2, √3, …, √D that link a maximum with its respective nearest-neighboring maxima (see Figure 5, left). Call such midpoints half-knots. Over a hypercubic region of side
N in each axis, we have (2N)^D half-knots and whole-knots (saddle points, minima, and maxima) and N^D whole-knots (maxima). Thus, the region contains N^D maxima, N^D minima, and (2N)^D − 2N^D saddles. Therefore, the ratio maxima:minima:saddles is 1:1:2^D − 2, and there are O(2^D) saddle points per maximum or minimum. The expression is also valid for D = 1 (where there exist no saddle points). Figure 5 (left) shows the proof setup for D = 3. The groups of nearest-neighboring maxima of a maximum knot are at distances 1, √2, and √3 and correspond to the 6 faces, 12 edges, and 8 vertices, respectively, of a cube of side 2 centered on the knot. At the midpoint of every such diagonal, there is either a saddle point or a minimum. In less crystalline arrangements, some maxima, minima, and saddle points will coalesce, but if the function under consideration has many maxima uniformly scattered over its domain, we would expect the ratio to remain approximately correct. Figure 5 (right) shows the landscape of a bivariate function with many maxima. Hence, saddle points are typically much more numerous than maxima and minima in high dimensions.

Acknowledgments

We thank Peter Dayan and Graeme Mitchison for helpful discussions and comments on this article, and particularly Nick Swindale for his willingness to discuss with us the issues it raises.

References

Bartfeld, E., & Grinvald, A. (1992). Relationships between orientation-preference pinwheels, cytochrome oxidase blobs, and ocular-dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. USA, 89(24), 11905–11909.
Bentler, P. M., & Tanaka, J. S. (1983). Problems with EM algorithms for ML factor analysis. Psychometrika, 48(2), 247–251.
Bialek, W. (1987). Physical limits to sensation and perception. Annu. Rev. Biophys. Biophys. Chem., 16, 455–478.
Cherniak, C. (1995). Neural component placement. Trends Neurosci., 18(12), 522–527.
Das, A. (2000). Optimizing coverage in the cortex. Nat. Neurosci., 3(8), 750–752.
Das, A., & Gilbert, C. D. (1997). Distortions of visuotopic map match orientation singularities in primary visual cortex. Nature, 387(6633), 594–598.
Durbin, R., & Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps. Nature, 343(6259), 644–647.
Durbin, R., Szeliski, R., & Yuille, A. (1989). An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1(3), 348–358.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the traveling salesman problem using an elastic net method. Nature, 326(6114), 689–691.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Computation, 7(3), 425–468.
Goodhill, G. J., & Willshaw, D. J. (1990). Application of the elastic net algorithm to the formation of ocular dominance stripes. Network: Computation in Neural Systems, 1(1), 41–59.
Hubel, D. H., & Wiesel, T. N. (1977). Functional architecture of the macaque monkey visual cortex. Proc. R. Soc. Lond. B, 198(1130), 1–59.
Hübener, M., Shoham, D., Grinvald, A., & Bonhoeffer, T. (1997). Spatial relationships among three columnar systems in cat area 17. J. Neurosci., 17(23), 9270–9284.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer-Verlag.
Obermayer, K., & Blasdel, G. G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13(10), 4114–4129.
Swindale, N. V. (1991). Coverage and the design of striate cortex. Biol. Cybern., 65(6), 415–424.
Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network: Computation in Neural Systems, 7(2), 161–247.
Swindale, N. V., Shoham, D., Grinvald, A., Bonhoeffer, T., & Hübener, M. (2000). Visual cortex maps are optimised for uniform coverage. Nat. Neurosci., 3(8), 822–826.

Received April 24, 2001; accepted October 29, 2001.
NOTE
Communicated by Klaus Obermayer
Kernel-Based Topographic Map Formation by Local Density Modeling
Marc M. Van Hulle
[email protected] K. U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Leuven, Belgium We introduce a new learning algorithm for kernel-based topographic map formation. The algorithm generates a gaussian mixture density model by individually adapting the gaussian kernels’ centers and radii to the assumed gaussian local input densities. 1 Introduction
Rather than developing topographic maps with disjoint and uniform activation regions (Voronoi tessellation), such as in the case of the popular self-organizing map (SOM) algorithm (Kohonen, 1995) and its adapted versions, algorithms have been introduced that can accommodate neurons with overlapping activation regions, usually in the form of kernel functions, such as radially symmetric gaussians. For these kernel-based topographic maps, several learning principles and schemes have been proposed (for a review, see Van Hulle, 2000). One of the earliest examples is the elastic net of Durbin and Willshaw (1987), which can be viewed as a gaussian mixture density model, fitted to the data points by a penalized maximum likelihood term. The standard deviations of the gaussians (radii) are all equal and are gradually decreased over time. More recently, Bishop, Svensén, and Williams (1998) introduced the generative topographic map, which is based on constrained gaussian mixture density modeling: constrained, since the gaussians cannot move independently of each other (the map is topographic by construction). Furthermore, all gaussians have an equal and fixed radius. Sum, Leung, Chan, and Xu (1997) maximized the correlations between the activities of neighboring lattice neurons. The radii of the gaussians are under external control and are gradually decreased over time. Graepel, Burger, and Obermayer (1997) introduced the soft topographic vector quantization (STVQ) algorithm and showed that a number of probabilistic, SOM-related topographic map algorithms can be regarded as special cases. The gaussian kernel represents a fuzzy membership (in clusters) function, and its radius, which is equal for all neurons, is again under external control (deterministic annealing).
In the kernel-based maximum entropy learning rule (kMER) (Van Hulle, 1998), the outputs of the gaussians are thresholded and the radii individually adapted so as to make the neurons (suprathreshold) active with equal probabilities (equiprobabilistic maps). In the kernel-based soft topographic mapping (STMK) algorithm (Graepel, Burger, & Obermayer, 1998), a nonlinear transformation is introduced that maps the inputs to a high-dimensional feature space and also admits a kernel function, an idea borrowed from support vector machines (SVMs) (Vapnik, 1995; Schölkopf, Burges, & Smola, 1999). The kernel operates in the original input space, but its parameters are not modified by the algorithm. This connection with SVMs was taken up again by András (2001), but with the purpose of linearizing the class boundaries. The kernel radii are adapted individually so that the map's classification performance is optimized (using supervised learning). Yin and Allinson (2001) proposed an algorithm aimed at minimizing the Kullback-Leibler divergence (also called relative or cross-entropy) between the true input density and the estimate obtained from the individually adapted gaussian kernels in the topographic map. In this contribution, we propose a still different approach: the gaussian kernels are adapted individually and in an unsupervised manner, but in such a way that their centers and radii correspond to those of the assumed gaussian local input densities.

Neural Computation 14, 1561–1573 (2002); © 2002 Massachusetts Institute of Technology

2 Kernel-Based Topographic Map Formation
Let A be a lattice of N formal neurons and $V \subseteq \mathbb{R}^d$ the input space. To each neuron $i \in A$ corresponds a weight vector $w_i = [w_{i1}, \ldots, w_{id}] \in V$ and a lattice coordinate $r_i \in V_A$, with $V_A$ the lattice space (we assume discrete lattices with regular topologies). As an input to lattice space transformation, we take one that admits a kernel function: $\langle \Psi(v), \Psi(w_i) \rangle = K(v, w_i)$, with $v \in V$, $\Psi$ the transformation, and $\langle \cdot, \cdot \rangle$ the internal product operator in $V_A$. We prefer to use gaussian kernels:

$$K(v, w_i, \sigma_i) = \exp\left(-\frac{\|v - w_i\|^2}{2\sigma_i^2}\right). \qquad (1)$$

When performing topographic map formation, we require that the weight vectors are updated so as to minimize the expected value of the squared Euclidean distance $\|v - w_i\|^2$ and, hence, following our transformation $\Psi$, we instead wish to minimize $\|\Psi(v) - \Psi(w_i)\|^2$, which we will achieve by performing gradient descent with respect to $w_i$. We first determine the gradient:

$$\frac{\partial}{\partial w_i} \|\Psi(v) - \Psi(w_i)\|^2 = \frac{\partial}{\partial w_i} K(w_i, w_i, \sigma_i) + \frac{\partial}{\partial w_i} K(v, v, \sigma_i) - 2 \frac{\partial}{\partial w_i} K(v, w_i, \sigma_i) = -2 \frac{\partial}{\partial w_i} \exp\left(-\frac{\|v - w_i\|^2}{2\sigma_i^2}\right). \qquad (2)$$
Kernel-Based Topographic Map Formation
1563
Hence, the learning rule for the kernel centers $w_i$ becomes (without neighborhood function; see further)

$$\Delta w_i = \eta_w \, \frac{v - w_i}{\sigma_i^2} \, K(v, w_i, \sigma_i), \qquad (3)$$
with $\eta_w$ the learning rate. The equilibrium weight vector can be written as

$$w_i^{\mathrm{eq}} = \frac{\int_V v \, K(v, w_i^{\mathrm{eq}}, \sigma_i) \, p(v) \, dv}{\int_V K(v, w_i^{\mathrm{eq}}, \sigma_i) \, p(v) \, dv}, \qquad \forall i \in A, \qquad (4)$$

that is, the center of gravity of $K(\cdot)p(\cdot)$, or, when we define the product of the input density and the gaussian kernel as a new, local input density $p^*(v) = K(v, w_i^{\mathrm{eq}}, \sigma_i)\,p(v)$ (note that $p^*$ is normalized by the denominator in equation 4), then we can rewrite the equilibrium weight vector as $w_i^{\mathrm{eq}} = \langle v \rangle_{p^*}$. In order to achieve a topology-preserving mapping, we multiply the right-hand side of equation 3 with a neighborhood function $\Lambda(i, i^*)$, with $i^* = \arg\max_{\forall i \in A} K(v, w_i, \sigma_i)$, that is, an activity-based definition of winner-takes-all rather than the minimum Euclidean distance-based definition ($i^* = \arg\min_i \|w_i - v\|$) usually adopted in topographic map formation. The equilibrium weight vector now becomes

$$w_i^{\mathrm{eq}} = \frac{\int_V v \, \Lambda(i, i^*) \, K(v, w_i^{\mathrm{eq}}, \sigma_i) \, p(v) \, dv}{\int_V \Lambda(i, i^*) \, K(v, w_i^{\mathrm{eq}}, \sigma_i) \, p(v) \, dv}, \qquad \forall i \in A. \qquad (5)$$
We now derive the learning rule for the kernel radii $\sigma_i$. We consider two steps. First, we perform gradient descent on $\|\Psi(v) - \Psi(w_i)\|^2$, but with respect to $\sigma_i$, so that after some algebraic manipulations:

$$\Delta\sigma_i \propto \frac{\|v - w_i\|^2}{\sigma_i^3} \, K(v, w_i, \sigma_i), \qquad \forall i \in A. \qquad (6)$$

Second, we wish each radius $\sigma_i$ to correspond to the standard deviation of a d-dimensional gaussian input distribution centered at the (equilibrium) weight vector of neuron i so that, together with the latter, the assumed gaussian local input density is modeled. Since the expected value of the Euclidean distance to the center of a d-dimensional gaussian can be approximated as $\sqrt{d}\,\sigma$ for d large (Graham, Knuth, & Patashnik, 1994),¹ the radius update rule becomes (without neighborhood function)

$$\Delta\sigma_i = \eta_\sigma \, \frac{1}{\sigma_i} \left(\frac{\|v - w_i\|^2}{\sigma_i^2} - \rho d\right) K(v, w_i, \sigma_i), \qquad \forall i \in A, \qquad (7)$$
¹Note that the Euclidean distance distribution is chi-squared with $h = 2$ and $a = d/2$ degrees of freedom.
with $\eta_\sigma$ the learning rate. The scale factor $\rho$ (a constant) is designed to relax the local gaussian (and d large) assumption in practice. The equilibrium condition for the kernel radii can be written as

$$\sigma_i^{2,\mathrm{eq}} = \frac{1}{\rho d} \, \frac{\int_V \|v - w_i\|^2 \, K(v, w_i, \sigma_i^{\mathrm{eq}}) \, p(v) \, dv}{\int_V K(v, w_i, \sigma_i^{\mathrm{eq}}) \, p(v) \, dv}, \qquad \forall i \in A, \qquad (8)$$

that is, the moment of inertia of $p^*$ (without the $\frac{1}{\rho d}$ term), or, for short, $\sigma_i^{2,\mathrm{eq}} = \frac{1}{\rho d} \langle \|v - w_i\|^2 \rangle_{p^*}$. Finally, similar to the weight update rule, we multiply
the right-hand side of equation 7 with our neighborhood function $\Lambda(\cdot)$. The corresponding equilibrium radius then becomes

$$\sigma_i^{2,\mathrm{eq}} = \frac{1}{\rho d} \, \frac{\int_V \Lambda(i, i^*) \, \|v - w_i\|^2 \, K(v, w_i, \sigma_i^{\mathrm{eq}}) \, p(v) \, dv}{\int_V \Lambda(i, i^*) \, K(v, w_i, \sigma_i^{\mathrm{eq}}) \, p(v) \, dv}, \qquad \forall i \in A. \qquad (9)$$
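To make the update scheme concrete, the online updates of equations 3 and 7, combined with the activity-based winner and the neighborhood factor introduced above, can be sketched as follows. This is our own illustrative rendering, not code from the paper; the function names and default learning rates are our assumptions:

```python
import numpy as np

def gaussian_kernel(v, w, sigma):
    """Gaussian kernel K(v, w, sigma) of equation 1."""
    return np.exp(-np.sum((v - w) ** 2) / (2.0 * sigma ** 2))

def lde_update(v, W, S, eta_w=0.01, eta_s=0.0002, rho=0.4, neighborhood=None):
    """One online LDE step: activity-based winner selection, then the
    center update of equation 3 and the radius update of equation 7,
    both multiplied by an optional neighborhood factor Lambda(i, i*).
    W: (N, d) array of kernel centers; S: (N,) array of kernel radii.
    Both are modified in place; the winner index i* is returned."""
    N, d = W.shape
    acts = np.array([gaussian_kernel(v, W[i], S[i]) for i in range(N)])
    winner = int(np.argmax(acts))            # i* = arg max_i K(v, w_i, sigma_i)
    lam = neighborhood(winner) if neighborhood is not None else np.ones(N)
    sq = np.sum((v - W) ** 2, axis=1)        # ||v - w_i||^2
    W += eta_w * lam[:, None] * (v - W) * (acts / S ** 2)[:, None]   # eq. 3
    S += eta_s * lam * (1.0 / S) * (sq / S ** 2 - rho * d) * acts    # eq. 7
    return winner
```

A `neighborhood` callable returning the vector $\Lambda(\cdot, i^*)$ over the lattice would implement the topology-preserving variant of equations 5 and 9.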
Equations 5 and 9 are in the form in which the so-called iterative contraction mapping (fixed-point iteration), used for solving nonlinear equations, can be directly applied (thus without learning rates).² Such an iterative process is, at least for the weights, reminiscent of the so-called batch map version of the SOM algorithm (Kohonen, 1995; Mulier & Cherkassky, 1995). This observation leads us to the soft topographic vector quantization (STVQ) algorithm (Graepel et al., 1997), a general model for probabilistic, SOM-based topographic map formation, since several algorithms can be considered as special cases, including Kohonen's batch map version. In appendix A, we detail the correspondence of our learning scheme, which we further call the local density estimation (LDE) algorithm, with the STVQ, the batch map, and also the kernel-based soft topographic mapping (STMK) algorithm (Graepel et al., 1998).

²Note that, strictly speaking, since we opt for a winner-takes-all rule, for computational reasons, we should integrate over the subspace of V that leads to an update of neuron i's kernel parameters.

As an example, consider the standard case of a 10 × 10 lattice, trained on samples drawn from the uniform distribution in the unit square $(0, 1]^2$. The weights are initialized by sampling the same distribution; the radii are initialized by randomly sampling the (0, 1] range. We use a gaussian neighborhood function,

$$\Lambda(i, i^*, \sigma_\Lambda) = \exp\left(-\frac{\|r_i - r_{i^*}\|^2}{2\sigma_\Lambda^2}\right),$$

with $\sigma_\Lambda$ the neighborhood radius and $r_i$ neuron i's lattice coordinate. We adopt the following neighborhood cooling scheme:

$$\sigma_\Lambda(t) = \sigma_{\Lambda 0} \exp\left(-2 \sigma_{\Lambda 0} \frac{t}{t_{\max}}\right),$$

with t the present and $t_{\max}$ the maximum number of time steps, and $\sigma_{\Lambda 0}$ the radius at $t = 0$. We take $t_{\max} = 500{,}000$, $\sigma_{\Lambda 0} = 5$, $\eta_w = 0.01$, $\eta_\sigma = 0.02\,\eta_w$, and $\rho = 0.4$. The results are shown in Figure 1. We observe that prior to the lattice disentangling
Figure 1: Evolution of a 24 × 24 lattice as a function of time (snapshots at t = 0, 1k, 10k, 30k, 100k, and 500k). The circles demarcate the areas spanned by the standard deviations (radii) of the corresponding gaussian kernels. The boxes correspond to the range spanned by the uniform input density.
phase, the kernel radii grow rapidly in size and span a considerable part of the input distribution. Furthermore, for larger $\rho$, the lattice disentangles more rapidly, and the radii at convergence are larger. Hence, what happens in case the radii are forced to stay small? Will the lattice still disentangle? This is exemplified in Figure 2 for the same initial weight distribution but with the radii kept constant. The lattice disentangles, albeit at a much slower rate ($t_{\max}$ = 2M).

Figure 2: Evolution when the radii are kept constant at a value of 0.1 (snapshots at t = 0, 1k, 10k, 100k, 200k, and 2M). Same conventions as in Figure 1.

3 Density Estimation Performance

We now explore the density estimation performance of our LDE algorithm by using samples drawn from the distribution shown in Figure 3A (quadrimodal product distribution). We will also use this case as a benchmark for comparing the performances of two modified versions of our algorithm and of four other kernel-based topographic map formation algorithms. The analytic equation of the product distribution is, in the first quadrant, $(-\log v_1)(-\log v_2)$, with $(v_1, v_2) \in (0, 1]^2$, and so on. Each quadrant is chosen with equal probability. The resulting asymmetric distribution is unbounded and consists of four modes separated by sharp transitions (discontinuities), which makes it difficult to model. The support of the distribution is bounded by the unit square $[-1, 1]^2$. We take a 24 × 24 lattice and train it in the same way as in the previous example, but with $t_{\max} = 1{,}000{,}000$, $\eta_w = 0.01$, and $\eta_\sigma = 0.1\,\eta_w$. The density estimate is obtained by taking the sum of all (equal-volume) gaussian kernels at convergence, normalized so as to obtain a unit-volume density estimate. The result is shown in Figure 3B.
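The density estimate just described, i.e., the normalized sum of the trained gaussian kernels, might be computed as in the following sketch (our own illustration, not code from the paper):

```python
import numpy as np

def density_estimate(x, W, S):
    """Density estimate at point x: the average of N unit-volume
    d-dimensional gaussian kernels with centers W (N, d) and radii S (N,),
    so that the resulting mixture itself integrates to one."""
    N, d = W.shape
    sq = np.sum((x - W) ** 2, axis=1)
    norm = (2.0 * np.pi) ** (d / 2.0) * S ** d   # per-kernel volume normalizer
    return float(np.mean(np.exp(-sq / (2.0 * S ** 2)) / norm))
```

Averaging unit-volume kernels guarantees a unit-volume mixture, which is what makes the estimate directly comparable with the theoretical density.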
Figure 3: (A–D) Density estimation with kernel-based topographic maps. Two-dimensional quadrimodal product distribution (A), and the density estimates obtained with our learning algorithm, when adapting the kernel radii (B), and when keeping them fixed (C), and the estimate obtained with the STVQ algorithm (D). The size of the lattices is 24 × 24 neurons. pd = probability density. (E, F) Hill-climbing procedure applied in lattice space on the density estimate shown in B. We first determine the density estimates at the neurons' weight vectors, since hill climbing is performed on them. For each neuron i, we look for the neuron with the highest density estimate among itself and its k-nearest lattice neighbors. When this is neuron i itself, it is called a "top" neuron, since it represents a local density maximum; when it is another neuron j, the procedure is repeated for neuron j, and we say that neuron i "points" to neuron j. All neurons that eventually point to the same top neuron receive the same cluster label. The result is represented as a cluster map (E). In order to motivate the choice of the parameter k in the hill-climbing procedure, the number of clusters found as a function of k is plotted (F). The long plateau indicates that there are four clusters. The cluster map corresponding to k = 100 is shown in E.
Table 1: Density Estimation Performance of Several Kernel-Based Topographic Map Algorithms for the Product Distribution Example in Figure 3A.

Algorithm               Parameters                                   MSE
LDE                     ηw = 0.01, ησ = 0.1 ηw                       6.23 × 10⁻²
LDE (min Eucl)          ηw = 0.01, ησ = 0.1 ηw                       6.77 × 10⁻²
LDE (fixed radii)       ηw = 0.01, σopt = 0.195                      9.78 × 10⁻²
kMER                    ρr = 1, ρs = 2, η = 0.001                    6.71 × 10⁻²
Yin & Allinson (2001)   η = 0.01                                     7.16 × 10⁻²
STVQ                    1/β = 0.01, σΛ = 0.5, σopt = 0.0905          7.48 × 10⁻²
SSOM                    1/β = 0.01, σΛ = 0.5, σopt = 0.0925          7.28 × 10⁻²
SOM                     η = 0.015, σopt = 0.15                       1.21 × 10⁻¹
The mean squared error (MSE) between the estimated and the theoretical distribution is $6.23 \times 10^{-2}$. In order to explore the contribution of our activity-based competition rule, $i^* = \arg\max_i y_i$, we also train our lattice using the classic Euclidean distance-based rule, $i^* = \arg\min_i \|w_i - v\|$, but with the radii adapted as before, using equation 7 ("LDE min Eucl" case in Table 1). The MSE is now $6.77 \times 10^{-2}$, which is inferior to our initial result. Furthermore, in order to see the effect of individually adapting the kernel radii, we also consider the case where the radii are equal and kept constant during learning ("LDE fixed radii" case). But here the following problem arises, as it does in other algorithms that do not adapt the kernel radii: how to choose this radius, since it directly determines the smoothness of the density estimate? In order not to have to rely on a separate optimal smoothness procedure, which could bias our results, we determine the (common) kernel radius for which the MSE between the estimated and the true theoretical distributions is minimal. In this way, we at least know that a better MSE result cannot be obtained. The "optimal" MSE found is $9.78 \times 10^{-2}$ for $\sigma_{\mathrm{opt}} = 0.195$ (optimized in steps of $5 \times 10^{-3}$). The result is shown in Figure 3C. This clearly shows the advantage of adapting the kernel radii individually to the local input density. For the sake of comparison, we also consider four other kernel-based topographic map algorithms, provided that their kernels specify a density estimate in the input space directly. We use the following algorithms: the kernel-based maximum entropy learning rule (kMER) (Van Hulle, 1998), the algorithm of Yin and Allinson (2001), and the STVQ and soft-SOM (SSOM) algorithms (Graepel et al., 1997). All simulations are run on the same input data, using the same cooling scheme (except for the SSOM and STVQ algorithms; see further) and the same number of iterations, as before.
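The MSE criterion used throughout this comparison can be sketched as follows. This is our own sketch; the paper does not state the evaluation grid, so the grid resolution here is an assumption:

```python
import numpy as np

def density_mse(estimate, true_pdf, lo=-1.0, hi=1.0, n=100):
    """Mean squared error between an estimated and a theoretical 2-D
    density, evaluated on a regular n x n grid covering [lo, hi]^2."""
    xs = np.linspace(lo, hi, n)
    pts = [np.array([x, y]) for x in xs for y in xs]
    err = np.array([estimate(p) - true_pdf(p) for p in pts])
    return float(np.mean(err ** 2))
```

Selecting the "optimal" common radius for the fixed-radius baselines then amounts to minimizing this quantity over a one-dimensional grid of candidate radii.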
For the STVQ algorithm, we take for the neighborhood function radius $\sigma_\Lambda = 0.5$ and for the equal and constant kernel radii ("temperature parameter") $\frac{1}{\beta} = 0.01$, as suggested in Graepel et al. (1997).³ We also adopt these parameter values for the SSOM algorithm, since it is in fact a limiting case of the STVQ algorithm (see appendix A). Furthermore, again for the SSOM and STVQ algorithms, since they do not adapt their kernel radii, we look for the (common) kernel radius that optimizes the MSE between the estimated and the theoretical distributions, as explained above. The result for the STVQ algorithm is shown in Figure 3D. Finally, we also consider the original SOM algorithm, add gaussian kernels at the converged weight vectors, and look for the (common) kernel radius that minimizes the MSE. The results for these algorithms are summarized in Table 1, together with the parameters used.

4 Density-Based Clustering
The density estimation facility described above can be used for visualizing clusters in data distributions by performing density-based clustering in lattice space. For each neuron in the lattice, the density estimate at the neuron's weight vector is determined, and the result is displayed graphically, on a relative scale, in the lattice. Clusters then correspond to high-density regions in the lattice, and they can be found and demarcated by a procedure such as hill climbing (see the caption for Figures 3E and 3F). An efficient tree-based hill-climbing algorithm is given in appendix B. This procedure could be regarded as an alternative to the gray-level clustering procedure that has been devised for the SOM and used for data mining purposes (Kohonen, 1995; Lagus & Kaski, 1999). There, for each neuron, the average distance to the weight vectors of the neuron's lattice neighbors is determined, rather than a local estimate of the input density. Finally, density-based clustering itself can be regarded as an alternative to the (usually Euclidean) distance-based similarity criterion, which in most cases is adopted for clustering with competitive learning and topographic maps (Luttrell, 1990; Rose, Gurewitz, & Fox, 1990; Graepel et al., 1997): distance-based clustering assumes that the cluster shape is hyperspherical, at least in the Euclidean case, whereas density-based clustering does not make any assumptions about the cluster shape.

Appendix A: Correspondence with STVQ, SOM, and STMK Algorithms
The soft topographic vector quantization (STVQ) algorithm (Graepel et al., 1997) performs a fuzzy assignment of data points to clusters, whereby each cluster corresponds to a single neuron. It also serves as a general model for probabilistic, SOM-based topographic map formation.

³Note that our input distribution also spans the unit square $[-1, 1]^2$, as in Graepel et al. (1997).

The weight vectors represent the cluster centers and are determined by iterating the following equilibrium equation (put into our format):

$$w_i^{\mathrm{eq}} = \frac{\int_V v \sum_j \Lambda(i, j) \, P(v \in C_j) \, p(v) \, dv}{\int_V \sum_j \Lambda(i, j) \, P(v \in C_j) \, p(v) \, dv}, \qquad \forall i \in A, \qquad (A.1)$$

with $P(v \in C_j)$ the assignment probability of data point v to cluster $C_j$ (i.e., the probability of "activating" neuron j), which is given by

$$P(v \in C_j) = \frac{\exp\left(-\frac{\beta}{2} \sum_k \Lambda(j, k) \|v - w_k\|^2\right)}{\sum_l \exp\left(-\frac{\beta}{2} \sum_k \Lambda(l, k) \|v - w_k\|^2\right)}, \qquad (A.2)$$
with $\beta$ the inverse temperature parameter and $\Lambda(i, j)$ the transition probability of the noise-induced change of data point v from cluster $C_i$ to $C_j$. A number of topographic map algorithms can be considered as special cases by putting $\beta \to \infty$ in equation A.2, and $\Lambda(i, j) = \delta_{ij}$ in equation A.2 and/or A.1. For example, the soft SOM (SSOM) algorithm (Graepel et al., 1997) is obtained by putting $\Lambda(i, j) = \delta_{ij}$ in equation A.2, but not in equation A.1. Kohonen's batch map version (Kohonen, 1995; Mulier & Cherkassky, 1995) is obtained for $\beta \to \infty$ and $\Lambda(i, j) = \delta_{ij}$ in equation A.2, but not in equation A.1, and for $i^* = \arg\min_j \|v - w_j\|^2$ (i.e., a distance-based winner-takes-all rule). Our LDE algorithm is different from the STVQ algorithm in three ways. First, in the STVQ algorithm, the kernel $P(\cdot)$ in equation A.2 represents a fuzzy membership (in clusters) function, that is, the softmax function, normalized with respect to the other neurons in the lattice, with the degree of fuzzification depending on $\beta$. In our case, the kernel $K(\cdot)$ operates in the input space, instead of the (discrete) lattice space, and represents a local density estimate.⁴ Our algorithm also differs from Kohonen's batch map in the definition of the kernel, which in Kohonen's case is the neighborhood function $\Lambda$, whereas in our case we have both $K(\cdot)$ and $\Lambda(\cdot)$, and in the definition of the winner $i^*$ (distance- versus activity-based). Second, instead of using kernels $P(\cdot)$ with equal radii $\frac{1}{\beta}$, with $\beta$ externally controlled (deterministic annealing or "cooling"), our radii differ from one another and are individually adapted. Third, the kernels also differ conceptually since, in the STVQ algorithm, the kernel radii are related to the magnitude of the noise-induced change in the cluster assignment (thus, in lattice space), whereas in our case, they are related to the standard deviations of the local input density estimates (thus, in input space).
⁴Technically, the "kernel" in the STVQ algorithm represents a probability distribution, whereas our kernel represents a probability density.
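For concreteness, the soft assignment of equation A.2 is a neighborhood-weighted softmax over squared distances. A minimal sketch (ours, not code from the paper):

```python
import numpy as np

def stvq_assignments(v, W, Lam, beta):
    """Soft assignment probabilities P(v in C_j) of equation A.2.
    W: (N, d) cluster centers; Lam: (N, N) neighborhood matrix Lambda(j, k);
    beta: inverse temperature."""
    sq = np.sum((v - W) ** 2, axis=1)      # ||v - w_k||^2 for all k
    e = -0.5 * beta * (Lam @ sq)           # -(beta/2) sum_k Lambda(j,k) ||v - w_k||^2
    e -= e.max()                           # stabilize the softmax numerically
    p = np.exp(e)
    return p / p.sum()
```

Setting `Lam` to the identity matrix reproduces the SSOM special case in equation A.2, and letting `beta` grow large drives the assignment toward a hard winner-takes-all.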
In the kernel-based soft topographic mapping (STMK) algorithm (Graepel et al., 1998), a nonlinear transformation $\Psi$, from the original input space V to some feature space F, is introduced that admits a kernel function: $\langle \Psi(x), \Psi(y) \rangle = K(x, y)$, with $K(\cdot)$, for example, a gaussian, and with $\langle \cdot, \cdot \rangle$ the internal product operator in F-space. The topographic map's parameters ("weights") are expressed in this feature space as linear combinations of the transformed inputs: $w_i = \sum_{m=1}^{M} \alpha_i^m \Psi(v^m)$, given a batch of M input samples $\{v^m\}$. The coefficients $\alpha_i^m$ are determined by iterating an equilibrium equation. Furthermore, as in the STVQ algorithm, soft assignments of the inputs to clusters are made, $P(v^m \in C_j)$. Our algorithm differs in several ways from the STMK algorithm. First, the topographic map's parameters $\alpha_i^m$ are developed in the feature space F rather than in the input space V, as in our LDE algorithm. Second, the STMK algorithm does not update the kernels' parameters. In fact, when initializing the STMK algorithm, the kernel-transformed data $\{K(v^m, v^\nu)\}$ are determined, and they replace the original input data (see Graepel et al., 1998, p. 182). In our LDE algorithm, both the kernel centers and the kernel radii are updated during learning. Third, the (transformed) inputs are assigned in probability to clusters in the STMK algorithm, whereas in our case, the (original) inputs are assigned with an activity-based winner-takes-all rule.

Appendix B: Tree-Based Hill-Climbing Algorithm
Conventions: D = lattice number of a neuron; Top = lattice number of the neuron that has the highest density estimate of all k + 1 neurons in the current hypersphere; Ultimate Top = lattice number of the neuron that represents a local maximum in the input density.

/* Three types of nodes:
     leaves : (0, D, successor, top)
     nodes  : (predecessor, D, successor, top)
     tops   : (predecessor, D, D, D) */

/* Initialize all neurons by labeling them all as leaves */
for (i ← 1; i ≤ N; i ← i + 1)
    label[i] ← (0, i, 0, 0)

/* Search for each neuron the top of the cluster it belongs to, */
/* with respect to the k nearest neighbors (parameter) */
for (i ← 1; i ≤ N; i ← i + 1)
    if (label[i].successor == 0)   /* neuron i has not been processed until now */
        pointer ← i   /* pointer points to the neuron currently being processed */
        while (label[pointer].D != label[pointer].successor)
            top ← MaxNeighbor(pointer, k)
            label[pointer].successor ← top
            if (top != pointer)
                /* pointer is not the ultimate top neuron of the cluster */
                label[top].predecessor ← pointer
                pointer ← top
            else
                label[top].top ← top
            if (label[pointer].successor != 0)
                /* Current neuron has been processed and leads to a top; */
                /* no further processing is needed ⇒ quit while loop. */
                /* First give all parsed neurons the D of the ultimate top */
                top ← label[pointer].top
                while (label[pointer].predecessor != 0)
                    pointer ← label[pointer].predecessor
                    label[pointer].top ← top
                break
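The same hill-climbing idea can be rendered compactly in Python (our own sketch, not code from the paper; `neighbors(i, k)`, returning the k nearest lattice neighbors of neuron i, is assumed to be supplied by the caller):

```python
def cluster_labels(density, neighbors, k):
    """For each neuron, follow the density gradient in lattice space:
    repeatedly jump to the highest-density neuron among itself and its
    k nearest lattice neighbors until a 'top' neuron (a local density
    maximum) is reached; all neurons reaching the same top share a label.
    density: sequence of density estimates; neighbors(i, k): neighbor ids."""
    n = len(density)
    top = [None] * n
    for i in range(n):
        path, j = [], i
        while top[j] is None:
            path.append(j)
            best = max([j] + list(neighbors(j, k)), key=lambda m: density[m])
            if best == j:          # j is a top neuron
                top[j] = j
                break
            j = best
        for m in path:             # label the whole parsed path at once
            top[m] = top[j]
    return top
```

As in the appendix, each neuron is visited once, and whole chains of "pointing" neurons are labeled in a single backward pass, so the cost stays linear in the number of neurons (up to the neighbor search).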
Acknowledgments
M.M.V.H. is supported by research grants received from the Fund for Scientific Research (G.0185.96N), the National Lottery (Belgium) (9.0185.96), the Flemish Regional Ministry of Education (Belgium) (GOA 95/99-06; 2000/11), the Flemish Ministry for Science and Technology (VIS/98/012), and the European Commission, Fifth Framework Programme (QLG3-CT2000-30161 and IST-2001-32114).

References

András, P. (2001). Kernel-Kohonen networks. Manuscript submitted for publication.
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computat., 10, 215–234.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Graepel, T., Burger, M., & Obermayer, K. (1997). Phase transitions in stochastic self-organizing maps. Physical Rev. E, 56(4), 3876–3890.
Graepel, T., Burger, M., & Obermayer, K. (1998). Self-organizing maps: Generalizations and new optimization techniques. Neurocomputing, 21, 173–190.
Graham, R. L., Knuth, D. E., & Patashnik, O. (1994). Concrete mathematics: A foundation for computer science. Reading, MA: Addison-Wesley.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
Lagus, K., & Kaski, S. (1999). Keyword selection method for characterizing text document maps. In Proc. ICANN99, 9th Int. Conf. on Artificial Neural Networks (Vol. 1, pp. 371–376). London: IEE.
Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Trans. Neural Networks, 1, 229–232.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computat., 7, 1165–1177.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett., 65(8), 945–948.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Sum, J., Leung, C.-S., Chan, L.-W., & Xu, L. (1997). Yet another algorithm which can generate topography map. IEEE Trans. Neural Networks, 8(5), 1204–1207.
Van Hulle, M. M. (1998). Kernel-based equiprobabilistic topographic map formation. Neural Computat., 10(7), 1847–1871.
Van Hulle, M. M. (2000). Faithful representations and topographic maps: From distortion- to information-based self-organization. New York: Wiley.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Yin, H., & Allinson, N. M. (2001). Self-organizing mixture networks for probability density estimation. IEEE Trans. Neural Networks, 12, 405–411.

Received March 8, 2001; accepted October 24, 2001.
LETTER    Communicated by Emilio Salinas

A Simple Model of Long-Term Spike Train Regularization

Relly Brandman
[email protected]
Department of Computer Science and Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana, IL 61801, U.S.A.

Mark E. Nelson
[email protected]
Department of Molecular and Integrative Physiology and Beckman Institute for Advanced Science and Technology, University of Illinois, Urbana, IL 61801, U.S.A.

A simple model of spike generation is described that gives rise to negative correlations in the interspike interval (ISI) sequence and leads to long-term spike train regularization. This regularization can be seen by examining the variance of the kth-order interval distribution for large k (the times between spike i and spike i + k). The variance is much smaller than would be expected if successive ISIs were uncorrelated. Such regularizing effects have been observed in the spike trains of electrosensory afferent nerve fibers and can lead to dramatic improvement in the detectability of weak signals encoded in the spike train data (Ratnam & Nelson, 2000). Here, we present a simple neural model in which negative ISI correlations and long-term spike train regularization arise from refractory effects associated with a dynamic spike threshold. Our model is derived from a more detailed model of electrosensory afferent dynamics developed recently by other investigators (Chacron, Longtin, St.-Hilaire, & Maler, 2000; Chacron, Longtin, & Maler, 2001). The core of this model is a dynamic spike threshold that is transiently elevated following a spike and subsequently decays until the next spike is generated. Here, we present a simplified version, the linear adaptive threshold model, which contains a single state variable and three free parameters that control the mean and coefficient of variation of the spontaneous ISI distribution and the frequency characteristics of the driven response. We show that refractory effects associated with the dynamic threshold lead to regularization of the spike train on long timescales. Furthermore, we show that this regularization enhances the detectability of weak signals encoded by the linear adaptive threshold model. Although inspired by properties of electrosensory afferent nerve fibers, such regularizing effects may play an important role in other neural systems where weak signals must be reliably detected in noisy spike trains. When modeling a neuronal system that exhibits this type of ISI correlation structure, the linear adaptive threshold model may provide a more appropriate starting point than conventional renewal process models that lack long-term regularizing effects.

Neural Computation 14, 1575–1597 (2002); © 2002 Massachusetts Institute of Technology

1 Introduction
When a spiking neuron encodes an input signal, subsequent processing of that signal by postsynaptic neurons must be based on changes in the statistical properties of the output spike train. If there is background spike activity, then the variability of the background will influence how reliably other neurons can detect the presence of a weak signal encoded in the spike train data. The variability of a spike train is often characterized by the coefficient of variation (CV) of the first-order interspike interval (ISI) distribution. However, the first-order ISI distribution provides information about variability only on short timescales comparable to the mean ISI (for review, see Gabbiani & Koch, 1998). It is possible for a spike train to be irregular on short timescales but regular on longer timescales, as we have shown experimentally for P-type (probability-coding) electrosensory afferent nerve fibers in a weakly electric fish (Ratnam & Nelson, 2000). This longer-term regularization can be observed by analyzing the kth-order interval distribution (the distribution of time intervals between spike i and spike i + k). If successive ISIs in the spike train are uncorrelated, then the variance of the kth-order distribution will be a factor of k times larger than the variance of the first-order ISI distribution. However, in our experimental study of electrosensory afferents, we found that the variance between, say, every fiftieth spike in the spike train was significantly smaller than would be expected if successive ISIs were uncorrelated. We further demonstrated that this regularization is associated with negative correlations in the ISI sequence and that the detectability of a weak signal can be significantly enhanced when such regularization exists. The negative correlation structure and regularizing effects observed in the data have recently been reproduced in a modeling study based on a stochastic model of firing dynamics (Chacron, Longtin, St.-Hilaire, & Maler, 2000; Chacron, Longtin, & Maler, 2001).

Refractory effects are known to have a short-term regularizing influence on spike activity by decreasing the CV of the first-order ISI distribution and increasing the temporal precision of the driven response (Berry & Meister, 1998). Refractory effects are often modeled by introducing a recovery function that reduces the firing probability immediately following a spike (for reviews, see Berry & Meister, 1998; Johnson, 1996). In such models, refractory effects are dependent only on the time of the previous spike and are not sensitive to the duration of previous interspike intervals. If the input is held constant in such models, then successive intervals are independent and identically distributed. In this case, no correlations are introduced into the ISI sequence. For such renewal models, the refractory mechanism has no impact on the long-term regularity of the spike train. In contrast, the
refractory mechanism presented here is implemented as a dynamic state variable that retains a memory of previous activity spanning multiple interspike intervals. This nonrenewal model of spike generation gives rise to negative correlations in the ISI sequence and long-term regularization of the spike train.

Here we present a simple model of a spike-generating mechanism that gives rise to regularizing effects similar to those observed in electrosensory afferent spike trains. Our model is inspired by the more detailed model of Chacron et al. (2000, 2001), in which they showed that a stochastic model of spike generation with a dynamic threshold is able to describe accurately the key features of spike trains observed in the electrosensory afferent data (Nelson, Xu, & Payne, 1997; Ratnam & Nelson, 2000). To achieve a good match with the data, their model included about 15 parameters. However, because our model has only 3 parameters and one state variable, the relationships between the model parameters and the spike train properties are more readily apparent. Because of its simplicity, the model is easily adaptable to many neural modeling applications. In particular, it is a better choice than more widely used renewal process models when modeling spike trains that exhibit long-term regularizing effects.

2 The Linear Adaptive Threshold Model
The goal of this simplified model is to obtain a minimal description of the spike-generating mechanism that gives rise to long-term spike train regularization. This simplified model is intended to serve as a generic basis for constructing more detailed system-specific models, as illustrated by the example in section 5. Although the model is highly simplified, it captures the important dynamic features of the process and reflects a level of abstraction similar to that of the well-known integrate-and-fire model (Stein, 1967). An important simplification is that the model presented here uses a linear decay function rather than the exponential threshold decay function used by Chacron et al. (2000, 2001). As we will show, this results in a simpler relationship between the model parameters and the spike train characteristics. Finally, the model presented here is formulated in a discrete-time framework, although it can also be cast in continuous time. A discrete-time formulation has the advantage of avoiding complications associated with the numerical integration of gaussian noise in continuous time, and it is more computationally efficient because it requires fewer integration steps per unit time. We are currently using an extended version of this model (see section 5) to simulate the neural activity of the entire population of 15,000 P-type electrosensory afferent nerve fibers of an electric fish, so matters of computational efficiency become of practical importance. The linear adaptive threshold model contains three essential parameters (a, b, and σ), and a single dynamic state variable, the spike threshold θ. For the sake of generality, we also include a fourth parameter, c, the input
gain, which we will subsequently take to be unity. As will be shown in section 4, the parameter c is redundant in terms of functionality; it is included to facilitate conceptualization of the model in a neural framework. If one wishes to think of the input as a current and the threshold as a voltage level, then the gain parameter c takes on units of electrical resistance. Figure 1 illustrates the operating principles of the model. The model is described by four update rules, which are evaluated in the following order at each time step n:

v[n] = c i[n] + w[n]    (2.1)

θ[n] = θ[n−1] − (b/a)    (2.2)

s[n] = H(v[n] − θ[n]) = { 1 if v[n] ≥ θ[n]; 0 otherwise }    (2.3)

θ[n] = θ[n] + b s[n] = { θ[n] + b if s[n] = 1; θ[n] otherwise },    (2.4)
where H is the Heaviside function, defined as H(x) = 0 for x < 0 and H(x) = 1 for x ≥ 0. The voltage v is the product of the input resistance c and the instantaneous input current i, plus random noise w, where w is zero-mean gaussian noise with variance σ². (In section 5, we show that the model can easily be extended to include the effects of a membrane time constant, but this extension is not necessary for understanding the regularizing effects of the model.) When the voltage v rises above a threshold level θ, a spike is generated (s = 1), and the threshold level is elevated by an amount b. The threshold subsequently decays linearly with a slope of −b/a until the next spike is generated. From equation 2.2 alone, one might get the impression that the threshold θ is unbounded and could decay to arbitrarily large negative values. However, because the threshold level is boosted whenever θ < v, the voltage level v serves as the effective lower bound for the threshold. The output of the model is a binary spike train s, with s[n] = 1 if a spike was generated at time step n and s[n] = 0 otherwise. The model parameters are restricted to a > 1, b > 0, and σ > 0. The parameter a has units of time steps, while b and σ have units of voltage. The update interval can be adjusted to meet the temporal resolution required for a specific modeling application.

3 Statistical Properties of Spontaneous Spike Activity in the Model

3.1 Mean and CV of the First-Order ISI. In the absence of an input signal (i = 0), the linear adaptive threshold model generates spontaneous spike activity. The parameter a controls the mean ISI, and the ratio σ/b controls the CV of the ISI distribution. Representative spontaneous ISI distributions are shown in Figure 2. For a sufficiently long spike train, the empirically
Figure 1: Representative time history of variables in the linear adaptive threshold model. Model parameters: a = 5 msec, b = 1 mV, σ = 0.2 mV, c = 1 MΩ. The input signal is a sinusoid with a period of 100 time steps: i[n] = sin(2πn/100) nA. The voltage v[n], shown by the heavy solid line, is a noisy version of the input. The spike threshold θ[n] is shown by the sawtooth-shaped solid line. A spike (s[n] = 1) is generated whenever the voltage crosses the threshold level. Immediately following each spike, the threshold is boosted by an amount b and subsequently decays linearly with a slope −b/a until the next spike is generated. Total duration shown in the figure is 100 time steps.
measured mean ISI (in time steps) becomes identical to a. The mathematical basis for this result is presented in section 3.4 (see equation 3.7). The CV of the ISI distribution can range between 0 and 1 and increases monotonically with σ/b. The mean and CV of an experimentally observed spontaneous ISI distribution can be matched by appropriate adjustments of a and σ/b and the size of the time step. In our experimental studies of electrosensory afferents in weakly electric fish, the frequency of the oscillatory electric organ discharge (EOD) signal provides a natural time reference. P-type afferents fire at most one spike per EOD cycle (Scheich, Bullock, & Hamstra, 1973); hence, it is natural to set the step size equal to one EOD cycle. For brown ghost knifefish, Apteronotus leptorhynchus, the EOD frequency is extremely stable for an individual fish (Moortgat, Keller, Bullock, & Sejnowski, 1998) and ranges from about 600 to 1200 Hz. The corresponding step size in the model would range from 0.8 to 1.7 msec. Figure 3A1 shows the spontaneous ISI distribution for a representative P-type afferent fiber (Ratnam & Nelson, 2000). The ISI distribution has a mean of 2.9 EOD cycles and a CV of 0.46. Figure 3A2 shows the corresponding distribution for the linear adaptive threshold model with a = 2.9 msec, b = 2.0 mV, and σ = 1.0 mV. Although the two distributions are clearly not identical, the mean and CV of the model ISI distribution match those of the data (mean = 2.9, CV = 0.46).
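The four update rules are simple enough to check directly. The following is a minimal sketch in Python (our own re-implementation for illustration, not the authors' code; all names are ours), run with the parameters quoted above (a = 2.9, b = 2.0, σ = 1.0):

```python
# Minimal sketch of the linear adaptive threshold model (eqs. 2.1-2.4),
# simulated for spontaneous activity (i = 0, c = 1). Our re-implementation.
import numpy as np

def simulate(a=2.9, b=2.0, sigma=1.0, n_steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0                      # dynamic threshold
    spikes = []
    for n in range(n_steps):
        v = rng.normal(0.0, sigma)   # eq. 2.1: v[n] = c*i[n] + w[n], i = 0
        theta -= b / a               # eq. 2.2: linear threshold decay
        if v >= theta:               # eq. 2.3: Heaviside comparison
            spikes.append(n)
            theta += b               # eq. 2.4: post-spike threshold boost
    return np.array(spikes)

spikes = simulate()
isi = np.diff(spikes)
print(f"mean ISI = {isi.mean():.2f} steps, CV = {isi.std() / isi.mean():.2f}")
```

With these settings the text reports a mean ISI of 2.9 steps and a CV of 0.46, and the simulation should land close to both values.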
Figure 2: Representative spontaneous ISI distributions obtained from the linear adaptive threshold model. The parameter values for a and σ/b, as well as the empirically measured mean and CV of the ISI distribution, are shown in each panel. The parameter a controls the mean of the ISI distribution, and the ratio σ/b controls the CV. The left three panels (A1–C1) show results for a relatively short mean ISI (a = 3 msec), while the right three panels (A2–C2) show results for a longer mean ISI (a = 30 msec). Simulation duration was 100,000 time steps.

3.2 Negative Correlations in the ISI Sequence. The linear adaptive threshold model gives rise to negative correlations between adjacent intervals in the ISI sequence, meaning that short intervals tend to be followed by long intervals, and vice versa. Similar effects are observed in electrosensory afferent data, as illustrated by the joint interval histograms of adjacent ISIs shown in Figures 3B1 and 3B2. In the experimental data, we observed
Figure 3: Spontaneous spike train properties of the linear adaptive threshold model compared with experimental data. The left side (A1–C1) shows the ISI distribution, joint interval histogram, and variance-to-mean ratio of the kth-order interval distribution for a representative P-type electrosensory afferent nerve fiber from an electric fish (Ratnam & Nelson, 2000). The right side (A2–C2) shows the corresponding plots for the model with a = 2.9 msec, b = 2.0 mV, and σ = 1.0 mV. The model is able to match the mean and variance of the first-order ISI distribution (A1, A2), as well as qualitatively reproduce the short-long correlations between neighboring intervals observed in the joint interval histogram (B1, B2), and the approximate decline as k⁻¹ in the variance-to-mean ratio (C1, C2). The dashed line in C1 and C2 indicates k⁻¹. Simulation duration was 100,000 time steps.
a mean correlation coefficient of −0.52 in a population of 52 P-type afferent spike trains (Ratnam & Nelson, 2000). For the particular unit shown in Figure 3B1, the correlation coefficient was −0.58, and for the model it was −0.40 (see Figure 3B2). The linear adaptive threshold model qualitatively captures the short-long correlation structure of the ISI sequences observed in the data. In the model, the negative correlation structure arises because the decay function tends to leave the threshold at a higher level following a short interval than following a long interval. This short-long correlation structure has been observed experimentally in many neural systems (Kuffler, Fitzhugh, & Barlow, 1957; Calvin & Stevens, 1968; Johnson, Tsuchitani, Linebarger, & Johnson, 1986; Lowen & Teich, 1992), and is one indication of spike train regularization.
For a renewal model, the Fano factor asymptotically approaches a constant value for large T, but it is not constant for small count windows (Cox & Lewis, 1966). Thus, analysis of the kth-order interval distributions offers a more definitive test for deviations from a renewal process in the ISI sequence.

In both the data and the model, regularization effects persist over time periods that are much longer than a single interspike interval. As described above, these longer-term effects can be quantified by observing the behavior of the variance-to-mean ratio of the kth-order interval distribution with increasing interval order k. As shown in Figure 3C1, the variance-to-mean ratio for the data falls rapidly for the first 10 to 20 interval orders (approximately as k⁻¹). The behavior of the model is quite similar (see Figure 3C2). In the model, the dynamic threshold provides a long-term memory of previous spike activity, allowing regularizing effects to persist over multiple interspike intervals. Thus, the simple linear adaptive threshold model is able to capture the key features of spike train regularization observed in the experimental data.
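The two diagnostics of sections 3.2 and 3.3 (the adjacent-interval correlation and the kth-order variance-to-mean ratio) are easy to compute from a simulated train. A hedged sketch, re-implementing equations 2.1–2.4 with parameters matching one of the Figure 2 panels (a = 3, σ/b = 1); the helper name is ours:

```python
# Sketch (our re-implementation): simulate spontaneous activity, then
# compute the lag-1 ISI correlation and kth-order variance-to-mean ratios.
import numpy as np

def spike_steps(a, b, sigma, n_steps=200_000, seed=1):
    rng = np.random.default_rng(seed)
    theta, out = 0.0, []
    for n in range(n_steps):
        theta -= b / a                       # linear threshold decay
        if rng.normal(0.0, sigma) >= theta:  # v[n] = w[n] when i = 0
            out.append(n)
            theta += b                       # post-spike threshold boost
    return np.array(out)

t = spike_steps(a=3.0, b=1.0, sigma=1.0)
isi = np.diff(t).astype(float)
rho1 = np.corrcoef(isi[:-1], isi[1:])[0, 1]
print(f"lag-1 ISI correlation: {rho1:.2f}")   # negative: short-long pattern
for k in (1, 10, 100):
    ik = (t[k:] - t[:-k]).astype(float)       # kth-order intervals
    print(f"k={k:3d}  var/mean = {ik.var() / ik.mean():.3f}")
```

For a renewal train the variance-to-mean column would be flat; here it falls roughly as 1/k, in line with Figure 3C.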
3.4 The Mathematical Basis of Long-Term Regularity in the Model. In this section, we explain how the mathematical structure of the linear adaptive threshold model gives rise to long-term regularity of the output spike train. Specifically, we analyze spontaneous spike activity and show that the variance of the kth-order interval distribution Var(Ik) approaches a constant value for large k. The fact that the variance becomes independent of interval order k means, for example, that the variance in the distribution of time intervals between every thousandth spike in the spike train is essentially the same as the variance between every hundredth spike. This is in striking contrast to a renewal process model, for which the variance would continue to increase linearly with k, giving rise to a variance-to-mean ratio that stays constant for all interval orders k. The key result regarding long-term spike train regularity for the linear adaptive threshold model is that Var(Ik) approaches a constant for large k. Since the mean interval between spikes grows linearly with interval order k, the variance-to-mean ratio will fall as k⁻¹, as illustrated in Figure 3. To understand why Var(Ik) approaches a constant for large k, it is useful to recast the linear adaptive threshold model (equations 2.1–2.4) into a slightly different form. The new formulation gives rise to a set of spike times that are identical to those generated by the original model, but the internal state variables are handled differently. Rather than raising the threshold level by an amount b each time a spike occurs (as in equation 2.4), we will instead lower the mean voltage level by an amount b. Since the decision of whether to generate a spike (see equation 2.3) depends only on the relative difference between the threshold level and the voltage level, these two formulations will give rise to an identical set of spike times.
Hence, either formulation can be used when analyzing the statistical properties of the output spike train. The two formulations of the linear adaptive threshold model are illustrated in Figure 4. Following the structure of the original model (equations 2.1–2.4), we express the reformulated model as:

v[n] = c i[n] + w[n] + vbase    (3.1)

θ[n] = θ[n−1] − (b/a)    (3.2)

s[n] = H(v[n] − θ[n]) = { 1 if v[n] ≥ θ[n]; 0 otherwise }    (3.3)

vbase = vbase − b s[n] = { vbase − b if s[n] = 1; vbase otherwise },    (3.4)
where vbase is the newly introduced baseline voltage level, and all other variables are as defined previously. Note that only two of the equations have changed from the original model (equations 3.1 and 3.4), but all four have
Figure 4: Reformulation of the linear adaptive threshold model to facilitate the analysis of spike train properties. (A) Representative time history of spontaneous spike activity and internal state variables as originally formulated (see equations 2.1–2.4). (B) Time history of the state variables in the reformulated version of the model (see equations 3.1–3.4). The reformulated model gives rise to an identical set of spike times. Parameter values: a = 20 msec, σ = 0.2 mV, b = 1 mV.
been rewritten above for convenience. In the reformulated model (equations 3.1–3.4), the threshold level θ is never boosted; rather, it falls monotonically with a constant slope (see equation 3.2). For spontaneous spike activity, the input i is zero; thus, the voltage v is simply the baseline level vbase plus random noise (see equation 3.1). In this reformulated version of the model, the threshold falls linearly toward a noisy voltage floor; each time a spike is generated, the mean level of the floor drops by an amount b, as illustrated in Figure 4B. Now consider what happens in the reformulated model between spike i and spike i + k. Since k spikes were generated, the baseline level vbase will have dropped by an amount kb. If we choose k sufficiently large (k ≫ σ/b), then the drop in the baseline level kb will be much larger than the standard deviation σ of the voltage fluctuations around the baseline. Thus, the change in voltage level between the time of spike i and spike i + k is
Δv_{i,i+k} = −kb + O(σ),    (3.5)
where O(σ) is a small, random correction on the order of σ related to the voltage fluctuations around the baseline level. Since the threshold falls linearly at a constant slope (−b/a) and spikes are generated whenever the threshold crosses the voltage level, the time difference between spike i and spike i + k is equal to the voltage difference divided by the threshold slope; thus:

Δt_{i,i+k} = Δv_{i,i+k} / (−b/a) = ak + O(aσ/b).    (3.6)
Thus, for sufficiently large k, the time interval between spike i and spike i + k is equal to ak, plus a small random correction on the order of aσ/b. As long as the threshold level starts well outside the noise band (kb ≫ σ), the variance of this random correction will be independent of k. Hence Var(Ik) becomes constant for sufficiently large k (k ≫ σ/b). Furthermore, the mean interspike interval ⟨ISI⟩ is given by

⟨ISI⟩ = lim_{k→∞} Δt_{i,i+k} / k = a,    (3.7)
as was noted in section 3.1. The two key results obtained above are that Var(Ik) approaches a constant for large k and the mean ISI is equal to a. These two results are independent of the noise structure used in the model. We formulated the model using gaussian noise, but the same results would have been obtained for other forms, such as uniform or pink noise. The noise structure will have an effect on the asymptotic numerical value of Var(Ik). However, the fact that Var(Ik) approaches a constant value, and hence that the variance-to-mean ratio falls as k⁻¹ as shown in Figure 3C2, is a robust result that is independent of assumptions about the detailed noise structure.

4 Driven Response Characteristics of the Model
The driven response characteristics of the linear adaptive threshold model were evaluated using sinusoidal stimuli at frequencies between 0.1 and 100 Hz. In these simulations, the step size was taken to be 1 msec. The input signal was given by i[n] = S sin(2πfn/1000), where S is the stimulus amplitude (arbitrary units) and f is the stimulus frequency (Hz). The total stimulus duration was 100 seconds at each stimulus frequency. The response gain and phase were computed using methods described in Nelson et al. (1997). Briefly, cycle histograms of spike times were constructed and normalized such that the ordinate corresponded to the firing rate in spikes per second. A single-cycle sinusoid was fit to the cycle histogram, r(x) = R sin(2πx + φ) + B, where x is the cycle fraction (0 ≤ x ≤ 1), R is the response amplitude, φ is the response phase, and B is the baseline firing rate. The gain of the response at each frequency is computed as the ratio of the response amplitude R to the stimulus amplitude S and has units of spikes per second per unit input. The phase of the response at each frequency is given by the best-fit value of φ (degrees).
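The cycle-histogram procedure can be sketched as follows. This is our own illustrative implementation, not the authors' analysis code, and the rate-modulated Poisson train used to exercise it is an assumption for the demo, not the paper's model:

```python
# Sketch of the gain/phase analysis described above (our implementation).
# The sinusoid fit r(x) = R sin(2*pi*x + phi) + B is linear in
# A = R cos(phi) and C = R sin(phi), so least squares suffices.
import numpy as np

def cycle_fit(spike_times_s, freq_hz, duration_s, n_bins=50):
    phase = (np.asarray(spike_times_s) * freq_hz) % 1.0   # cycle fraction
    counts, edges = np.histogram(phase, bins=n_bins, range=(0.0, 1.0))
    rate = counts * n_bins / duration_s                   # spikes/s per bin
    x = 0.5 * (edges[:-1] + edges[1:])                    # bin centers
    M = np.column_stack([np.sin(2 * np.pi * x),
                         np.cos(2 * np.pi * x),
                         np.ones_like(x)])
    A, C, B = np.linalg.lstsq(M, rate, rcond=None)[0]
    return np.hypot(A, C), np.degrees(np.arctan2(C, A)), B

# Demo on a rate-modulated Poisson train: rate = 50 + 20 sin(2*pi*f*t) spk/s.
rng = np.random.default_rng(0)
dt, T, f = 1e-3, 200.0, 2.0
t = np.arange(0.0, T, dt)
spikes = t[rng.random(t.size) < (50 + 20 * np.sin(2 * np.pi * f * t)) * dt]
R, phi, B = cycle_fit(spikes, f, T)
print(f"R = {R:.1f} spikes/s, phi = {phi:.1f} deg, B = {B:.1f} spikes/s")
```

The fit should recover the modulation depth (near 20 spikes/s), a phase near zero, and the baseline rate (near 50 spikes/s) of the demo train.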
As illustrated in Figure 5, the linear adaptive threshold model has high-pass filter characteristics. At low frequencies, the gain is proportional to the stimulus frequency, and the phase shift is 90 degrees, implying that the model behaves as a differentiator. At higher frequencies, the gain curve becomes flat, and the phase drops toward zero. The overall gain of the response is determined by the model parameter b, which reflects the amount that the threshold level is elevated following a spike. The larger the threshold boost, the lower the gain. In the low-frequency range, where the model behaves as a differentiator, the gain is equal to 2πf/b, with units of spikes per second per unit input. This functional form can be understood by considering the response of the model to a sinusoidal stimulus of amplitude S and frequency f. The rising phase of the sine wave will have a maximum slope of 2πfS. The rising slope will tend to shorten the mean interval between threshold crossings relative to baseline conditions. Recall that the threshold falls with a constant slope of −b/a (see equation 2.2), and the mean ISI under baseline conditions is equal to a (see equation 3.7). For a weak stimulus, a differential analysis reveals that the ISIs are shortened on average by an amount corresponding to a change in firing rate of 2πfS/b, and hence an overall gain of 2πf/b. If the input is scaled by an input gain c as in equation 2.1, then the overall gain becomes 2πfc/b. The parameter c is redundant in terms of being able to control the input-output gain of the model, since gain changes can be accomplished by changing b. However, as discussed in section 2, the parameter c is convenient if one wishes to interpret the model variables as currents and voltages. Empirically, the phase of the response remains unaffected by changes in gain (see Figure 5A). The corner frequency of the high-pass filter is determined by the model parameters a and σ/b.
As these values increase, the corner frequency decreases. Qualitatively, the location of the corner frequency is related to the timescale that characterizes the interval between successive spikes in the spike train. If the shape of the ISI distribution is such that almost all ISIs are short compared to the period of the stimulus, the model behaves as a differentiator. If either a (which controls the mean ISI) or σ/b (which controls the CV) is large enough so that some of the ISIs in the spike train become comparable to the stimulus period, then the gain of the response begins to roll off, giving rise to the knee in the gain curve. Changes in the corner frequency also result in a corresponding change in the phase of the response (see Figure 5B).

5 Extensions to the Model
We now illustrate how one might extend the model to make it more biophysically plausible. For example, the extensions discussed here allow the model to match the experimentally measured frequency response characteristics of electrosensory afferent data better. The key point that we wish
Figure 5: Frequency response characteristics of the linear adaptive threshold model. The model has high-pass filter characteristics. (A) Gain and phase for three different values of b, with a = 3 msec and σ/b = 1. The gain has units of spikes per second per unit input. The gain varies inversely with b; the phase curves are overlapping and indistinguishable. (B) Gain and phase for three different values of a, with b = 0.1 mV and σ = 0.1 mV. The parameter a influences the corner frequency of the high-pass filter. Simulation duration was 100,000 time steps for a = 3 msec and a = 10 msec, and 500,000 time steps for a = 30 msec.
to make, however, is not that the extensions improve the fit to empirical data, but rather that the extensions do not alter the long-term regularizing effects exhibited by the simpler model. In the linear adaptive threshold model, there were no dynamics associated with the membrane voltage v. Most neural modeling applications would want to include at least the effects of leaky integration by the cell membrane. This can be modeled as a first-order low-pass filter with time constant τm, which is incorporated by replacing equation 2.1 with equations 5.1 and 5.2:

u[n] = exp(−1/τm) u[n−1] + [1 − exp(−1/τm)] i[n]    (5.1)

v[n] = u[n] + w[n].    (5.2)
Note that the noise term w[n] is added to the output of the low-pass filter u rather than to the input. Thus, we consider the noise to reflect stochastic properties that are intrinsic to the neuron rather than properties of the input signal. In terms of the frequency response characteristics, this extension to
the model causes a roll-off in gain and a decrease in phase above the corner frequency (fc = 1/(2πτm)) of the low-pass filter. The second extension is to change the linear threshold decay function to a more biophysically plausible exponential decay toward a baseline level θ0, with a decay time constant τθ, as originally suggested by Chacron et al. (2000). This is incorporated by replacing equation 2.2 with equation 5.3:

θ[n] = exp(−1/τθ) θ[n−1] + [1 − exp(−1/τθ)] θ0.    (5.3)
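A minimal sketch of the extended model, combining the membrane low-pass filter (equations 5.1 and 5.2) with the exponential threshold decay (equation 5.3). This is our re-implementation with the illustrative parameter values quoted in the Figure 6 caption, not the authors' code:

```python
# Sketch of the extended model of section 5 (our re-implementation):
# leaky membrane filtering of the input plus exponential threshold decay
# toward theta0. Parameter values are illustrative.
import numpy as np

def extended_model(i, b=0.11, sigma=0.04, theta0=-1.0, tau_m=2.0,
                   tau_th=30.0, seed=3):
    rng = np.random.default_rng(seed)
    am, ath = np.exp(-1.0 / tau_m), np.exp(-1.0 / tau_th)
    u, theta = 0.0, 0.0
    s = np.zeros(len(i), dtype=int)
    for n, i_n in enumerate(i):
        u = am * u + (1 - am) * i_n                 # eq. 5.1: membrane low-pass
        v = u + rng.normal(0.0, sigma)              # eq. 5.2: intrinsic noise
        theta = ath * theta + (1 - ath) * theta0    # eq. 5.3: exponential decay
        if v >= theta:                              # eq. 2.3: spike test
            s[n] = 1
            theta += b                              # eq. 2.4: threshold boost
    return s

s = extended_model(np.zeros(100_000))               # spontaneous activity
isi = np.diff(np.flatnonzero(s))
print(f"mean ISI = {isi.mean():.1f} steps, CV = {isi.std() / isi.mean():.2f}")
```

For these parameters the spontaneous mean ISI should come out at a few time steps, in the range of the baseline intervals shown for the data in Figure 3.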
This change in the representation of the threshold decay does not have a significant effect on the general features of the first-order ISI distribution (see Figure 6A) or the long-term regularization properties (see Figure 6B), but it does alter the frequency response characteristics of the model (see Figure 6C). Representative gain and phase plots for the extended model are shown in Figure 6C (solid lines). The change in frequency response characteristics for the extended model can be appreciated by comparing the general shapes of the gain and phase curves in Figure 6C with those for the simpler model shown in Figure 5. The parameters for the extended model were selected to closely match the average properties of P-type electrosensory afferents recorded in our experimental data (Nelson et al., 1997; Ratnam & Nelson, 2000). The extended model (equations 5.1–5.3, 2.3, and 2.4) is able to provide a good description of the response characteristics of P-type electrosensory afferents, including the baseline ISI distribution, interval correlations, and frequency response characteristics. However, the main point of this section is to demonstrate that the linear adaptive threshold model can be extended to match empirical data better, while maintaining the long-term regularizing effects that are of central importance here (see Figure 6B).

6 Weak Signal Detectability
In this section, we demonstrate that under certain circumstances, long-term spike train regularization can dramatically improve the detectability of a weak stimulus. We illustrate this by encoding a weak signal using two different neuron models: one that exhibits long-term spike train regularization and one that does not. The parameters of the two models are adjusted to have matched characteristics, including the mean and CV of the spontaneous ISI distribution and the frequency response characteristics (gain and phase) of the driven response. Such characteristics are commonly used by neural modelers to assess how well a particular model describes experimental data. We show that although the two models are well matched by these criteria, they can have significantly different properties in terms of signal detectability. Our goal here is not to model any specific biological signal or system but rather to present a generic example illustrating the potential functional importance of long-term spike train regularization in
Figure 6: Frequency response characteristics of the exponential adaptive threshold model compared with experimental data. The left panels show the spontaneous spike train properties of the model: (A) ISI distribution and (B) variance-to-mean ratio of the kth-order interval distribution. The right panel (C) shows gain and phase of the driven response. The data points show the population-averaged responses from 99 P-type electrosensory afferent fibers (modified from Nelson et al., 1997). Error bars represent the standard deviation of the population average at each frequency. The continuous solid lines show the gain and phase of the exponential adaptive threshold model with b = 0.11 mV, σ = 0.04 mV, θ0 = −1 mV, τm = 2 msec, and τθ = 30 msec. Simulation duration was 300,000 time steps.
biological systems and highlighting the importance of selecting a modeling framework that adequately accounts for correlations in the ISI sequence.

6.1 Linear Adaptive Threshold Model. For a model that exhibits long-term regularization, we use the simple form of the adaptive threshold model. Alternatively, we could have used the extended model, since it also exhibits long-term regularization, but the simple model embodies the essential features that are relevant for the comparison. For this example, we implement equations 2.1 through 2.4 with the following parameters: a = 20 msec, b = 0.5 mV, and σ = 1 mV and a time step of 1 msec. This parameter set gives rise to a spontaneous ISI distribution with a mean of 20 msec and a CV of 0.69 (see Figure 7A1). For this example, we intentionally chose a σ/b ratio that produces an irregular spike train on short timescales, as judged by the CV of the first-order ISI distribution. The frequency response characteristics of the model are summarized in Figure 7B1. The model has high-pass
filter characteristics with a corner frequency of about 8 Hz. The effects of long-term spike train regularization are shown in Figure 7C1, where it is seen that the variance-to-mean ratio for the kth-order interval distribution decreases as k⁻¹. As discussed in section 3.4, this decrease in long-term variability arises from memory effects associated with the threshold dynamics.
6.2 Integrate-and-Fire Model with Random Threshold. We now wish to compare this model with one lacking any such memory effects. For the memoryless model, we also need to be able to adjust the mean and CV of the spontaneous ISI distribution, as well as the frequency response characteristics. These criteria can be satisfied by using a stochastic integrate-and-fire model with random threshold (Gabbiani & Koch, 1996, 1998), coupled with a linear prefilter to adjust the frequency response characteristics. In this model, the input signal i is passed through a unity-gain high-pass prefilter with time constant τf and summed with a constant bias input Ib, which controls the spontaneous firing rate of the model. This input signal is integrated on each time step. When the integrated signal v exceeds a threshold θ, a spike is generated (s = 1). Following a spike, v is reset to zero, and θ is reset to a new random value drawn from a gamma distribution of order m. Because the reset values contain no information about the previous state of the system, there are no memory effects in the ISI sequence of this model. In a discrete-time representation, this memoryless model including the high-pass prefilter is described by the following update rules:

f[n] = exp(−1/τf) f[n−1] + [1 − exp(−1/τf)] i[n]    (6.1)

v[n] = v[n−1] + i[n] − f[n] + Ib    (6.2)

s[n] = H(v[n] − θ[n])    (6.3)

v[n] = (1 − s[n]) v[n]    (6.4)

θ[n] = (1 − s[n]) θ[n] + s[n] gm[n],    (6.5)

where gm[n] are random values drawn from a gamma distribution of order m with mean x̄ (Gabbiani & Koch, 1998):

gm(x) = cm (x/x̄)^(m−1) exp(−mx/x̄)    (6.6)

with

cm = m^m / [(m − 1)! x̄].    (6.7)
The random threshold model as described above has four free parameters: τf, Ib, m, and x̄.
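Update rules 6.1 through 6.5 can also be sketched directly. The following is our re-implementation (not the authors' code), using the parameter values quoted later in section 6.3 (τf = 20, Ib = 0.51, m = 2, x̄ = 10):

```python
# Sketch of the renewal comparison model (eqs. 6.1-6.5, our version):
# integrate-and-fire with a threshold redrawn from a gamma distribution
# after each spike, plus a unity-gain high-pass prefilter on the input.
import numpy as np

def random_threshold_model(i, tau_f=20.0, i_b=0.51, m=2, xbar=10.0, seed=4):
    rng = np.random.default_rng(seed)
    af = np.exp(-1.0 / tau_f)
    f = v = 0.0
    theta = rng.gamma(m, xbar / m)       # order-m gamma with mean xbar
    s = np.zeros(len(i), dtype=int)
    for n, i_n in enumerate(i):
        f = af * f + (1 - af) * i_n      # eq. 6.1: low-pass of input
        v += i_n - f + i_b               # eq. 6.2: high-passed input + bias
        if v >= theta:                   # eq. 6.3: threshold crossing
            s[n] = 1
            v = 0.0                      # eq. 6.4: voltage reset
            theta = rng.gamma(m, xbar / m)   # eq. 6.5: fresh random threshold
    return s

s = random_threshold_model(np.zeros(50_000))     # spontaneous activity
isi = np.diff(np.flatnonzero(s))
print(f"mean ISI = {isi.mean():.1f} steps, CV = {isi.std() / isi.mean():.2f}")
```

With i = 0 the prefilter output stays at zero and v ramps up by Ib per step, so the mean ISI is approximately x̄/Ib ≈ 20 steps and the CV approaches 1/√m ≈ 0.71, close to the matched values reported in section 6.3.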
Figure 7: Comparison of two neural models with matched spontaneous ISI and driven response characteristics. The left side (A1–C1) shows results for the linear adaptive threshold model (see equations 2.1–2.4), while the right side (A2–C2) shows an integrate-and-fire-based model with a random threshold (see equations 6.1–6.5). The model parameters were adjusted to yield similar first-order ISI distributions (A1, A2) and similar frequency response characteristics (B1, B2). However, the higher-order interval statistics, as characterized by the variance-to-mean ratio of the kth-order interval distribution, are quite different (C1, C2). The linear adaptive threshold model exhibits strong regularizing effects at large interval orders, whereas the random threshold model has a variance-to-mean ratio that is independent of interval order.
1592
Relly Brandman and Mark E. Nelson
6.3 Comparison of Response Characteristics. The response properties of the stochastic integrate-and-fire model are shown in Figure 7 for τ_f = 20, I_b = 0.51, m = 2, and x̄ = 10. The mean and variance of the spontaneous ISI distribution (see Figure 7A2) are almost identical to those of the adaptive threshold model (see Figure 7A1). Also, the frequency response characteristics of the two models are very similar (see Figures 7B1 and 7B2). However, the random threshold model has no memory effects in the ISI sequence. Hence, for spontaneous spike activity, each interspike interval is independent of the previous interval. For such a renewal process model, both the mean and variance of the kth-order interval distribution grow linearly with interval order k, and hence the variance-to-mean ratio is independent of k (see Figure 7C2). Thus, we see that the two models have almost identical response characteristics, except for their long-term regularity as measured by the kth-order interval variance-to-mean ratios.

6.4 Comparison of Signal Detectability. We now provide a weak input signal to both model neurons and evaluate how reliably the signal can be detected in the output spike train. Specifically, we consider a single-cycle sinusoidal input signal with amplitude A and duration D, satisfying the boundary conditions that the stimulus level and slope are zero at the beginning and end of the stimulus cycle. In discrete time, the input signal can be represented as
i[n] = A[1 − cos(2π n / D)].   (6.8)
In order to highlight the effects of long-term spike train regularization, we consider the case where the stimulus duration spans multiple interspike intervals. The mean interspike interval for the two matched models is 20 msec, as determined from a 10 s interval of simulated baseline activity with no stimulus present. In the following example, we consider an input signal with duration D = 1000 msec, such that on average, about 50 spikes occur during a stimulus cycle. The stimulus amplitude is chosen to be A = 0.25. The average response to 1000 presentations of this stimulus is shown in Figure 8A1 for the linear adaptive threshold model and in Figure 8A2 for the random threshold model. In both cases, the response is sinusoidal with an amplitude of approximately 3 spikes per second. Note that the phase is shifted by approximately 90 degrees relative to the stimulus. This is because the neurons are operating as differentiators at this stimulus frequency and are thus responding to the slope of the stimulus rather than its absolute magnitude. As can be seen by comparing Figures 8A1 and 8A2, there is no obvious difference in response gain or variability in the poststimulus rate histograms, nor is there any obvious difference in the short-term variability of the individual spike trains shown in the dot raster displays. This similarity in the response properties of the two models is not surprising, given that they were tuned to have matching characteristics. Although the properties of the two models are similar on average, the detectability of the stimulus on a trial-by-trial basis is dramatically different. The stimulus does not change the mean number of spikes observed during a trial. Rather, there is a slight increase in the spike count during the first half of the trial and a slight decrease during the last half. For this particular stimulus amplitude, there is a mean increase of one spike in the first half of the trial and a mean decrease of one spike in the second half of the trial, relative to the baseline level. To characterize the detectability of this small change in the spike train statistics, we presented each neuron model with a set of randomized trials, half of which contained a stimulus (see equation 6.8) and half of which did not. The detection task requires making a prediction on a trial-by-trial basis of whether the stimulus was present, based on the binary spike train data s_i for that trial. Since the mean number of spikes does not change in the presence of the stimulus, this decision cannot be based on the total spike count. To detect the stimulus optimally, the spike train data are passed through a filter with an impulse response matched to the expected temporal profile of the signal (Kay, 1998). In this case, the matched filter m is well approximated by a single-cycle sinusoid with zero phase shift,

m[n] = sin(2π n / D),   (6.9)

and the output of the matched filter z_i on trial i is

z_i = Σ_{n=1}^{D} m[n] s_i[n].   (6.10)
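A minimal sketch of this matched-filter detection scheme follows. It does not reproduce the adaptive threshold model; instead, a generic Bernoulli spiker whose firing probability is weakly modulated in phase with the filter stands in for the (90-degree-shifted) driven response, and the baseline rate, modulation depth, and trial count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1000                 # stimulus duration in 1 ms bins
A = 0.002                # modulation depth per bin (illustrative)
base = 0.05              # baseline spike probability per bin (~50 Hz)
n_trials = 1000
n = np.arange(D)
m = np.sin(2 * np.pi * n / D)            # matched filter, eq. 6.9

def trial(with_stim):
    # Generic Bernoulli spiker: the driven rate is modulated in phase with
    # the filter, mimicking the slope-following sinusoidal response
    p = base + (A * m if with_stim else 0.0)
    return (rng.random(D) < p).astype(int)

z_stim = np.array([m @ trial(True) for _ in range(n_trials)])    # eq. 6.10
z_null = np.array([m @ trial(False) for _ in range(n_trials)])

# Sweeping a decision threshold over z traces out the trade-off between
# detection probability and false alarm probability (the ROC curve)
thresholds = np.linspace(-15, 15, 201)
Pfa = np.array([(z_null > th).mean() for th in thresholds])
Pd = np.array([(z_stim > th).mean() for th in thresholds])
```

The stimulus shifts the mean of z without changing the total spike count, which is why the filter output, rather than the count, serves as the test statistic.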
Figures 8B1 and 8B2 show distributions of the matched filter output for the two models, in both the presence and absence of the stimulus. For both models, the matched filter output has a mean near zero when no stimulus is present and a mean of approximately 1.5 when there is a stimulus. Although the shift in the mean is approximately the same for both models, the width of the distribution is significantly narrower for the adaptive threshold model (s.d. ≈ 0.6) than for the random threshold model (s.d. ≈ 3.4). This difference in variability has a significant impact on weak signal detectability. The output of the matched filter z_i can be used as a test statistic for binary hypothesis testing, in which the goal is to decide on a trial-by-trial basis whether a stimulus has occurred based on the value of z_i for that trial. In this simple case, the problem can be handled using the classical Neyman-Pearson approach (Kay, 1998). For each trial i, the filter output z_i is compared with a threshold value z_thresh. If the filter output is greater than the threshold value, the detector classifies the trial as a stimulus trial. Depending on the threshold level that is selected, there will be some detection probability P_d of correctly classifying a trial that contained a stimulus as a stimulus trial and some false alarm probability P_fa of misclassifying a trial without
[Figure 8 appears here. Panels A1, A2: stimulus waveform, dot raster displays, and poststimulus rate histograms (rate in Hz, roughly 40–60 spikes/s). Panels B1, B2: histograms of matched filter output z_i for stimulus and no-stimulus trials (number of trials versus z_i). Panels C1, C2: detection probability versus false alarm probability (ROC curves).]
Figure 8: Comparison of signal detectability for the linear adaptive threshold (A1–C1) and random threshold (A2–C2) models. The upper panels (A1, A2) show the stimulus waveform (arbitrary units), dot raster displays of representative spike activity, and poststimulus rate histograms computed by averaging spike activity over 1000 stimulus trials. A solid white line shows a sinusoidal fit to the response. The middle panels (B1, B2) show histograms of the matched filter output for trials with and without a stimulus. The bottom panels (C1, C2) illustrate the dramatic improvement in detectability for signals encoded by the adaptive threshold model relative to the random threshold model, as measured by the ROC curves. The dashed lines indicate chance-level performance.
a stimulus as a stimulus trial. If the threshold value is moved lower to improve detection efficiency, the false alarm probability also increases. This trade-off between detection probability and false alarm probability can be summarized by the receiver operating characteristic (ROC) of the detector, which is a parametric plot of P_d versus P_fa as a function of the threshold z_thresh. The ROC plots for the two neuron models are shown in Figures 8C1 and 8C2. The ability to detect reliably the presence of the stimulus is much better for signals encoded by the adaptive threshold model. For example, if the threshold is set at a level corresponding to a false alarm probability of 10%, the probability of detecting the stimulus is 90% in spike trains arising from the adaptive threshold model but only 19% in spike trains from the random threshold model.

7 Conclusion
Spike trains that appear irregular on short timescales can exhibit longer-term regularity in their firing pattern. This regularity arises from the correlation structure of the ISI sequence and involves memory effects spanning multiple interspike intervals (Ratnam & Nelson, 2000). This form of long-term spike train regularization can arise from the refractory effects associated with a dynamic spike threshold (Chacron et al., 2001). The functional relevance of spike train regularity is supported by our experimental data on prey capture behavior of weakly electric fish. In our analysis of electrosensory afferents (Ratnam & Nelson, 2000), we found that spike train regularity was most pronounced on timescales of about 40 interspike intervals, which corresponds to a time period of about 175 msec. This timescale is well matched to the relevant timescales for prey capture behavior in these animals (Nelson & MacIver, 1999; MacIver, Sharabash, & Nelson, 2001). The timescale approximately matches the duration that the electrosensory image of a small prey would activate a single electrosensory afferent fiber. We speculate that spike train regularization on the timescale of tens to hundreds of milliseconds may play a key role in enhancing the detectability of natural sensory signals, not just in the electrosensory system but in other systems as well. Regularizing effects, although not as pronounced, have been observed on similar timescales in auditory afferents (Lowen & Teich, 1992). Whether such effects exist in other systems is largely unknown because the appropriate analyses of multiscale spike train variability have not been carried out. The effects of spike train regularization can be most readily observed in experimental data by analyzing the variance-to-mean ratio of the kth-order interval distributions I_k. For a renewal process, which lacks correlations in the interval sequence, the variance-to-mean ratio is constant for all interval orders k.
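The diagnostic described above can be sketched in a few lines; the gamma-distributed ISI surrogate below is an illustrative stand-in for experimental spike times.

```python
import numpy as np

def kth_order_var_to_mean(spike_times, ks):
    """Variance-to-mean ratio of the kth-order interval distribution I_k:
    intervals between each spike and the kth spike after it."""
    out = {}
    for k in ks:
        ik = spike_times[k:] - spike_times[:-k]
        out[k] = ik.var() / ik.mean()
    return out

# For a renewal process (i.i.d. ISIs), both the mean and the variance of I_k
# grow linearly with k, so the ratio should stay flat across interval orders:
rng = np.random.default_rng(2)
times = np.cumsum(rng.gamma(2.0, 10.0, size=20_000))   # surrogate spike times
ratios = kth_order_var_to_mean(times, [1, 10, 100])
```

A regularizing spike train would instead show the ratio falling as k grows, which is the signature reported for the electrosensory afferent data.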
A decrease in the variance-to-mean ratio with increasing k indicates a regularizing effect, whereas an increase indicates that the spike train is becoming more irregular. Asymptotically, similar relationships hold for the analysis of spike count distributions, where the variance-to-mean ratio is
referred to as the Fano factor (Fano, 1947; Gabbiani & Koch, 1998). However, because the Fano factor decreases initially even for a renewal process, the effects of intermediate-term spike train regularization can be overlooked in a Fano factor analysis. Therefore, we recommend the analysis of kth-order interval distributions as the best approach for characterizing spike train variability on multiple timescales. We have presented a simple model, derived from a more detailed model by Chacron et al. (2000, 2001), that exhibits long-term spike train regularization arising from refractory effects associated with a dynamic spike threshold. Memory effects associated with the threshold dynamics give rise to negative correlations in the ISI sequence; hence, this is a nonrenewal model of spike generation. Many common neural models, including those based on integrate-and-fire dynamics or inhomogeneous Poisson processes, do not produce correlations in the ISI sequence and hence are classified as renewal models. Recent models of electrosensory afferent dynamics, including our own, fall into the category of renewal process models (Nelson et al., 1997; Kreiman, Krahe, Metzner, Koch, & Gabbiani, 2000). While such renewal models can accurately match the mean and CV of the first-order ISI distribution, as well as the frequency response characteristics of the experimental data, their failure to generate longer-term spike train regularization may make them unsuitable for applications in which it is important to estimate accurately detection thresholds or coding efficiency for weak sensory stimuli. Given that refractory effects are commonplace in neural systems, we suspect that this form of spike train regularization may be more widespread than previously appreciated. Hence, nonrenewal models, such as the one presented here, may have broad applicability when modeling the encoding of weak signals in neuronal spike trains.

Acknowledgments
This research was supported by grants from the National Science Foundation (IBN-0078206) and the National Institute of Mental Health (R01MH49242).

References

Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18, 2200–2211.
Calvin, W. H., & Stevens, C. F. (1968). Synaptic noise and other sources of randomness in motoneuron interspike intervals. J. Neurophysiol., 31, 574–587.
Chacron, M. J., Longtin, A., & Maler, L. (2001). Negative interspike interval correlations increase the neuronal capacity for encoding time-dependent stimuli. J. Neurosci., 21, 5328–5343.
Chacron, M. J., Longtin, A., St.-Hilaire, M., & Maler, L. (2000). Suprathreshold stochastic firing dynamics with memory in P-type electroreceptors. Phys. Rev. Lett., 85, 1576–1579.
Cox, D. R. (1962). Renewal theory. London: Methuen.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. London: Methuen.
Fano, U. (1947). Ionization yield of radiations. II. The fluctuations of the number of ions. Phys. Rev., 72, 26–29.
Gabbiani, F., & Koch, C. (1996). Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Comp., 8, 44–66.
Gabbiani, F., & Koch, C. (1998). Principles of spike train analysis. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (pp. 313–360). Cambridge, MA: MIT Press.
Johnson, D. H. (1996). Point process models of single-neuron discharges. J. Comput. Neurosci., 3, 275–299.
Johnson, D. H., Tsuchitani, C., Linebarger, D. A., & Johnson, M. J. (1986). Application of a point process model to responses of cat lateral superior olive units to ipsilateral tones. Hear. Res., 21, 135–159.
Kay, S. M. (1998). Fundamentals of statistical signal processing, Volume II: Detection theory. Englewood Cliffs, NJ: Prentice Hall.
Kreiman, G., Krahe, R., Metzner, W., Koch, C., & Gabbiani, F. (2000). Robustness and variability of neuronal coding by amplitude-sensitive afferents in the weakly electric fish Eigenmannia. J. Neurophysiol., 84, 189–224.
Kuffler, S. W., Fitzhugh, R., & Barlow, H. B. (1957). Maintained activity in the cat's retina in light and darkness. J. Gen. Physiol., 40, 683–702.
Lowen, S. B., & Teich, M. C. (1992). Auditory-nerve action potentials form a nonrenewal process over short as well as long time scales. J. Acoust. Soc. Am., 92, 803–806.
MacIver, M. A., Sharabash, N. M., & Nelson, M. E. (2001). Prey-capture behavior in gymnotid electric fish: Motion analysis and effects of water conductivity. J. Exp. Biol., 204, 534–557.
Moortgat, K. T., Keller, C. H., Bullock, T. H., & Sejnowski, T. J. (1998). Submicrosecond pacemaker precision is behaviorally modulated: The gymnotiform electromotor pathway. Proc. Natl. Acad. Sci., 95, 4684–4689.
Nelson, M. E., & MacIver, M. A. (1999). Prey capture in the weakly electric fish Apteronotus albifrons: Sensory acquisition strategies and electrosensory consequences. J. Exp. Biol., 202, 1195–1203.
Nelson, M. E., Xu, Z., & Payne, J. R. (1997). Characterization and modeling of P-type electrosensory afferent responses to amplitude modulations in a wave-type electric fish. J. Comp. Physiol. A, 181, 532–544.
Ratnam, R., & Nelson, M. E. (2000). Non-renewal statistics of electrosensory afferent spike trains: Implications for the detection of weak sensory signals. J. Neurosci., 20, 6672–6683.
Scheich, H., Bullock, T. H., & Hamstra, R. H., Jr. (1973). Coding properties of two classes of afferent nerve fibers: High-frequency electroreceptors in the electric fish Eigenmannia. J. Neurophysiol., 36, 39–60.
Stein, R. B. (1967). The frequency of nerve action potentials generated by applied currents. Proc. Roy. Soc. Lond., B167, 64–86.

Received May 25, 2001; accepted October 31, 2001.
LETTER
Communicated by Jonathan Victor
Spatiotemporal Spike Encoding of a Continuous External Signal Naoki Masuda
[email protected] Department of Mathematical Engineering and Information Physics, Graduate School of Engineering, University of Tokyo, Tokyo, Japan Kazuyuki Aihara
[email protected] Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo, Japan

Interspike intervals of spikes emitted from an integrator neuron model of sensory neurons can encode input information represented as a continuous signal from a deterministic system. If a real brain uses spike timing as a means of information processing, other neurons receiving spatiotemporal spikes from such sensory neurons must also be capable of treating information included in deterministic interspike intervals. In this article, we examine functions of neurons modeling cortical neurons receiving spatiotemporal spikes from many sensory neurons. We show that such neuron models can encode stimulus information passed from the sensory model neurons in the form of interspike intervals. Each sensory neuron connected to the cortical neuron contributes equally to the information collection by the cortical neuron. Although the incident spike train to the cortical neuron is a superimposition of spike trains from many sensory neurons, it need not be decomposed into spike trains according to the input neurons. These results are also preserved for generalizations of the sensory neurons such as a small amount of leak, noise, inhomogeneity in firing rates, or biases introduced in the phase distributions.

1 Introduction
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1599–1628 (2002)

Information processing mechanisms in the brain have been disputed for a long time. The firing rate is understood to represent information in the external stimuli (Tuckwell, 1988; Koch, 1999). However, rate coding works too slowly to explain some quick functions that occur in the brain, because the rate calculation involves temporal averaging of spike signals. Consequently, the brain may also make use of precise spike timing in information processing. Temporal spike coding enables the brain to treat information at increased speed with decreased energy. In addition, precisely timed reproducible spiking in response to dynamic stimuli has been reported with a precision of milliseconds in rat neocortex (Mainen & Sejnowski, 1995), the fly visual system (van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997), and the cat lateral geniculate nucleus of the thalamus (LGN) (Warzecha & Egelhaaf, 1999; Reinagel & Reid, 2000). Related mechanisms that enable neural networks to enhance firing time precision, such as those of inhibitory connected integrate-and-fire (IF) neurons (Tiesinga & Sejnowski, 2001; Tiesinga, Fellous, José, & Sejnowski, 2001), have also been explored. Theoretical computation models that use spike timing in neural networks have also been proposed (Judd & Aihara, 1993, 2000; Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996). Brains may be able to select between strategies and choose the one that is site and situation appropriate (Riehle, Grün, Diesmann, & Aertsen, 1997). Firing time and interspike intervals (ISIs) can, at least theoretically, play important roles in information encoding. The usefulness of ISI time series for information processing was shown by Sauer (1994). It is possible to reconstruct the attractor of the external stimuli from the ISI time series of an IF neuron when the neuron receives continuous chaotic signals (Sauer, 1994, 1997). Sauer used a simple IF neuron that possessed an internal state x(t) equal to the perfect integration of the external input S(t) with a positive bias until it fires. Once firing occurs, x(t) is reset to the resting potential to restart the integration. The chaotic input S(t) is generated from the Rössler equations (Rössler, 1976). We assume the firing threshold and the resting potential to be x = 1 and x = 0, respectively. The ith firing time T_i is determined from the (i−1)th firing time T_{i−1} and S(t) by
x(T_i) = ∫_{T_{i−1}}^{T_i} S(t) dt = 1.   (1.1)
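The firing rule in equation 1.1 is easy to simulate. The sketch below uses a simple slowly varying positive input in place of the Rössler signal; the signal, step size, and duration are illustrative assumptions.

```python
import numpy as np

def if_firing_times(S, dt, t_end):
    """Firing times of a perfect integrator (eq. 1.1): a spike is emitted
    whenever the running integral of S(t) since the last spike reaches 1."""
    times, x, t = [], 0.0, 0.0
    while t < t_end:
        x += S(t) * dt                   # accumulate the input
        if x >= 1.0:
            times.append(t)
            x = 0.0                      # reset to the resting potential
        t += dt
    return np.array(times)

# Illustrative slowly varying positive input (not the Rossler signal of the text)
S = lambda t: 0.05 + 0.02 * np.sin(0.01 * t)
T = if_firing_times(S, dt=0.1, t_end=5000.0)
isi = np.diff(T)
# Each ISI approximates the reciprocal of S averaged over that interval,
# so slow modulations of the input are readable from the ISI sequence.
```

Stacking consecutive ISIs into delay vectors (t_i, t_{i+1}, ...) is then the starting point for the attractor reconstruction discussed next.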
The strange attractor generating S(t) can be reconstructed from the delay coordinates with the ISI time series {t_i; t_i = T_i − T_{i−1}} (Sauer, 1994). It is also possible to predict {t_i} with the local prediction models for chaotic time series (Sugihara & May, 1990). Sauer (1997) showed that the generic IF neuron maintains a one-to-one correspondence between the system states and the ISI vectors of a sufficiently large dimension. This guarantees attractor reconstruction from the ISI time series. The dynamical characteristics of the original attractors, such as the correlation dimension, can also be calculated from the ISI time series (Castro & Sauer, 1997b). The firing threshold corresponds to a Poincaré section (Suzuki, Aihara, Murakami, & Shimozawa, 2000). The attractor can be reconstructed from ISIs since an ISI represents 1/S(t) averaged over the ISI (Racicot & Longtin, 1997). An increased firing threshold implies averaging of 1/S(t) over longer ISIs, so the ISI reconstruction degrades (Sauer, 1994; Racicot & Longtin, 1997). ISI reconstruction methods have also been studied for other types of neurons (Racicot & Longtin, 1997; Castro & Sauer, 1999) and in experiments with the CA1 region of the rat hippocampus (Schiff, Jerger, Chang, Sauer, & Aitken, 1994), rat cutaneous afferents of hairy thigh skin (Richardson, Imhoff, Grigg, & Collins, 1998), and cricket sensory neurons (Suzuki et al., 2000).

The continuous signal S(t) in these studies can be interpreted to model dynamically changing external stimuli such as visual scenes and sounds. Consequently, the focus has been on ISI encoding of sensory neurons. The next logical step is to study the meaning of ISIs for the cortical neurons that receive spatiotemporal spike inputs with deterministic information from sensory neurons. Experimental results also suggest that spike timing plays an important role in nonsensory neurons such as the mushroom body and β-lobe neurons that receive spikes from other neurons (MacLeod, Bäcker, & Laurent, 1998). Experimental results on the synchronization of neurons (Eckhorn et al., 1988; Gray, König, Engel, & Singer, 1989; Laurent & Davidowitz, 1994; Vaadia et al., 1995; Riehle et al., 1997; Stopfer, Bhagavan, Smith, & Laurent, 1997; Stopfer & Laurent, 1999; Brecht, Singer, & Engel, 1999; Rodriguez et al., 1999) are also relevant to the importance of spike timing. Accordingly, it is worth examining attractor reconstruction with the ISI time series generated by spike-driven neurons to gain insight into information transmission and processing of the cortical neurons rather than the sensory neurons. The simplest way to approach this problem is to approximate spike inputs by continuous signals. However, the validity of this generalization is nontrivial for the following reason. Cortical neurons receive spikes from many sensory neurons and feedback spikes from cortical neurons. These neurons generally must receive many input spikes to fire, depending on nerve membrane leakage, firing rates of incident spike trains, and many other variables. The input spikes arrive from many other neurons that emit different spike trains. It is not at all clear whether the superimposition of these different spike trains can be replaced approximately by a single continuous signal.

The incident spike train to a cortical neuron includes superimposed spike trains arriving from many sensory neurons. Even if the ISI of each input neuron contains deterministic information about stimuli, the ISI of the superimposed incident spike trains does not necessarily preserve the deterministic information, because during the ISI of an input neuron there generally exist spikes from other input neurons (Ichinose & Aihara, 1998; Isaka, Yamada, Takahashi, & Aihara, 2001). Moreover, it is mathematically quite difficult for a cortical neuron to decompose incident spikes according to their source neurons. This spike superimposition problem is specific to spike-driven neurons. Real neurons seem to treat the stimulus information without spike decomposition. In light of these observations, we explore attractor reconstruction by ISI time series of a single cortical neuron receiving spatiotemporal spike inputs from many sensory neurons. Sensory neurons are modeled by the IF neurons (Sauer, 1994, 1997) for the sake of simplicity, but we also discuss some generalizations. For postsynaptic neurons, we use the IF neurons, the leaky integrate-and-fire (LIF) neurons (Tuckwell, 1988; Racicot & Longtin, 1997; Castro & Sauer, 1999), and the FitzHugh-Nagumo (FHN) neurons
(FitzHugh, 1961; Nagumo, Arimoto, & Yoshizawa, 1962; Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1999). We show that the original attractor can be reconstructed from the ISIs in the cases of the IF and LIF spike-driven neurons without assuming strong synapses or decomposing superimposed spike trains. Consequently, we see that the spike-driven neuron collects small contributions from many sensory neurons to gain the stimulus information. The FHN neuron differs intrinsically from the other two models and cannot well preserve deterministic information, as has been shown in the continuous-input models (Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1999). We consider spike effects on excitable neurons under ordinary conditions; in so doing, we focus exclusively on the excitable neuron models and do not consider the oscillatory neuron models, which have been widely studied. We cover ISI reconstruction of single neurons driven by spatiotemporal spikes from many sensory model neurons in section 2. We use the IF, LIF, and FHN neurons. The mechanisms underlying the ISI reconstruction by these neurons are discussed in section 3. We also compare information processing of model neurons receiving spatiotemporal spike inputs with that of continuous-input neurons.

2 Interspike Interval Reconstruction by Spike-Driven Neurons

2.1 Sensory Neurons. Throughout this article, we study two-layered neural networks. For simplicity, we call the neurons in the first layer sensory neurons and those in the second layer cortical neurons. The sensory neurons are assumed to be IF neurons receiving a common continuous input S(t), which represents external stimuli. We generate S(t) from the X variable of the Rössler attractor (Sauer, 1994, 1997; Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1999). The Rössler equations (Rössler, 1976) are represented by

dX/dt = a(−Y − Z),
dY/dt = a(X + 0.36Y),   (2.1)
dZ/dt = a(0.4 + Z(X − 4.5)),
and the parameters are chosen so that the Rössler system generates a typical chaotic signal. The velocity of points on orbits is modulated by a, and a = 34.48. In this situation, the autocorrelation of the Rössler system decays to a small enough value in 5 to 10 times the usual ISI duration. Accordingly, we can focus on information processing of neurons over this timescale. If we chose a larger value for a in equation 2.1, we would concern ourselves with shorter timescales. The external stimulus S(t) is defined as follows:

S(t) = 2.7 × 10^−4 + 2.2 × 10^−5 X.   (2.2)
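Under the stated parameters, S(t) can be generated numerically; the following sketch uses simple Euler steps, with the step size and initial condition as illustrative assumptions (a production run would use a higher-order integrator).

```python
import numpy as np

def rossler_S(n_steps, dt=2e-4, a=34.48):
    """Euler integration of the Rossler system (eq. 2.1), mapped to the
    stimulus S(t) via eq. 2.2. Step size and initial condition are
    illustrative assumptions."""
    X, Y, Z = 1.0, 1.0, 1.0
    S = np.empty(n_steps)
    for k in range(n_steps):
        dX = a * (-Y - Z)
        dY = a * (X + 0.36 * Y)
        dZ = a * (0.4 + Z * (X - 4.5))
        X, Y, Z = X + dX * dt, Y + dY * dt, Z + dZ * dt
        S[k] = 2.7e-4 + 2.2e-5 * X       # eq. 2.2
    return S

S = rossler_S(200_000)   # fluctuates chaotically, staying positive almost always
```

The small coefficients in equation 2.2 keep the stimulus near a few times 10^−4, which sets the sensory ISIs (threshold 1) at a few thousand time units.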
We chose the parameters in equation 2.2 so that S(t) > 0 almost always; S(t) may represent external stimuli that do not take negative values. Biologically, sensory inputs are rarely rigorously chaotic. We use chaotic S(t) because we are testing short-term information coding rather than classical rate coding, which requires a long time to obtain firing rates by averaging fluctuations. The prediction errors are significantly small only for a short prediction period because of the exponential information decay of chaotic signals. We are interested in information processing performed on the timescale over which prediction is possible. As will be shown later, the prediction errors of the ISI time series can be used to measure the performance of the external signal reconstruction. We think that our results can be extended to situations with more general external signals such as filtered gaussian noise. Examination of these situations will involve other measures for performance evaluation, such as correlation functions between external signals and estimated signals. We assume that there are n_1 sensory neurons. We denote by T_{i,j} (1 ≤ i ≤ n_1) the jth firing time of the ith neuron. By setting the threshold and the resting potential to 1 and 0, respectively, the dynamics of the membrane potential x_i(t) of the ith sensory neuron between the (j−1)th firing and the jth firing is described by

x_i(t) = ∫_{T_{i,j−1}}^{t} S(t) dt.   (2.3)

As shown in equation 1.1, we know from equation 2.3 that T_{i,j} (j ≥ 2) is determined from T_{i,j−1} and S(t) by

∫_{T_{i,j−1}}^{T_{i,j}} S(t) dt = 1,   (j ≥ 2).
The jth ISI of the ith neuron is defined by t_{i,j} = T_{i,j} − T_{i,j−1}, where T_{i,0} ≡ 0. We do not impose any further assumptions on the sensory neurons; the initial phases x_i(0) (1 ≤ i ≤ n_1) are chosen randomly:

x_i(0) = θ_i,   (1 ≤ i ≤ n_1),   (2.4)

where θ_i is randomly chosen from the uniform distribution on [0, 1]. With equation 2.4, we observe that the initial firing time T_{i,1} is given by

∫_0^{T_{i,1}} S(t) dt = 1 − θ_i.
The randomness in the initial phases will turn out to be important for information processing in terms of the attractor reconstruction, and the result indicates that noise plays a positive role in the brain; noise drives the phase distribution to become approximately uniform so that the sensory layer can collect versatile information about S(t) as a whole. This noise effect differs from the effects of deterministic chaos and stochastic resonance.

Figure 1: Network architecture when there is only one cortical neuron. S(t) denotes the continuous external stimulus, x_i denotes the membrane potential of the ith sensory neuron, and v denotes the membrane potential of the cortical neuron.

We consider information coding by a single cortical neuron in this article. The network architecture is schematically shown in Figure 1. The cortical neuron has a membrane potential denoted by v(t), and it receives spatiotemporal spikes from n_1 sensory neurons. The ISI of the spike train out of each sensory neuron is assumed to be deterministic. However, the ISI of the input spikes to the cortical neuron is no longer deterministic because of spike superimposition. In response to the superimposed spike inputs, the cortical neuron emits a spike train, the determinism of which is discussed below. We examine three models of cortical neurons: the IF, LIF, and FHN neurons.

2.2 The Integrate-and-Fire Neuron. Here, we investigate the dynamics of the IF cortical neuron. The membrane potential v(t) of the cortical neuron changes by the same mechanism observed for x_i(t); only the input differs. The cortical neuron is assumed to receive instantaneous spikes with homogeneous amplitude ε̄ at times T_{i,j}. Consequently, its dynamics is described by

dv/dt = Σ_{i,j} ε̄ δ(t − T_{i,j}),   (2.5)
where δ is the delta function defined by δ(0) = 1 and δ(t) = 0 for t ≠ 0. We denote by T′_k the kth firing time of the cortical neuron. The dynamics of v(t) (t ∈ [T′_{k−1}, T′_k]) is represented by

v(t) = ∫_{T′_{k−1}}^{t} Σ_{i,j: T′_{k−1} ≤ T_{i,j} ≤ t} ε̄ δ(t − T_{i,j}) dt.

The next firing time T′_k is the instant when v(t) reaches 1, and the ISI of the output spikes from the cortical neuron is defined by

t′_k = T′_k − T′_{k−1},   (k ≥ 1),

where T′_0 ≡ 0. To examine the property of the superimposed spike inputs, we suppose that 1 > θ_1 ≥ θ_2 ≥ ··· ≥ θ_{n_1} ≥ 0 without losing generality. It follows that

0 < T_{1,1} ≤ T_{2,1} ≤ ··· ≤ T_{n_1,1} < T_{1,2} ≤ T_{2,2} ≤ ···,   (2.6)
since

$$\begin{aligned}
\int_{T_{i,j}}^{T_{i+1,j}} S(t)\, dt
&= \int_{T_{i,j-1}}^{T_{i+1,j-1}} S(t)\, dt + \int_{T_{i+1,j-1}}^{T_{i+1,j}} S(t)\, dt - \int_{T_{i,j-1}}^{T_{i,j}} S(t)\, dt \\
&= \int_{T_{i,j-1}}^{T_{i+1,j-1}} S(t)\, dt + 1 - 1 \\
&= \cdots = \int_{T_{i,1}}^{T_{i+1,1}} S(t)\, dt = \theta_i - \theta_{i+1} \ge 0, \qquad (1 \le i \le n_1 - 1),
\end{aligned} \tag{2.7}$$
and

$$\int_{T_{n_1,j}}^{T_{1,j+1}} S(t)\, dt = \int_{T_{n_1,1}}^{T_{1,2}} S(t)\, dt = \int_{T_{1,1}}^{T_{1,2}} S(t)\, dt - \int_{T_{1,1}}^{T_{n_1,1}} S(t)\, dt = 1 - (\theta_1 - \theta_{n_1}) > 0. \tag{2.8}$$
Naoki Masuda and Kazuyuki Aihara

We denote by $\bar{T}_j$ and $\bar{t}_j = \bar{T}_j - \bar{T}_{j-1}$ the time of the $j$th input spike to the cortical neuron and its ISI, respectively. With equation 2.6, $\bar{T}_j$ and $\bar{t}_j$ are represented by $\bar{T}_{i+n_1(j-1)} = T_{i,j}$ and
$$\bar{t}_{i+n_1(j-1)} =
\begin{cases}
T_{1,j} - T_{n_1,j-1}, & (i = 1), \\
T_{i,j} - T_{i-1,j}, & (2 \le i \le n_1),
\end{cases}$$
where we assume the delay from the sensory neurons to the cortical neuron to be 0 or, equivalently, homogeneous. In spite of the determinism of the ISIs $\{t_{i,j} : j \in \mathbb{N}\}$ for $1 \le i \le n_1$, the time series $\{\bar{t}_j\}$ is not deterministic because of the random choice of $\theta_i$. We treat the information in $\{\bar{T}_j\}$ and $\{\bar{t}_j\}$ without decomposing the superimposed spikes into groups corresponding to the deterministic spike sources. We see from equation 2.5 that the cortical neuron can fire only at an instant when it receives a spike. Given $T'_{k-1} = \bar{T}_{j_0}$, $T'_k$ is determined by

$$v(T'_k) = \int_{T'_{k-1}}^{T'_k} \sum_j \epsilon_N \,\delta(t - \bar{T}_j)\, dt = 1.$$
Accordingly, $T'_k = \bar{T}_{j_0+N}$, where $N$ is the number of input spikes necessary for the cortical neuron to fire. We can assume without loss of generality that $\epsilon_N$ satisfies

$$N = \frac{1}{\epsilon_N}.$$
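The interleaved structure of the superimposed train given above can be verified directly on a toy case. Assuming, purely for illustration, a constant stimulus $S \equiv 1$ (so that neuron $i$ fires at $T_{i,j} = j - \theta_i$), the merged train and its ISIs satisfy the two cases of the formula for $\bar{t}_j$:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, m = 5, 4
theta = np.sort(rng.random(n1))[::-1]   # 1 > theta_1 >= ... >= theta_{n1} >= 0

# With S(t) = 1, sensory neuron i fires at T_{i,j} = j - theta_i
# (stored 0-based: row i-1, column j-1, for j = 1, ..., m).
T = np.array([[j - th for j in range(1, m + 1)] for th in theta])

Tbar = np.sort(T.ravel())               # merged (superimposed) spike train
tbar = np.diff(Tbar)                    # its ISIs
```

The assertions below check $\bar{T}_{i+n_1(j-1)} = T_{i,j}$, $\bar{t}_{i+n_1(j-1)} = T_{1,j} - T_{n_1,j-1}$ for $i = 1$, and $\bar{t}_{i+n_1(j-1)} = T_{i,j} - T_{i-1,j}$ for $2 \le i \le n_1$.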
Next, we define

$$\Delta_j = \int_{\bar{T}_j}^{\bar{T}_{j+1}} S(t)\, dt > 0. \tag{2.9}$$
Using equations 2.7 and 2.8, we have

$$\sum_{j=j_0}^{j_0+n_1-1} \Delta_j = 1, \qquad (\forall j_0 \ge 1).$$
$T'_k$ is determined from $T'_{k-1}$ and $S(t)$ by

$$\int_{T'_{k-1}}^{T'_k} S(t)\, dt = \int_{\bar{T}_{j_0}}^{\bar{T}_{j_0+N}} S(t)\, dt = \sum_{j=j_0}^{j_0+N-1} \Delta_j. \tag{2.10}$$
The distribution of $\sum_{j=j_0}^{j_0+N-1} \Delta_j$ can be determined from the uniform distribution of $\theta_i$. Using equations 2.7 through 2.9, we can assume that

$$\sum_{j=j_0}^{j_0+N-1} \Delta_j = \theta_1 - \theta_{N'+1} + \frac{N - N'}{n_1}, \tag{2.11}$$
where $N' = N \bmod n_1$. Since $\theta_1, \theta_2, \ldots, \theta_{n_1}$ are the reordered series of random variables selected from the uniform distribution restricted to $[0, 1]$, we have

$$\mathrm{Prob}(\theta_1 - \theta_{N'+1} \in [x, x+dx]) = (n_1 - 1)\, {}_{n_1-2}C_{N'-1}\, x^{N'-1} (1-x)^{n_1-N'-1}\, dx, \qquad (1 \le N' < n_1), \tag{2.12}$$
where ${}_{n_1-2}C_{N'-1}$ denotes the binomial coefficient. Equations 2.11 and 2.12 result in

$$E\!\left(\sum_{j=j_0}^{j_0+N-1} \Delta_j\right) = \frac{N}{n_1} \tag{2.13}$$

and

$$V\!\left(\sum_{j=j_0}^{j_0+N-1} \Delta_j\right) = \frac{N'(n_1 - N')}{n_1^2 (n_1 + 1)}, \tag{2.14}$$
where $E(\cdot)$ and $V(\cdot)$ are the mean and variance of a random variable, respectively. We note that equations 2.13 and 2.14 do not depend on $j_0$. Equations 2.10 and 2.13 indicate that the cortical neuron works as if it were the IF neuron driven by the continuous input $S(t)$ with the threshold equal to $N/n_1$. However, we must also note that the ensemble average with respect to realization is not generally equal to the time average because the ergodicity does not hold once the initial phases are fixed. Strictly speaking, the thresholds are equal to $N/n_1$ only when there is an equidistant distribution of initial phases $x_i(0)$ on $[0, 1]$. If we introduce inhomogeneity of firing rates so that the firing-rate deviations among neurons are incommensurable with each other, the ergodicity will hold. Additive noises in the sensory neurons also realize ergodicity. Under these conditions, equation 2.13 is valid for every realization with $E(\cdot)$ interpreted as a time average. Equation 2.14 suggests that this approximation holds with an error proportional to $n_1^{-1}$. This error is regarded as temporal deviation in ergodic cases and as deviation of realizations in nonergodic cases. The cortical neuron is expected to transmit deterministic information passed through spatiotemporal spikes from sensory neurons in most cases.
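Equations 2.13 and 2.14 are easy to probe by Monte Carlo sampling of the ordered phases (a sketch; sample sizes are arbitrary). The empirical mean and variance agree with the predictions up to the small $O(1/n_1)$ end corrections that the circular-spacing assumption behind equation 2.11 neglects:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, N = 160, 143
Np = N % n1                              # N' in the text (here N' = N)
M = 20000                                # number of phase realizations

# sort each realization in decreasing order: theta_1 >= ... >= theta_{n1}
theta = -np.sort(-rng.random((M, n1)), axis=1)

# equation 2.11 with j0 = 1: sum of N consecutive Delta_j
sums = theta[:, 0] - theta[:, Np] + (N - Np) / n1

mean_pred = N / n1                                   # equation 2.13
var_pred = Np * (n1 - Np) / (n1**2 * (n1 + 1))       # equation 2.14
```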
The variation represented in equation 2.14 may cause degradation of the attractor reconstruction. The stochastic fluctuation in $\Delta_j$ results in inconstant firing thresholds. However, this effect is small in biologically plausible cases where $n_1 \simeq 20000$ and $N' = N \simeq 200$ (Koch, 1999). Each sensory neuron roughly encodes information about the inverse signal amplitude $1/S(t)$ by emitting spikes at the proper times. The cortical neuron collects these pieces of signal information to realize the attractor reconstruction. Superimposition of the incident spike trains received by the cortical neuron is not an obstacle for signal reconstruction. The random initial phase condition is essential since it enables the sensory neurons to collect various aspects of the information signal by firing spikes at different instants. If all the sensory neurons share an identical initial condition, they always fire simultaneously when noise is absent. Thus, the model reduces to the case of a single sensory neuron with the spike amplitude modified to $n_1 \epsilon_N$. In this case, the ISI reconstruction degrades seriously since the $n_1$ sensory neurons simultaneously sample the external signal information. Accordingly, the sampling intervals of the sensory neuron layer are large, and the cortical neuron receives only a restricted aspect of the signal information. Biologically speaking, the information contained in $\theta_i$ gradually degrades due to other noise sources. However, the phases $x_i(t)$ ($1 \le i \le n_1$) of sensory neurons may remain random in noisy environments. The noise plays a positive role here. To evaluate how deterministic the ISI time series $\{t'_k\}$ is, we follow the methods usually used in nonlinear analysis of ISI time series (Sauer, 1994, 1997; Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1997b, 1999; Suzuki et al., 2000). We examine the predictability of the time series by the local prediction algorithm (Sugihara & May, 1990).
In this algorithm, we transform $\{t'_k\}$ with $d$-dimensional delay coordinates to obtain the points $(t'_k, t'_{k-1}, \ldots, t'_{k-d+1})$ in a $d$-dimensional reconstruction space (Sauer, 1994). To perform $h$-step prediction of $t'_{k_0}$, we search for the $l_0$ nearest neighbors $(t'_{k_l}, t'_{k_l-1}, \ldots, t'_{k_l-d+1})$, $1 \le l \le l_0$, of $(t'_{k_0}, t'_{k_0-1}, \ldots, t'_{k_0-d+1})$. The prediction $\hat{t}'_{k_0+h}$ is then determined by

$$\hat{t}'_{k_0+h} = \frac{1}{l_0} \sum_{l=1}^{l_0} t'_{k_l+h}.$$
The effectiveness of this algorithm for $\{t'_k\}$ is evaluated by the normalized prediction error (NPE), defined by

$$\mathrm{NPE}(h) = \frac{\langle (\hat{t}'_{k_0+h} - t'_{k_0+h})^2 \rangle^{1/2}}{\langle (\mu - t'_{k_0+h})^2 \rangle^{1/2}},$$

where $\mu$ is the mean of $\{t'_k\}$. The denominator of $\mathrm{NPE}(h)$ is the expected prediction error for predicting by the average. Accordingly, $\mathrm{NPE}(h)$ is close to 1 when the prediction is not accurate, for example, when predicting chaotic
time series with large fluctuations by just the average. However, more accurate prediction of predictable time series, including deterministic series, results in smaller $\mathrm{NPE}(h)$ values for sufficiently small values of $h$. Since stochastic time series with correlation can also result in a small $\mathrm{NPE}(h)$, we also calculate the $\mathrm{NPE}(h)$ of surrogate data (Theiler, Eubank, Longtin, Galdrikian, & Farmer, 1992; Chang, Schiff, Sauer, Gossard, & Burke, 1994) to distinguish the predictability owing to determinism from that of stochastic time series. In the surrogate data method, we compare the $\mathrm{NPE}(h)$ of the original time series to that of stochastic time series artificially generated from null hypotheses for the original time series. In this article, we test two kinds of surrogate data: the Fourier-shuffled (FS) surrogate and the amplitude-adjusted Fourier transform (AAFT, also called gaussian-scaled shuffled) surrogate. For the FS surrogate, we first obtain Fourier transform (FT) surrogate data generated from the null hypothesis that the time series is generated by a linear stochastic process. The FS surrogate data are then generated by rearranging the original time series in accordance with the rank order of the FT surrogate data. Consequently, the FS surrogate data preserve the amplitude distribution of the original time series, and at the same time, the autocorrelation of the FS surrogate data is close to that of the original time series. For the AAFT surrogate, the null hypothesis is that the time series is a linear stochastic process observed through a monotonic nonlinear function. The AAFT surrogate data preserve the amplitude distribution, and the autocorrelation is also similar to that of the original time series. The time series is suggested to be deterministic if and only if the $\mathrm{NPE}(h)$s of the FS and AAFT surrogate data are close to 1, whereas the $\mathrm{NPE}(h)$ of the original time series is significantly smaller than that of the surrogate data.
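The local predictor and the NPE described above can be sketched as follows (an illustrative reimplementation, not the authors' code; the defaults follow the text: $d = 4$, $l_0 = 12$, and the last 10% of the series is predicted from the first 90%):

```python
import numpy as np

def npe(series, h=1, d=4, l0=12, train_frac=0.9):
    """Normalized prediction error of the Sugihara-May local predictor:
    embed the series in d-dimensional delay coordinates, predict h steps
    ahead as the mean future of the l0 nearest training neighbors, and
    normalize by the error of always predicting the mean."""
    x = np.asarray(series, dtype=float)
    idx = np.arange(d - 1, len(x) - h)          # indices with full history
    emb = np.stack([x[idx - i] for i in range(d)], axis=1)
    fut = x[idx + h]
    split = int(train_frac * len(idx))
    tr_emb, tr_fut = emb[:split], fut[:split]
    preds = []
    for point in emb[split:]:
        nn = np.argsort(np.sum((tr_emb - point) ** 2, axis=1))[:l0]
        preds.append(tr_fut[nn].mean())
    err = np.asarray(preds) - fut[split:]
    base = fut[split:] - fut[split:].mean()
    return np.sqrt(np.mean(err ** 2) / np.mean(base ** 2))
```

On a deterministic chaotic series, `npe` returns a value well below 1; on a random shuffle of the same values, it stays near 1.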
To obtain the error bars, we generate 100 surrogate data sets for each kind of surrogate and calculate the standard deviations of the $\mathrm{NPE}(h)$s. The simulation results for the neural network with the IF cortical neuron are shown in Figure 2. We set $n_1 = 160$ and $\epsilon_N = 0.007$, and the result is that the cortical neuron receives $N = 143$ spikes to trigger one spike, a value that is consistent with values obtained experimentally (Diesmann, Gewaltig, & Aertsen, 1999; Koch, 1999). The timescale is chosen so that the firing rate of the cortical neuron is 20 Hz (Laurent & Davidowitz, 1994; Stopfer et al., 1997; Stopfer & Laurent, 1999). The length of $\{t'_k\}$ used for the nonlinear prediction and the surrogate data is 4096, and the last 10% of the ISIs is predicted from the first 90% with $l_0 = 12$. We find that the original ISI time series $\{t'_k\}$ made from the output spikes of the cortical neuron has an $\mathrm{NPE}(h)$ that is significantly lower than the $\mathrm{NPE}(h)$s of the surrogate data for various values of $h$ (see Figure 2b) and various embedding dimensions (see Figure 2c). Figure 2d shows the attractor reconstructed from $\{t'_k\}$ using 4096 points; this attractor is similar to the original Rössler attractor. Consequently, nonlinear determinism of $\{t'_k\}$ is strongly suggested. The IF neuron driven by superimposed spike trains can also transmit the deterministic information in the sensory stimuli.
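The two surrogate constructions used above can be sketched like this (again an illustrative reimplementation; seeding and series length are arbitrary):

```python
import numpy as np

def ft_surrogate(x, rng):
    """Fourier transform surrogate: randomize phases, keep the spectrum."""
    X = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, len(X))
    phases[0] = 0.0                       # keep the mean component real
    return np.fft.irfft(np.abs(X) * np.exp(1j * phases), len(x))

def fs_surrogate(x, rng):
    """Fourier-shuffled surrogate: reorder the original values according
    to the rank order of an FT surrogate, preserving the amplitude
    distribution exactly and the autocorrelation approximately."""
    ranks = np.argsort(np.argsort(ft_surrogate(x, rng)))
    return np.sort(x)[ranks]

def aaft_surrogate(x, rng):
    """AAFT surrogate: gaussianize by rank, phase-randomize the
    gaussianized series, then restore the original amplitude
    distribution by rank ordering."""
    g = np.sort(rng.standard_normal(len(x)))
    gaussianized = g[np.argsort(np.argsort(x))]
    ranks = np.argsort(np.argsort(ft_surrogate(gaussianized, rng)))
    return np.sort(x)[ranks]
```

Both surrogates keep exactly the same multiset of values as the original series, which is what makes the NPE comparison meaningful.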
Figure 2: Simulation results for the neural network with the IF cortical neuron. $n_1 = 160$ and $\epsilon_N = 0.007$. (A) The voltage $v(t)$ of the cortical neuron. (B) $\mathrm{NPE}(h)$ for the ISI time series $\{t'_k\}$ of the output spikes of the cortical neuron (original), the FS surrogate data (FS), and the AAFT surrogate data (AAFT). The error bars for the estimates of $\mathrm{NPE}(h)$ show the one-$\sigma$ ranges based on 100 samples. $d = 4$, $l_0 = 12$, and the length of $\{t'_k\}$ is 4096. (C) $\mathrm{NPE}(1)$ with various embedding dimensions for $\{t'_k\}$ (original), the FS surrogate data (FS), and the AAFT surrogate data (AAFT). (D) The attractor reconstructed from 4096 points of $\{t'_k\}$ in the delay coordinates with $d = 2$. Only the first 300 points are connected by lines.
We again note that the ISI of each sensory neuron carries only a small amount of information about $S(t)$. The cortical neuron aggregates the superimposed spikes from many sensory neurons to recover more precise signal information. Figure 3 shows the dependence of the prediction errors on $n_1$. We have kept $\epsilon_N n_1$ constant to generate Figure 3 so that the firing rate of the cortical neuron does not change. Figure 3 suggests that the ISI reconstruction is difficult if the cortical neuron receives incident spikes from $n_1 \le 40$ neurons.

Figure 3: Dependence of the prediction errors $\mathrm{NPE}(1)$, $\mathrm{NPE}(2)$, and $\mathrm{NPE}(3)$ on the number of sensory neurons $n_1$. $d = 4$.

To achieve signal reconstruction with a smaller $n_1$, it is more helpful to apply mechanisms that guarantee a nearly uniform distribution of $\{\theta_i\}$. For example, inhibitory interaction between the sensory neurons would keep the $\theta_i$'s almost equally separated (Mar, Chow, Gerstner, Adams, & Collins, 1999). In this case, we may reduce the $n_1$ required to keep the maximum phase separation below $\delta$ ($\delta < 1$) from $n_1 \propto \delta^{-2}$ to $n_1 \propto \delta^{-1}$. To show that the phase dispersion of the sensory neurons is essential, we examine the effect of uneven initial conditions. Figure 4 shows the dependence of the prediction errors $\mathrm{NPE}(1)$, $\mathrm{NPE}(2)$, and $\mathrm{NPE}(3)$ on the width $W$ of the initial phase distributions. The case with $W = 1$ corresponds to the totally random initial condition studied above. The prediction performance degrades as $W$ becomes smaller. When $W < 1$, the quantity $\sum_{j=j_0}^{j_0+N-1} \Delta_j$ appearing in equations 2.13 and 2.14 is subject to systematic deviations. If $\Delta_j$ in the summation involves an interval of width $1 - W$ where no $\theta_i$ can exist, then the summation is nearly equal to $1 - (n_1 - N)W/(n_1 + 1)$. Otherwise, it is nearly equal to $NW/(n_1 + 1)$. Consequently, the cortical neuron actually fires with frequent changes in threshold. The deviation increases as $W$ decreases, which causes higher prediction errors for smaller $W$s.

2.3 Generalization of Sensory Neurons. Here, we show that the mechanism proposed in section 2.2 also holds in cases with a small amount of leak and inhomogeneity in firing rates introduced in the sensory neurons.
Figure 4: Dependence of the prediction errors $\mathrm{NPE}(1)$, $\mathrm{NPE}(2)$, and $\mathrm{NPE}(3)$ on the width $W$ of the distribution of the initial phases. $d = 4$.
First, let us suppose the presence of realistic leaks in the sensory neurons. In this case, the sensory neurons tend to fire synchronously even without feedback coupling because $S(t)$ is common to all the neurons. As a result, the cortical neuron cannot collect the signal information effectively because the phases are no longer dispersed. However, noise applied to the sensory neurons yields successful attractor reconstruction because additive noise drives the sensory neurons out of synchronization and makes the phases more dispersed, so that the instants at which they fire are different. If the noises are not so large as to break the determinism of the ISIs, signal reconstruction is also possible for leaky sensory neurons. As an example, let us apply an independent gaussian noise to each sensory neuron with dynamics represented by equation 2.19. When $\gamma = 0.01\ \mathrm{ms}^{-1}$ and the standard deviation of the noise is equal to $0.105\ \mathrm{ms}^{-1}$, we have $\mathrm{NPE}(1) = 0.40$, which indicates determinism of the ISI time series. Next, we consider the effect of inhomogeneity. Real neurons are inhomogeneous even if they belong to the same functional assembly. Let us suppose inhomogeneity in the firing rates of the sensory neurons in which the firing rate of the $i$th neuron is $1 + \Delta f_i$ times larger than that of homogeneously firing neurons. We assume that $\Delta f_1, \Delta f_2, \ldots$, and $\Delta f_{n_1}$ are incommensurable so that the phase trajectory is ergodic. In this case, we can obtain the distribution of the ISI by taking the ensemble average with respect to $\theta_1, \theta_2, \ldots$, and $\theta_{n_1}$. We also suppose that $N' = N$ and that $\Delta f_i$ is small enough
to prevent multiple firings of a sensory neuron between two spikes from the cortical neuron. Noting that firing is caused by the $i$th sensory neuron with probability $(1 + \Delta f_i)/n_1$ under ergodicity, we have

$$\begin{aligned}
\mathrm{Prob}(\mathrm{ISI} \in [t, t+dt])
&= \sum_{i_s=1}^{n_1} \frac{1 + \Delta f_{i_s}}{n_1} \sum_{i_e \neq i_s} (1 + \Delta f_{i_e})\, dt
\sum_{\{\sigma_i ;\, i \neq i_e, i_s\} \in \{0,1\}^{n_1-2}} \delta\!\left(N - 1 - \sum_{i \neq i_e, i_s} \sigma_i\right) \\
&\qquad \times \prod_{i \neq i_e, i_s} \left[(1 + \Delta f_i)\, t\, \sigma_i + \left(1 - (1 + \Delta f_i)\, t\right)(1 - \sigma_i)\right] \\
&= \frac{N(N+1)\, dt}{n_1 t^2} \sum_{\{\sigma_i\} \in \{0,1\}^{n_1}} \delta\!\left(N + 1 - \sum_i \sigma_i\right)
\prod_{i=1}^{n_1} \left[(1 + \Delta f_i)\, t\, \sigma_i + \left(1 - (1 + \Delta f_i)\, t\right)(1 - \sigma_i)\right],
\end{aligned} \tag{2.15}$$
where $\sigma_i$ is equal to 1 when the $i$th neuron fires in $[0, t]$ and otherwise equals 0, and $\delta$ is the delta function. We denote by $i_s$ the index of the sensory neuron whose firing caused the last firing of the cortical neuron and by $i_e$ the index of the sensory neuron that marks the $N$th spike since then; accordingly, this spike makes the cortical neuron fire. Equation 2.15 coincides with equation 2.12 when $\Delta f_i = 0$ for all $i$. By expanding to the second order in $\Delta f_i$, we have
$$\begin{aligned}
\mathrm{Prob}(\mathrm{ISI} \in [t, t+dt])
&= (n_1 - 1)\, {}_{n_1-2}C_{N-1}\, t^{N-1} (1-t)^{n_1-N-1}\, dt \\
&\quad + \frac{N(N+1)\, dt}{n_1 t} \sum_i \Delta f_i \left\{ -\,{}_{n_1-1}C_{N+1}\, t^{N+1} (1-t)^{n_1-N-2} + {}_{n_1-1}C_{N}\, t^{N} (1-t)^{n_1-N-1} \right\} \\
&\quad + \frac{2N(N+1)\, dt}{n_1} \sum_{i<j} \Delta f_i \Delta f_j \left\{ {}_{n_1-2}C_{N+1}\, t^{N+1} (1-t)^{n_1-N-3} - 2\, {}_{n_1-2}C_{N}\, t^{N} (1-t)^{n_1-N-2} + {}_{n_1-2}C_{N-1}\, t^{N-1} (1-t)^{n_1-N-1} \right\}.
\end{aligned} \tag{2.16}$$
Figure 5: Dependence of the prediction errors $\mathrm{NPE}(1)$, $\mathrm{NPE}(2)$, and $\mathrm{NPE}(3)$ on the standard deviation of the firing-rate inhomogeneity $\Delta f_i$. We assume for $\Delta f_i$ a gaussian distribution with mean 0. $d = 4$.
Using equation 2.16, we have

$$E(t) = \frac{N}{n_1} + O(\Delta f_i^3) \tag{2.17}$$
and

$$\mathrm{Var}(t) = \frac{N(n_1 - N)}{n_1^2 (n_1 + 1)} + \frac{2N(N+1)}{(n_1 - 1)\, n_1^2 (n_1 + 1)} \sum_{i<j} \Delta f_i \Delta f_j + O(\Delta f_i^3). \tag{2.18}$$
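A quick simulation illustrates equation 2.17 (stand-in setup: regular sensory spike trains with inhomogeneous rates; per the model in the text, the cortical IF neuron fires exactly at every $N$th input spike when $\epsilon_N = 1/N$):

```python
import numpy as np

rng = np.random.default_rng(5)
n1, N = 160, 143
df = 0.05 * rng.standard_normal(n1)            # rate deviations Delta f_i
theta = rng.random(n1)                         # random initial phases

# neuron i fires at times (j - theta_i) / (1 + df_i), j = 1, 2, ...
t_end = 500.0
trains = [np.arange(1.0 - th, t_end * (1.0 + f), 1.0) / (1.0 + f)
          for th, f in zip(theta, df)]
merged = np.sort(np.concatenate(trains))

out = merged[N - 1::N]                         # one cortical spike per N inputs
mean_isi = np.mean(np.diff(out))
```

Even with a few percent of rate inhomogeneity, the mean output ISI stays close to $N/n_1$, as equation 2.17 predicts.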
Equation 2.17 indicates that the effective threshold for the cortical neuron is equal to that in the homogeneous case obtained in equation 2.13 up to $O(\Delta f_i^2)$. Equations 2.14 and 2.18 show that the dependence of $\mathrm{Var}(t)$ on $O(\Delta f_i)$ also vanishes. The dependence of the prediction errors $\mathrm{NPE}(1)$, $\mathrm{NPE}(2)$, and $\mathrm{NPE}(3)$ on $\Delta f_i$ is shown in Figure 5. The deviations $\Delta f_i$ are taken from gaussian distributions with various standard deviations. Figure 5 shows that the prediction is accurate enough even under 25% inhomogeneity in the firing rates.

2.4 The Leaky Integrate-and-Fire Neuron. Next, we examine the possibility of attractor reconstruction by IF-type neurons with a leak. The ISI reconstruction by these models was originally studied by Racicot and Longtin
(1997) and Castro and Sauer (1999) for continuous chaotic inputs with the LIF neurons (Tuckwell, 1988; Koch, 1999). The LIF neurons have also been used to investigate the collective behavior of spike-coupled neural networks (van Vreeswijk, 1996; Ernst, Pawelzik, & Geisel, 1998; Bressloff & Coombes, 2000; Masuda & Aihara, 2001). Racicot and Longtin (1997), Castro and Sauer (1999), and Suzuki et al. (2000) studied the dynamics of a single LIF neuron using the following model:

$$\frac{dx}{dt} = S(t) - \gamma x, \tag{2.19}$$
where $x(t)$ is the membrane potential, $\gamma > 0$ is the leak rate, and $S(t)$ is the chaotic signal that is essentially the same as that in equation 2.1. The LIF neuron has the same firing mechanism as the IF neuron, with threshold $x(t) = 1$ and resting potential $x(t) = 0$. In contrast to the IF neuron model, ISI attractor reconstruction and prediction of the ISI time series are possible with only a little degradation. By integrating equation 2.19, we have

$$x(t) = \int_0^t S(t')\, e^{-\gamma(t - t')}\, dt', \tag{2.20}$$
where we assume that $x(0) = 0$. Equation 2.20 shows that $S(t')$ weighted with a linear filter $e^{-\gamma(t - t')}$ is integrated to determine the next firing time (Racicot & Longtin, 1997). This filter can cause degradation or failure of the ISI reconstruction since it can seriously affect the integration process. In particular, if $S(t)$ is small while $x(t)$ is close to the threshold, the leak term $\gamma x(t)$ in equation 2.19 is large for a long time. Unlike in the IF neuron model, this situation results in a leak-induced long ISI, which can degrade the prediction. However, the ISI time series is predictable when $\gamma$ is not very large. Here, we examine the ISI reconstruction when the LIF neuron is used as the model of the cortical neuron. The sensory neurons are the same as the ones used in section 2.2. Thus, we apply the model without voltage bias (Tuckwell, 1988; Masuda & Aihara, 2001). The dynamics of the LIF neuron with spatiotemporal spike inputs is given by

$$\frac{dv}{dt} = \sum_j \epsilon_N\, \delta(t - \bar{T}_j) - \gamma v(t). \tag{2.21}$$
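Equation 2.21 can be integrated exactly between input spikes, which yields a simple event-driven update (a sketch; the threshold-and-reset convention follows the IF model above):

```python
import numpy as np

def lif_output_spikes(input_times, eps_N, gamma, v_th=1.0):
    """Event-driven integration of dv/dt = eps_N * sum_j delta(t - Tbar_j)
    - gamma * v: the potential decays by exp(-gamma * dt) between input
    spikes and jumps by eps_N at each arrival; an output spike is emitted
    (and the potential reset by v_th) when v reaches v_th."""
    v, t_prev, out = 0.0, 0.0, []
    for t in input_times:
        v *= np.exp(-gamma * (t - t_prev))   # leak between input spikes
        v += eps_N                           # instantaneous spike input
        if v >= v_th:
            out.append(t)
            v -= v_th
        t_prev = t
    return out
```

With `gamma = 0` this reduces to the IF rule of equation 2.5; with a strong leak and sparse inputs, the neuron never reaches threshold.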
Equation 2.21 coincides with equation 2.5 when $\gamma = 0$. By integrating equation 2.21, we have

$$v(T'_k) = \int_{T'_{k-1}}^{T'_k} \sum_j \epsilon_N\, e^{-\gamma(T'_k - t')}\, \delta(t' - \bar{T}_j)\, dt' = \epsilon_N \sum_{j=j_0}^{j_0+N_l} e^{-\gamma(T'_k - \bar{T}_j)} = 1, \tag{2.22}$$
which enables us to determine $T'_k$ from $T'_{k-1}$. $N_l$ in equation 2.22 can be determined uniquely. Substituting

$$\epsilon_N = \frac{1}{N} = \frac{1}{N \Delta_j} \int_{\bar{T}_j}^{\bar{T}_{j+1}} S(t')\, dt'$$

into equation 2.22, we obtain

$$\frac{1}{N} \sum_{j=j_0}^{j_0+N_l} \frac{1}{\Delta_j} \int_{\bar{T}_j}^{\bar{T}_{j+1}} S(t')\, e^{-\gamma(T'_k - \bar{T}_j)}\, dt' = 1. \tag{2.23}$$
If $\Delta_j$ is approximated by its mean given in equation 2.13, then equation 2.23 reduces to

$$\sum_{j=j_0}^{j_0+N_l} \int_{\bar{T}_j}^{\bar{T}_{j+1}} S(t')\, e^{-\gamma(T'_k - \bar{T}_j)}\, dt' = \frac{N}{n_1}. \tag{2.24}$$
Comparing equations 2.20 and 2.24, we expect the LIF neuron receiving spike inputs from the sensory neurons to operate like the LIF neuron receiving the continuous input $S(t)$ with the threshold equal to $v(t) = N/n_1$. Consequently, an LIF cortical neuron can transmit the deterministic information contained in the external stimuli and passed through spatiotemporal spikes from the sensory neurons. Equation 2.24 suggests two possible factors for the degradation that occurs in this ISI reconstruction and not in the LIF neuron with continuous inputs as studied by Racicot and Longtin (1997) and Castro and Sauer (1999). As in the IF neuron model, the variance of $\Delta_j$ given in equation 2.14 can cause the effective thresholds of the cortical neuron to change. The other factor stems from the filter weight $e^{-\gamma(T'_k - \bar{T}_j)}$ for $t' \in [\bar{T}_j, \bar{T}_{j+1}]$ instead of $e^{-\gamma(T'_k - t')}$. This effect is essentially due to the discrete-time sampling in response to spike inputs. Figure 6 shows the simulation results for the neural network with the LIF cortical neuron. The parameters for the sensory neurons and the time-series length used in the analysis are the same as for the IF neuron model. We use $\gamma = 0.055\ \mathrm{ms}^{-1}$ according to Diesmann et al. (1999) and Koch (1999). We set $n_1 = 160$ and $\epsilon_N = 0.018$ to maintain a firing rate equal to 20 Hz as in the IF neuron model. The results of the surrogate data analysis in Figures 6b and 6c imply that the ISI time series is deterministic. Furthermore, the reconstructed attractor shown in Figure 6d looks like a twisted version
Figure 6: Simulation results for the neural network with the LIF cortical neuron. $n_1 = 160$, $\epsilon_N = 0.018$, and $\gamma = 0.054\ \mathrm{ms}^{-1}$. (A) $v(t)$. (B) $\mathrm{NPE}(h)$ for $\{t'_k\}$ (original) and the surrogate data (FS, AAFT). $d = 4$, $l_0 = 12$, and the length of $\{t'_k\}$ is 4096. (C) $\mathrm{NPE}(1)$ with various embedding dimensions for $\{t'_k\}$ (original) and the surrogate data (FS, AAFT). (D) Attractor reconstructed from 4096 points of $\{t'_k\}$ in delay coordinates with $d = 2$. Only the first 300 points are connected by lines.
of the one shown in Figure 2d. Nevertheless, it preserves a certain degree of determinism, which makes accurate deterministic prediction possible. As a result, the ISI of the LIF neuron with spike inputs can encode sensory information. The prediction errors $\mathrm{NPE}(1)$, $\mathrm{NPE}(2)$, and $\mathrm{NPE}(3)$ for varying leak rate $\gamma$ are shown in Figure 7. We use $d = 4$ for the predictions and change $\epsilon_N$ so that the firing rate of the cortical neuron is kept almost constant. In this way, we eliminate the dependence of the prediction error on the firing rate. We see that the determinism of $S(t)$ is preserved with sufficiently high quality when $\gamma < 0.065\ \mathrm{ms}^{-1}$. This range for $\gamma$ includes biologically plausible values
Figure 7: Dependence of the prediction errors $\mathrm{NPE}(1)$, $\mathrm{NPE}(2)$, and $\mathrm{NPE}(3)$ on the leak rate $\gamma$. $d = 4$.
(Diesmann et al., 1999; Koch, 1999). For larger $\gamma$, the cortical neuron tends to fire only in the vicinity of a large peak of $S(t)$, where the cortical neuron receives incident spikes at a sufficiently high rate. In contrast, the cortical neuron cannot fire when $S(t)$ is small. The sensory neurons are close to the synchronized or clustered state in this case. This point will be considered again in section 3.

2.5 The FitzHugh-Nagumo Neuron. The FHN neuron model (FitzHugh, 1961; Nagumo et al., 1962) can be used to describe dynamical properties of real neurons, such as firing with thresholds, relative refractoriness, and the cooperation of fast and slow variables. The dynamics of the original FHN neuron is described by the following equations:
$$a \frac{dv}{dt} = -v(v - 0.5)(v - 1) - w + I, \qquad \frac{dw}{dt} = v - w - 0.15,$$

where $a \ll 1$, $I$ is the external input, $v$ is the fast variable related to the membrane potential, and $w$ is the slow variable. The FHN neuron is oscillatory when $I > I_0$, where $I_0$ is the Hopf bifurcation point. Attractor reconstruction with ISI time series of the FHN neuron has been investigated (Racicot &
Longtin, 1997; Castro & Sauer, 1997a, 1999) when $I$ is replaced by a continuous external signal similar to $S(t)$ defined in equation 2.2. The subthreshold external signal with noise (Castro & Sauer, 1997a) and the suprathreshold external signal (Racicot & Longtin, 1997; Castro & Sauer, 1999) have been studied. In both cases, the attractor reconstruction is difficult unless the range of the external signal is carefully chosen or the dynamics of the external signal is adequately slow (Castro & Sauer, 1999). The FHN neuron can operate as an amplitude-to-frequency converter only in limited cases, which are not biologically plausible (Racicot & Longtin, 1997). The external inputs influence the firing rate less than in the IF and LIF cortical neuron models. Here, we use the FHN neuron as the cortical neuron model. Although the FHN neuron, which possesses a property whereby the spike width depends on the driving current ($I$ or $S(t)$), is not a very good cortical neuron model, the following results will hold for other neuron models of a similar type, as we explain in section 3. The general properties associated with the incapability of attractor reconstruction will be discussed in the next section. The sensory neurons are the same as the ones used in sections 2.2 and 2.4. Since we are interested in the effect of input spikes, we use the bias current $I$ represented by

$$I = \hat{I} + \epsilon_N \sum_j \delta(t - \bar{T}_j), \tag{2.25}$$

where $\hat{I} < I_0$ and $\epsilon_N$ is the amplitude of each input spike. Consequently, the input spikes must arrive from the sensory neurons at a sufficiently high rate to make the cortical neuron fire. Figure 8 shows the simulation results for a neural network with an FHN cortical neuron. The parameters for the sensory neurons and the time-series length are the same as before. We set the firing threshold for the cortical FHN neuron at $v = 0.7$, $n_1 = 160$, $\epsilon_N = 0.015$, $a = 0.05$, and $\hat{I} = 0.145$ so that the firing rate equals 20 Hz, which is comparable to that of the previous simulations. The $\mathrm{NPE}(h)$ for $\{t'_k\}$ is close to 1 (see Figures 8b and 8c). The reconstructed attractor shown in Figure 8d is split into clusters (Racicot & Longtin, 1997). A cluster indicates how many subthreshold oscillations are contained in two successive ISIs, each of which corresponds to a coordinate in Figure 8d. Furthermore, the determinism is lost even if we consider only the restricted transitions from one specific cluster to another. These features differ qualitatively from the Rössler attractor and the attractors reconstructed from the ISIs of the IF neuron (see Figure 2d) and LIF neuron (see Figure 6d), implying that deterministic prediction is impossible. These results are consistent with those for the continuous-input cases (Racicot & Longtin, 1997; Castro & Sauer, 1997a, 1999). In the next section, we discuss these phenomena in comparison to the results for the IF neuron, the LIF neuron, and continuous inputs.
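The spike-driven FHN model of equation 2.25 can be integrated with a simple Euler scheme. The snippet below is illustrative only: it uses the parameter values quoted above, a regular stand-in input train (the original simulations use the chaotic sensory input), treats each delta-function input as a jump of $\epsilon_N/a$ in $v$ (since $a\,dv/dt = \cdots + I$), and counts upward crossings of the firing threshold:

```python
import numpy as np

a, I_hat, eps_N, v_th = 0.05, 0.145, 0.015, 0.7
dt, t_end = 1e-3, 100.0
spike_in = np.arange(0.02, t_end, 0.02)      # stand-in regular input train

v, w = 0.0, 0.0
next_in, out, v_prev = 0, [], 0.0
for step in range(int(t_end / dt)):
    t = step * dt
    # delta-function input: a jump of eps_N / a in v at each input spike
    while next_in < len(spike_in) and spike_in[next_in] <= t:
        v += eps_N / a
        next_in += 1
    dv = (-v * (v - 0.5) * (v - 1.0) - w + I_hat) / a
    dw = v - w - 0.15
    v, w = v + dt * dv, w + dt * dw
    if v_prev < v_th <= v:                   # upward threshold crossing
        out.append(t)
    v_prev = v
```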
Figure 8: Simulation results for the neural network with the FHN cortical neuron. $n_1 = 160$, $\epsilon_N = 0.015$, $a = 0.05$, $\hat{I} = 0.145$, and firing threshold $v = 0.7$. (A) $v(t)$. (B) $\mathrm{NPE}(h)$ for $\{t'_k\}$ (original) and the surrogate data (FS, AAFT) with $d = 6$. (C) $\mathrm{NPE}(1)$ with various embedding dimensions for $\{t'_k\}$ (original) and the surrogate data (FS, AAFT). (D) The attractor reconstructed from 4096 points of $\{t'_k\}$ in delay coordinates with $d = 2$. Only the first 300 points are connected by lines.
3 Mechanisms for Attractor Reconstruction by Spike-Driven Neurons
Here, we explore the mechanisms of attractor reconstruction by single neurons receiving incident spike trains with deterministic ISIs from many sensory neurons. As discussed by Racicot and Longtin (1997), an ISI roughly represents $1/S(t)$ at time $t$ (see section 1). We start by examining the continuous-input case. When the input is continuous rather than discontinuous with spikes, three factors may cause degradation of the ISI reconstruction.
First, the size of the time window for integration may contribute to degradation. If the threshold of the IF or LIF neuron is increased, reconstruction becomes poorer since a longer time integration of $1/S(t)$ must be taken between two adjacent output spikes (Sauer, 1994, 1997; Racicot & Longtin, 1997). The second factor involves how much the ISIs change when $S(t)$ changes. In principle, if the spike frequency is a monotonic function of $S(t)$, the neuron can work as an amplitude-to-frequency converter, resulting in successful ISI reconstruction (Castro & Sauer, 1999). The sensitivity of the frequency to changes in $S(t)$ is also important, although monotonicity of the frequency-amplitude curve is mathematically sufficient for the ISI reconstruction. In regard to amplitude-to-frequency conversion, biological and model neurons can be divided into two classes according to the qualitative properties of repetitive firing (Hodgkin, 1948; Koch, 1999; Izhikevich, 2000). When neurons with class I excitability begin repetitive firing, the firing starts with an infinitesimally low frequency. However, when neurons with class II excitability begin repetitive firing, they start firing with a finite frequency. In the model neurons, these classifications result from different bifurcation structures. Homoclinic bifurcations or saddle-node bifurcations on invariant circles often underlie class I excitability, whereas subcritical Hopf bifurcations typically underlie class II excitability. Strictly speaking, the quality of attractor reconstruction is not directly ascribed to the bifurcation structure itself when the repetitive firing emerges or terminates. However, this classification is also related to the frequency-modulation property of model neurons in response to changes in the external inputs. In class I neurons, the dynamic range of the firing rate is generally large (5–150 Hz) with varying bias (Hodgkin, 1948; Izhikevich, 2000).
When the bias is just slightly above the threshold, the state of the neuron remains for a long time near the saddle or the ruin of the stable fixed point that existed with subthreshold bias. For larger bias, the whole system is free from slow transient behavior, and the dynamics is governed by the limit cycle. In contrast, the frequency changes relatively little (75–150 Hz) in class II neurons (Hodgkin, 1948; Izhikevich, 2000). Both the IF and LIF neurons belong to class I (Koch, 1999; Izhikevich, 2000). With equations 2.3 and 2.20, the instantaneous firing rate $f$ at time $t$ is given as follows:

$$\text{(IF)} \qquad \frac{dx}{dt} = S(t), \qquad f = S(t);$$

$$\text{(LIF)} \qquad f = \frac{\gamma}{\ln \dfrac{S(t)}{S(t) - \gamma}}.$$
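The two rate formulas can be checked numerically ($\gamma = 0.055\ \mathrm{ms}^{-1}$ as in the text; the constant drive values are chosen only for illustration):

```python
import numpy as np

gamma = 0.055                              # leak rate (ms^-1) from the text
S = np.array([0.06, 0.1, 0.2, 0.3, 0.5])   # constant drives above gamma

f_if = S                                   # IF: dx/dt = S, threshold 1
f_lif = gamma / np.log(S / (S - gamma))    # LIF interspike interval, inverted

# sanity check of the LIF formula: integrating dx/dt = S - gamma*x from
# x(0) = 0 for one period 1/f_lif must land exactly on the threshold 1
x_at_period = (S / gamma) * (1.0 - np.exp(-gamma / f_lif))
```

The LIF rate is always below the IF rate and approaches it as $S \gg \gamma$, which is the "almost as large" dynamic range discussed next.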
The dynamic range of $f$ for the LIF neuron is almost as large as that for the IF neuron unless $S(t) - \gamma$ is only slightly positive or $S(t) \le \gamma$. In the latter case, $f$ is undefined since the LIF neuron cannot fire if $S(t)$ is always less than $\gamma$. However, the FHN neuron belongs to class II (Koch, 1999; Izhikevich,
2000). Although the amplitude of an individual action potential is sensitive to $S(t)$, the firing rate is not. The third important feature is the dependence of the external signal's effect on the phase variable $v(t)$ of the cortical neuron. Although a change in $S(t)$ should affect the firing rate to realize good signal reconstruction, the phase at which the neuron receives a spike should not affect the firing rate. This is because phase-dependent filters deteriorate the equal integration of information signals. The effect of $S(t)$ on the IF neuron does not depend on the value of $x(t)$. In contrast to the IF neuron, the exponential weight $e^{-\gamma(t - t')}$ in equation 2.20 or $e^{-\gamma(T'_k - \bar{T}_j)}$ in equation 2.24 in the integration by the LIF neuron can degrade the attractor reconstruction. This is mainly because the importance of $S(t)$ in determining the firing time depends on $t$. The effect of the inputs on the FHN neuron has a much stronger phase dependence. Only the inputs received around the threshold crossing can significantly influence the dynamics of the FHN neuron. Furthermore, $S(t)$ that precedes firing can also delay the firing if it is applied in the refractory period (Hansel, Mato, & Meunier, 1995). Although we cannot define the phase of the FHN neuron analytically except in a simplified version (Ichinose, Aihara, & Judd, 1998), it is geometrically defined by projecting the two-dimensional dynamics onto a unit circle (Hansel et al., 1995; Bressloff & Coombes, 2000). These three observations for continuous-input situations explain why attractor reconstruction is possible for the IF and LIF neurons and difficult for the FHN neuron. The properties are naturally inherited in spike-driven neurons. In spike-driven situations, phase-response curves (PRCs) are useful for describing the effect of an input spike (Hansel et al., 1995; Ichinose et al., 1998; Bressloff & Coombes, 2000; Masuda & Aihara, 2001).
The phase $\phi \in [0, 1]$ is defined so that the phase velocity is constant when the neuron is firing periodically. The resting potential and the firing threshold are usually identified with $\phi = 0$ and $\phi = 1$, respectively. The phase description can be extended to the excitable neuron models by interpreting the leak effect as phase regression (Masuda & Aihara, 2001). The PRC is the mapping from $\phi$ to $\tilde{t}(\phi) - \phi$, where $\tilde{t}(\phi)$ transforms the old phase to the new phase when an input spike arrives. Neurons with relatively flat PRCs can transmit the ISI determinism since the spikes are evaluated evenly during the cycle. For example, the phase and PRC of the IF neuron are trivially defined by

$$\phi = x \qquad (3.1)$$

and

$$\tilde{t}(\phi) - \phi = \epsilon_N, \qquad (3.2)$$

respectively. We can immediately see that every spike contributes to the firing dynamics equally. The phases for the LIF neuron are defined (Hansel
Spatiotemporal Spike Encoding of a Signal
1623
et al., 1995; Masuda & Aihara, 2001) by

$$\phi \equiv g(x) = -\frac{1}{\gamma}\ln\{1 - x(1 - e^{-\gamma})\} \qquad (3.3)$$

and

$$\tilde{t}(\phi) - \phi = g(g^{-1}(\phi) + \epsilon_N) - \phi = -\frac{1}{\gamma}\ln\{1 - \epsilon_N e^{\gamma\phi}(1 - e^{-\gamma})\} \quad (>0), \qquad (3.4)$$
respectively. In the limit $\gamma \to 0$, the definitions in equations 3.3 and 3.4 approach those in equations 3.1 and 3.2, respectively. Equation 3.4 indicates that the phase changes more when the membrane potential is closer to the threshold. This observation is consistent with what is deduced from equation 2.24. The phase dependence manifests itself more in the spike-input case than in the continuous-input case because a spike instantaneously results in a finite amount of phase change.

The effect of spikes on the FHN neuron can also be explained by the numerically obtained PRC. The PRC of the FHN neuron strongly depends on the phase; spikes have significant effects only around the threshold crossing (Hansel et al., 1995). The information contained in most of the input spikes is lost without influencing the dynamics of the FHN neuron. This is the main reason that the reconstruction is difficult. The reconstruction is possible only when the information signal changes slowly relative to the firing rate. In this situation, the phase dependence is averaged out, making the attractor reconstruction feasible (Castro & Sauer, 1999). Nevertheless, the information encoding performance would be low in this case since only slowly changing signals are recognized.

The condition required for the attractor reconstruction is related to but different from the criteria given by Hansel et al. (1995), where synchronization of neurons coupled by gap junctions is discussed. The neurons tend to desynchronize when each neuron has a PRC with $\tilde{t}(\phi) - \phi > 0$ for all $\phi$ (type I) and synchronize when the PRC has two regions: one with $\tilde{t}(\phi) - \phi > 0$ and the other with $\tilde{t}(\phi) - \phi < 0$ (type II). Indeed, the IF and LIF neurons belong to the type I class, and the FHN neuron belongs to the type II class. Unlike type II neurons, many type I neurons are capable of attractor reconstruction because the PRCs of type I neurons are relatively flat on many occasions.
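Equations 3.3 and 3.4 can be checked numerically. The sketch below uses the leak rate $\gamma$ and spike size $\epsilon_N$ of those equations with hypothetical parameter values; it confirms that the LIF PRC flattens toward the constant IF PRC as $\gamma \to 0$ and grows with the phase for larger $\gamma$:

```python
import numpy as np

def g(x, gamma):
    # Phase of the LIF neuron as a function of membrane potential (eq. 3.3).
    return -np.log(1.0 - x * (1.0 - np.exp(-gamma))) / gamma

def g_inv(phi, gamma):
    # Membrane potential for a given phase (inverse of eq. 3.3).
    return (1.0 - np.exp(-gamma * phi)) / (1.0 - np.exp(-gamma))

def prc_lif(phi, gamma, eps):
    # Phase advance caused by a single input spike of size eps (eq. 3.4).
    return g(g_inv(phi, gamma) + eps, gamma) - phi

phi = np.linspace(0.0, 0.9, 10)
eps = 0.01
flat = prc_lif(phi, gamma=1e-6, eps=eps)  # near-IF limit: approximately constant
steep = prc_lif(phi, gamma=3.0, eps=eps)  # strong leak: grows toward threshold
```

The flat curve is the reason IF-like (type I) neurons evaluate spikes evenly over the cycle, while the steep curve weights spikes near threshold more heavily.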
We must take three more factors into consideration to understand the dynamics of spike-input neurons. One is the time discretization in sampling. This is a consequence of the fact that the neuron receives external information in the form of point processes with instantaneous spikes. As shown in section 2.4, the weighted integration of the signal is replaced by the weighted discrete sum. Another factor is the uncertainty in the spike arrival time caused by the random initial conditions of the sensory neurons in the framework presented here and other noise sources or inhomogeneity among neurons in more general situations. The variance in the spike arrival
time is evaluated in equation 2.14, which shows the asymptotics $\propto n_1^{-1}$ for large $n_1$. The last factor is the distinction between oscillatory dynamics and excitable dynamics. We have concentrated on the excitable cases to explore spike effects. There also exist oscillatory schemes corresponding to each of the three neuron models discussed here. In oscillatory schemes, the constant biases applied to the neuronal dynamics make the neurons fire even in the absence of spike inputs. For example, the FHN neuron is inherently oscillatory when $I$ in equation 2.25 is larger than $I_0$. Generally, when the bias is stronger, the dynamics of the output ISI is governed more by the biases than by the external inputs. In other words, the properties of the ISI time series are determined by the balance between the strength of the external input and that of the inherent neuronal dynamics comprising systematic biases applied to the neurons.

In summary, the ISIs of the spike-driven neurons are deterministic when the corresponding neurons with continuous inputs are deterministic. In both cases, certain conditions are important for signal reconstruction: the spike effect should be independent of the phase, and the change in the input signal amplitude or the input spike frequency should change the output spike frequency (class I property).
4 Conclusion
We have investigated the possibility of signal reconstruction in single cortical neurons receiving incident spike trains from many sensory neurons with ISIs that reflect deterministic characteristics of continuous external stimuli. The IF and LIF neurons can transmit the dynamical and deterministic information even though they receive superimposed spike trains in which the deterministic ISI structures of the incident spike trains from sensory neurons interfere with each other. The noise in the sensory neurons is important for spike-driven cortical neurons to collect signal information effectively. The spike-driven FHN neurons, however, with biologically plausible parameters, cannot adequately encode sensory stimuli. For these three types of cortical neuron models, we have discussed the conditions for ISI reconstruction: the spike effect should not depend on the phase $v(t)$ of the cortical neuron and should depend monotonically on the input frequency. The difference between the spike-input models and the continuous-input models is ascribed to the difference between the discrete and continuous sampling of the information signal. Our results support the notion that the attractor reconstruction mechanism is an information processing scheme in the brain as well as a useful tool for investigating the effects of external and feedback spike inputs.

The next problem we will approach is the two-layered neural network with multiple cortical neurons. Many experiments have shown the existence of synchronously firing neurons, and this synchronization is believed to play
an important role in information coding based on spatiotemporal functional assemblies (Abeles, 1991; Eckhorn et al., 1988; Gray et al., 1989; Laurent & Davidowitz, 1994; Vaadia et al., 1995; Riehle et al., 1997; Stopfer et al., 1997; Stopfer & Laurent, 1999; Brecht et al., 1999; Rodriguez et al., 1999; Steinmetz et al., 2000). The mechanisms of synchronization have also been investigated with various neuron models (Hansel et al., 1995; van Vreeswijk, 1996; Ernst et al., 1998; Diesmann et al., 1999; Bressloff & Coombes, 2000; Masuda & Aihara, 2001). We expect inter-event intervals with synchronous firing events (Tokuda & Aihara, 2000) to preserve deterministic information of external continuous stimuli, which may be relevant to robust information processing in the biological brain.

Acknowledgments
We thank I. Tokuda at Muroran Institute of Technology, Japan, for helpful discussions and suggestions. This work is supported by the Japan Society for the Promotion of Science and CREST, JST.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Brecht, M., Singer, W., & Engel, A. K. (1999). Patterns of synchronization in the superior colliculus of anesthetized cats. J. of Neurosci., 19(9), 3567–3579.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Computation, 12, 91–129.
Castro, R., & Sauer, T. (1997a). Chaotic stochastic resonance: Noise-enhanced reconstruction of attractors. Phys. Rev. Lett., 79(6), 1030–1033.
Castro, R., & Sauer, T. (1997b). Correlation dimension of attractors through interspike intervals. Phys. Rev. E, 55(1), 287–290.
Castro, R., & Sauer, T. (1999). Reconstructing chaotic dynamics through spike filters. Phys. Rev. E, 59(3), 2911–2917.
Chang, T., Schiff, S. J., Sauer, T., Gossard, J-P., & Burke, R. E. (1994). Stochastic versus deterministic variability in simple neuronal circuits: I. Monosynaptic spinal cord reflexes. Biophysical J., 67(2), 671–683.
Diesmann, M., Gewaltig, M-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Ernst, U., Pawelzik, K., & Geisel, T. (1998). Delay-induced multistable synchronization of biological oscillators. Phys. Rev. E, 57(2), 2150–2162.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophysical J., 1, 445–465.
Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(8), 1303–1350.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338(23), 334–337.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337.
Hodgkin, A. L. (1948). The local electric changes associated with repetitive action in a non-medullated axon. Journal of Physiology, 107, 165–181.
Ichinose, N., & Aihara, K. (1998). Extracting chaotic signals from a superimposed pulse train. Proc. of 1998 Int. Symposium on Nonlinear Theory and Its Applications, 2, 695–698.
Ichinose, N., Aihara, K., & Judd, K. (1998). Extending the concept of isochrons from oscillatory to excitable systems for modeling an excitable neuron. Int. J. Bifurcation and Chaos, 8(12), 2375–2385.
Isaka, A., Yamada, T., Takahashi, J., & Aihara, K. (2001). Analysis of superimposed pulse trains with higher-order interspike intervals. Trans. IEICE, J84-A(3), 309–320 (in Japanese).
Izhikevich, E. M. (2000). Neural excitability, spiking and bursting. Int. J. Bifurcation and Chaos, 10(6), 1171–1266.
Judd, K. T., & Aihara, K. (1993). Pulse propagation networks: A neural network model that uses temporal coding by action potentials. Neural Networks, 6, 203–215. (Errata. 1994. Neural Networks, 7, 1491.)
Judd, K., & Aihara, K. (2000). Generation, recognition and learning of recurrent signals by pulse propagation networks. Int. J. Bifurcation and Chaos, 10(10), 2415–2428.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Laurent, G., & Davidowitz, H. (1994). Encoding of olfactory information with oscillating neural assemblies. Science, 265, 1872–1875.
MacLeod, K., Bäcker, A., & Laurent, G. (1998). Who reads temporal information contained across synchronized and oscillatory spike trains? Nature, 395, 693–698.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons.
Science, 268, 1503–1506.
Mar, D. J., Chow, C. C., Gerstner, W., Adams, R. W., & Collins, J. J. (1999). Noise shaping in populations of coupled model neurons. Proc. Natl. Acad. Sci. USA, 96, 10450–10455.
Masuda, N., & Aihara, K. (2001). Synchronization of pulse-coupled excitable neurons. Phys. Rev. E, 64(5), 051906.
Nagumo, J., Arimoto, S., & Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proc. of the IRE, 50, 2061–2070.
Racicot, D. M., & Longtin, A. (1997). Interspike interval attractors from chaotically driven neuron models. Physica D, 104, 184–204.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20(14), 5392–5400.
Richardson, K. A., Imhoff, T. T., Grigg, P., & Collins, J. J. (1998). Encoding chaos in neural spike trains. Phys. Rev. Lett., 80(11), 2485–2488.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Rodriguez, E., George, N., Lachaux, J-P., Martinerie, J., Renault, R., & Varela, F. J. (1999). Perception's shadow: Long-distance synchronization of human brain activity. Nature, 397, 430–433.
Rössler, O. E. (1976). An equation for continuous chaos. Phys. Lett. A, 57, 397–398.
Sauer, T. (1994). Reconstruction of dynamical systems from interspike intervals. Phys. Rev. Lett., 72(24), 3811–3814.
Sauer, T. (1997). Reconstruction of integrate-and-fire dynamics. Field Institute Communications, 11, 63–75.
Schiff, S. J., Jerger, K., Chang, T., Sauer, T., & Aitken, P. G. (1994). Stochastic versus deterministic variability in simple neuronal circuits: II. Hippocampal slice. Biophysical J., 67(2), 684–691.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390, 70–74.
Stopfer, M., & Laurent, G. (1999). Short-term memory in olfactory network dynamics. Nature, 402, 664–668.
Sugihara, G., & May, R. M. (1990). Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series. Nature, 344, 734–740.
Suzuki, H., Aihara, K., Murakami, J., & Shimozawa, T. (2000). Analysis of neural spike trains with interspike interval reconstruction. Biol. Cybern., 82, 305–311.
Theiler, J., Eubank, S., Longtin, A., Galdrakian, B., & Farmer, J. D. (1992). Testing for nonlinearity in time series: The method of surrogate data. Physica D, 58, 77–94.
Tiesinga, P. H., Fellous, J.-M., José, J. V., & Sejnowski, T. J. (2001). Optimal information transfer in synchronized neocortical neurons.
Neurocomputing, 38–40, 397–402.
Tiesinga, P. H. E., & Sejnowski, T. J. (2001). Precision of pulse-coupled networks of integrate-and-fire neurons. Network, 12, 215–233.
Tokuda, I., & Aihara, K. (2000). Inter-event interval reconstruction of chaotic dynamics. In Proc. of the Fifth International Symposium on Artificial Life and Robotics (AROB 5th'00) (pp. 177–180).
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 1). Cambridge: Cambridge University Press.
Vaadia, E., Haalman, L., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
van Steveninck, R. R. de R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54(5), 5522–5537.
Warzecha, A-K., & Egelhaaf, M. (1999). Variability in spike trains during constant and dynamic stimulation. Science, 283, 1927–1930.

Received May 8, 2001; accepted November 2, 2001.
LETTER
Communicated by Jack Cowan
Attractor Reliability Reveals Deterministic Structure in Neuronal Spike Trains

P. H. E. Tiesinga
[email protected]
Sloan-Swartz Center for Theoretical Neurobiology and Computational Neurobiology Lab, Salk Institute, La Jolla, CA 92037, U.S.A.

J.-M. Fellous
[email protected]
Howard Hughes Medical Institute and Computational Neurobiology Lab, Salk Institute, La Jolla, CA 92037, U.S.A.

Terrence J. Sejnowski
[email protected]
Sloan-Swartz Center for Theoretical Neurobiology and Computational Neurobiology Lab, Salk Institute; Howard Hughes Medical Institute, Salk Institute and Department of Biology, University of California–San Diego, La Jolla, CA 92037, U.S.A.

When periodic current is injected into an integrate-and-fire model neuron, the voltage as a function of time converges from different initial conditions to an attractor that produces reproducible sequences of spikes. The attractor reliability is a measure of the stability of spike trains against intrinsic noise and is quantified here as the inverse of the number of distinct spike trains obtained in response to repeated presentations of the same stimulus. High reliability characterizes neurons that can support a spike-time code, unlike neurons with discharges forming a renewal process (such as a Poisson process). These two classes of responses cannot be distinguished using measures based on the spike-time histogram, but they can be identified by the attractor dynamics of spike trains, as shown here using a new method for calculating the attractor reliability. We applied these methods to spike trains obtained from current injection into cortical neurons recorded in vitro. These spike trains did not form a renewal process and had a higher reliability compared to renewal-like processes with the same spike-time histogram.

1 Introduction
Neural Computation 14, 1629–1650 (2002) © 2002 Massachusetts Institute of Technology

Features that are present in the spike patterns elicited in response to a stimulus repeated across multiple trials can form the basis of a neuronal code. Here, we introduce a novel reliability measure in order to study the
reproducibility of sequences of precise spike times produced in vitro by cortical neurons (Mainen & Sejnowski, 1995). We show that spike trains obtained from in vitro cortical neurons and integrate-and-fire model neurons have more deterministic structure than spike trains obtained from renewal processes with the same interspike-interval and spike-time probability distribution. This structure cannot be detected from the spike-time histogram.

The response of a neuron to a stimulus can be characterized as an attractor, defined as the voltage trajectory (and associated spike train) to which the neuron's membrane potential converges from different initial conditions (Strogatz, 1994; Jensen, 1998). An attractor is stable against noise: weak noise introduces spike-time jitter, but the sequence of spike times and the input-induced correlations between interspike intervals are conserved. Here, the sequence of spike times is considered the output signal. When the neuron stays on the attractor, it transmits a unique signal. The dynamics of a model neuron can undergo a bifurcation when its parameter values are varied. During a bifurcation, a small change in the value of a parameter introduces a large change in the spike times: a different attractor emerges. When a neuron is close to a bifurcation point, noise can induce a transition to another attractor. If the new "attractor" is unstable, the neuron will return to the original attractor after a finite time, but if it is stable, the neuron stays on the new attractor. The attractor reliability is defined as the inverse of the number of distinct spike trains that are obtained after a large number of trials.

This article consists of two parts. First, a test is given to determine whether a spike train forms a temporally modulated renewal process. Second, we quantify the extra (non-Poisson) structure present in spike trains using the attractor reliability.
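A minimal sketch of this definition of attractor reliability follows. Grouping trials into identical spike trains requires some tolerance for jitter; binning the spike times, as done here, is an illustrative simplification rather than the procedure used in this article:

```python
import numpy as np

def attractor_reliability(trials, bin_width):
    """1 / (number of distinct spike trains observed across trials).

    Two trials count as the same spike train when their spikes fall into
    the same sequence of time bins (a crude jitter tolerance).
    """
    patterns = {tuple(np.floor(np.asarray(t) / bin_width).astype(int))
                for t in trials}
    return 1.0 / len(patterns)

# Hypothetical data: trials visit one of two attractors with small jitter.
trials = [[10.1, 30.2], [10.3, 30.1], [20.2, 41.0], [20.4, 41.2]]
r = attractor_reliability(trials, bin_width=5.0)  # 2 distinct trains -> r = 0.5
```

A perfectly reliable neuron produces one distinct train (r = 1), while a renewal process produces a different train on almost every trial, driving r toward 1/(number of trials).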
These techniques are illustrated using spike trains obtained from experimental data and numerical model simulations. A systematic study of the attractor and bifurcation structure of in vitro and model neurons driven by periodic, quasiperiodic, and random currents will be presented elsewhere.

2 Methods

2.1 Experimental Methods. The voltage response of cortical neurons as measured in a rat slice preparation was described previously (Fellous et al., 2001). Protocols for these experiments were approved by the Salk Institute Animal Care and Use Committee, and they conform to U.S. Department of Agriculture regulations and National Institutes of Health guidelines for humane care and use of laboratory animals. Briefly, coronal slices of rat prelimbic and infralimbic areas of prefrontal cortex were obtained from 2- to 4-week-old Sprague-Dawley rats. Rats were anesthetized with isoflurane and decapitated. Their brains were removed and cut into 350 μm thick slices on a Vibratome 1000. Slices were then transferred to a submerged chamber containing artificial cerebrospinal fluid (ACSF, mM: NaCl, 125; NaHCO3, 25;
D-glucose, 10; KCl, 2.5; CaCl2, 2; MgCl2, 1.3; NaH2PO4, 1.25) saturated with 95% O2/5% CO2 at room temperature. Whole-cell patch clamp recordings were achieved using glass electrodes (4–10 MΩ) containing (mM: KMeSO4, 140; Hepes, 10; NaCl, 4; EGTA, 0.1; Mg-ATP, 4; Mg-GTP, 0.3; phosphocreatine, 14). Patch clamp was performed under visual control at 30–32°C. In most experiments, Lucifer yellow (RBI, 0.4%) or Biocytin (Sigma, 0.5%) was added to the internal solution for morphological identification. In all experiments, synaptic transmission was blocked by D-2-amino-5-phosphonovaleric acid (D-APV; 50 μM), 6,7-dinitroquinoxaline-2,3-dione (DNQX; 10 μM), and bicuculline methiodide (Bicc; 20 μM). All drugs were obtained from RBI or Sigma, freshly prepared in ACSF, and bath applied. Data were acquired with Labview 5.0 and a PCI-16-E1 data acquisition board (National Instruments) and analyzed with MATLAB (The Mathworks). We used regularly spiking layer 5 pyramidal cells that were identified morphologically.

2.2 Simulation Algorithm. The membrane potential $V$ of an integrate-and-fire neuron driven by a periodic current satisfied,
$$\frac{dV}{dt} = -V + I + A \sin\frac{2\pi t}{T} + \xi(t), \qquad (2.1)$$
where $I$ was a time-independent driving current, $A$ was the amplitude of the drive, $T = 2$ was the period, and $\xi$ was a white-noise current, with zero mean and variance $D$, that represented the effects of intrinsic noise. When the voltage $V$ reached threshold, $V(t) = 1$, a spike was emitted, and the voltage was instantaneously reset to zero, $V(t) = 0$. Dimensionless units were used in model simulations. One voltage unit corresponded to the distance between resting membrane potential and action potential threshold, approximately 20 mV; one time unit corresponded to the membrane time constant, approximately 40 ms. Equation 2.1 was integrated directly using the fourth-order Runge-Kutta algorithm (Press, Teukolsky, Vetterling, & Flannery, 1992), with step size $dt = 0.01$. The calculated voltage differed by less than $10^{-11}$ from the voltage obtained by analytically integrating equation 2.1 for a sinusoidal current and with $D = 0$. The spike time $t_s$, given by the expression $V(t_s) = 1$, was determined by linear interpolation (Hansel, Mato, Meunier, & Neltner, 1998).

3 Results

3.1 Example of Deterministic Structure in Spike Trains. In most experiments, the same stimulus was presented on different trials, and the resulting spike trains were analyzed. The spike trains of a hypothetical experiment are presented in a rasterplot in Figure 1Aa. The computation of reliability and precision based on the spike-time histogram (STH) followed the procedure in Mainen and Sejnowski (1995). The length of a trial was
Figure 1: Reliability based on the spike-time histogram is insensitive to deterministic structure in spike trains. (A, B) The same spike times, with the trials ordered differently. On each row: (a) Rasterplot, with each measured spike represented as a circle; its x-ordinate is the spike time and its y-ordinate the trial index; (b) the spike-time histogram (STH), the number of spikes in a particular time bin normalized by the number of trials. Events were defined as threshold crossings of the spike-time histogram. The event reliability was the fraction of trials during which a spike occurred during the event; the precision was the inverse of the standard deviation of the spike times in the event. The reliability of the response, $R_{STH}$, was the event reliability averaged over all events. (A) In each of the 100 trials, a spike was present during the first event; hence, its reliability was 1. (B) The trials of A were composed of two distinct spike trains, indicated by the filled and open circles in a. In b, the resulting STH for each group is plotted separately as filled and open peaks, respectively. The first peak in A was in fact made up of two separate events.
divided into discrete bins, and the number of spikes that fell in each bin was counted. The spike count in a bin was normalized by the number of trials. The set of bins was convolved with a smoothing function. The STH so obtained usually contained a number of peaks, each consisting of a set of consecutive bins that contain more spike counts than average. Peaks were
detected by setting a threshold and determining when the smoothed STH crossed it (see Figure 1Ab). The peaks constitute events; the reliability of an event was the fraction of trials during which a neuron spiked during the event. The reliability of the spike train, $R_{STH}$, was the reliability of each event averaged over all events. (An equivalent procedure for $R_{STH}$ is given in equation 3.5.) The precision of an event was the inverse of the standard deviation of the spike times that were part of the peak. The reliability based on the STH depended on the size of the bins, the smoothing function, and the value of the peak-detection threshold. The STH reliability so defined was insensitive to deterministic structure in the spike trains.

An example of how STH reliability can miss important deterministic structure in spike trains is shown in Figure 1. The data shown in Figure 1Aa were obtained by randomly mixing two types of spike trains: the open and filled circles in Figure 1Ba. Hence, Figures 1Aa and 1Ba are the same, except that the trials are reordered. On each trial, there was a spike present in the first event in Figure 1Ab; hence, its "reliability" was 1. However, that single event in fact consisted of two events (see Figure 1Bb). In other words, there were two distinct spike trains (attractors) present across different trials.
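The STH-based procedure just described can be sketched as follows; the bin width, boxcar smoothing kernel, and threshold value are arbitrary illustrative choices, not taken from the article:

```python
import numpy as np

def sth_reliability(trials, t_max, bin_width=1.0, kernel_width=3, threshold=0.2):
    """Reliability from the spike-time histogram.

    Bin the spikes, normalize by the trial count, smooth by convolution,
    find events as contiguous supra-threshold runs, and average the
    per-event reliability (fraction of trials with a spike in the event).
    """
    n_bins = int(np.ceil(t_max / bin_width))
    edges = np.arange(n_bins + 1) * bin_width
    counts = np.zeros(n_bins)
    for t in trials:
        counts += np.histogram(t, bins=edges)[0]
    sth = counts / len(trials)
    smooth = np.convolve(sth, np.ones(kernel_width) / kernel_width, mode="same")
    # Split supra-threshold bins into contiguous events.
    events, current = [], []
    for i, flag in enumerate(smooth > threshold):
        if flag:
            current.append(i)
        elif current:
            events.append(current)
            current = []
    if current:
        events.append(current)
    reliabilities = []
    for ev in events:
        lo, hi = edges[ev[0]], edges[ev[-1] + 1]
        frac = np.mean([np.any((np.asarray(t) >= lo) & (np.asarray(t) < hi))
                        for t in trials])
        reliabilities.append(frac)
    return np.mean(reliabilities)

# Hypothetical trials with events near t = 10 and t = 30.
trials = [[10.2, 30.1], [10.4, 30.3], [10.1, 29.9], [10.3]]
r_sth = sth_reliability(trials, t_max=50.0)
```

On data like Figure 1A, where a single histogram peak hides two distinct attractors, this measure stays high, which is exactly the insensitivity described above.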
3.2 Testing of the Renewal Property. A temporally modulated renewal spike train is fully characterized by its spike-time probability density and its interspike-interval distribution. The defining property of a renewal spike train is that the intervals between spikes are independent. There is no deterministic structure in renewal spike trains apart from the structure induced by a time-varying firing rate (see below). In contrast, for an attractor, the spikes occur at particular times and in a particular sequence (see section 1): there are correlations between the spike times, and there is deterministic structure in the spike train. Hence, it is different from a renewal process. Spike trains that form a renewal process are not reliable. It is therefore important to distinguish renewal processes from nonrenewal processes (Reich, Victor, & Knight, 1998).

The Poisson process is an example of a renewal process with rate $\lambda$. The probability of finding a spike between times $t$ and $t + dt$ is $\lambda\,dt$ and does not depend on previous spike times. Hence, the spike times themselves are also independent. The probability $\lambda$ is time independent and is estimated from the spike-time histogram. The distribution of interspike intervals is exponential; the probability of obtaining an interspike interval between $\tau$ and $\tau + d\tau$ is $\lambda e^{-\lambda\tau}\,d\tau$. There is no correlation between the lengths of consecutive intervals. The coefficient of variation (CV), the standard deviation divided by the mean of $\tau$, is equal to 1. Another example of a renewal process is a gamma process of order $r$. The probability of obtaining a spike between times $t$ and $t + dt$ again is $\lambda\,dt$. However, the distribution of interspike intervals is given by a gamma probability distribution and is less disperse, with $CV = 1/\sqrt{r}$ and $r > 1$. The gamma process of integer order is closely related
to the Poisson process. For instance, a gamma process of order 2 is obtained from a Poisson process of twice the rate, $2\lambda$, by removing every other spike.

Shuffling procedures can form the basis for a test of the renewal character of spike trains. First, generate a number of different spike trains (trials) using the same process. Then shuffle the spike times randomly across different trials. The probability of obtaining a spike at a given time is conserved, since exactly the same spike times were used. If the spikes were generated by a Poisson process, then they are independent of each other. Hence, the new set of spike trains cannot be distinguished statistically from the original set of spike trains. The shuffling procedure leads to an exponential distribution of interspike intervals, and CV = 1. For spike trains generated by a gamma process of order $r$, $CV = 1/\sqrt{r}$. However, after shuffling, the CV changes to 1, and the new set of spike trains can be distinguished from the original set of spike trains by comparing the CVs. Hence, shuffling of spike times is not appropriate for a gamma process. However, when the interspike intervals are shuffled randomly across trials and then used to calculate the new spike times, the interspike-interval distribution is conserved. Since the intervals in a renewal process are independent, the new set of spike trains should be indistinguishable from the original set. Note that the actual spike times are not conserved. As a result, the spike-time probability (the spike-time histogram) is different at the boundaries, close to the beginning and end of the trial. However, at some distance from the boundaries, it is approximately the same. An example is discussed below.

The spike-time histogram obtained from neural spike trains usually is time dependent: $\lambda(t)$ is a function of time. The spike trains may still form a temporally modulated renewal process; the probability of obtaining a spike between $t$ and $t + dt$ is $\lambda(t)\,dt$.
The time dependence of $\lambda(t)$ makes testing of the renewal character difficult, since the distribution of interspike intervals depends on time via $\lambda(t)$. However, temporally modulated renewal processes (referred to as "simply modulated renewal processes" in Reich et al., 1998) can be mapped into a time-independent (homogeneous) renewal process by "transforming" time. In what follows, we describe the test of the renewal property in detail. It consisted of three steps. First, time $t$ was mapped into a new time $s$. Then the interspike intervals were shuffled randomly across trials. Finally, the test statistic was evaluated, and its significance was determined. The procedure is applied to spike trains obtained from repeated current injection into cortical neurons in vitro.

3.2.1 Experimental Data Set. The stimulus wave forms shown in Figure 2Ac were injected into a cortical neuron in current-clamp mode. The stimulus consisted of a 200 ms constant depolarizing current followed by a 1600 ms sinusoidal current with period 100 ms. The height of the initial depolarizing pulse was varied to bring the neuron to different initial voltages at the start of the sinusoidal wave form. For a large enough amplitude, a spike was obtained during the current pulse (see Figure 2Ab). After a transient
Figure 2: Effect of initial condition on spike timing in cortical neurons. The stimulus wave form consisted of a 200 ms long constant depolarizing current followed by a 1600 ms long sinusoidal current with period 100 ms. The amplitude of the initial depolarizing current was varied over 11 different values. Each of the resulting wave forms was presented 20 times, yielding 220 trials. (A) (a, b) Recorded voltage response during the presentation of the first and eleventh stimulus wave forms, respectively; (c) the 11 stimulus wave forms. (B, C) In each row, (a) the rastergram and (b) the spike-time histogram are plotted using (B) the experimental data; (C) Poisson spike trains that were obtained by randomly shuffling spike times across trials. The spike-time histograms were identical, and the rastergrams looked similar. However, the experimental spike trains did not form a renewal process (see Figure 4).
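The two shuffling procedures contrasted in section 3.2 can be sketched numerically. The rate, trial count, and random seed below are arbitrary; the gamma process of order 2 is built, as described above, by removing every other spike of a Poisson process of twice the rate:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n_trials, n_spikes = 1.0, 200, 200

# Gamma process of order 2: thin a Poisson process of rate 2*lam by
# keeping every other spike; the ISI distribution has CV = 1/sqrt(2).
times = np.cumsum(rng.exponential(1 / (2 * lam), (n_trials, 2 * n_spikes)), axis=1)
trials = times[:, 1::2]

def isi_cv(spike_trains):
    isi = np.diff(spike_trains, axis=1).ravel()
    return isi.std() / isi.mean()

cv_orig = isi_cv(trials)  # close to 1/sqrt(2), about 0.71

# Shuffling spike TIMES across trials destroys the gamma statistics:
# each surrogate trial draws from the pooled spike times, so its ISIs
# become approximately exponential (CV near 1).
pool = rng.permutation(trials.ravel()).reshape(trials.shape)
cv_time_shuffle = isi_cv(np.sort(pool, axis=1))

# Shuffling ISIs across trials preserves the ISI distribution exactly,
# so a renewal process is statistically indistinguishable from its
# ISI-shuffled surrogate.
isis = np.diff(trials, axis=1, prepend=0.0)
cv_isi_shuffle = isi_cv(np.cumsum(rng.permuted(isis, axis=0), axis=1))
```

Comparing a statistic such as the CV between the original and the ISI-shuffled trains is the basis of the renewal test: a significant difference indicates non-renewal (deterministic) structure.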
1636
P. H. E. Tiesinga, J.-M. Fellous, and Terrence J. Sejnowski
of approximately three cycles, the neuron generated on average about one action potential on every two cycles. The rastergram and spike-time histogram of the spike trains obtained during 20 presentations of 11 different wave forms (220 trials) are shown in Figure 2B. A Poisson process with the same spike-time histogram was generated by randomly shuffling the spike times across different trials (see Figure 2C).

3.2.2 Mapping the Spike Times. The spike-time probability λ(t) was estimated from the spike-time histogram P(t). Note that P(t) was evaluated in discrete bins; however, for simplicity, the notation used here does not reflect this. Time t was transformed into a new time variable s such that the probability of obtaining a spike between s and s + ds was independent of time s (Reich et al., 1998). This procedure transformed an inhomogeneous renewal process into a homogeneous renewal process. The transformation was $t \to s = g(t)$ with

$$ g(t) = T_{STH}\,\frac{\int_0^t du\, P(u)}{\int_0^{T_{STH}} du\, P(u)}, \tag{3.1} $$

where u was an integration variable and $T_{STH}$ was the length of each trial. The transformation $t_n^i \to s_n^i$ was performed without explicitly calculating P(t) (see Figure 3). Let $\{t_1^i, \ldots, t_{N_i}^i\}$ be the set of $N_i$ spike times during the ith trial, $i = 1, \ldots, N_{tr}$, where $N_{tr}$ was the number of trials (see Figure 3A). The set of all spikes in all trials was collected into one set labeled by a dummy index j, $\{t_1, \ldots, t_M\}$, and sorted in increasing value, $t_{j(1)} \le t_{j(2)} \le \cdots \le t_{j(k)} \le \cdots \le t_{j(M)}$. Here, $M = \sum_{i=1}^{N_{tr}} N_i$ was the total number of spikes across all trials (see Figure 3B). The transformed time was

$$ s_n^i = \frac{k(j)\, T_{STH}}{M}. \tag{3.2} $$
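The rank construction of equation 3.2 is compact enough to sketch in code. Below is an illustrative Python version; the function and variable names are ours, not from the original, and the input format (one array of spike times per trial) is an assumption.

```python
import numpy as np

def rescale_spike_times(trials, T_sth):
    """Map spike times t -> s = g(t) via ranks (eq. 3.2).

    trials : list of 1-D arrays, spike times per trial
    T_sth  : trial length; rescaled times fall in (0, T_sth]
    Returns a list of arrays with the rescaled times s_n^i.
    """
    # Pool all spikes across trials.
    all_times = np.concatenate(trials)
    M = all_times.size
    # Rank k(j) of each spike in the sorted pooled set (1-based).
    ranks = np.empty(M, dtype=float)
    ranks[np.argsort(all_times, kind="stable")] = np.arange(1, M + 1)
    s = ranks * T_sth / M          # eq. 3.2: s = k * T_STH / M
    # Split the rescaled times back into their trials.
    bounds = np.cumsum([len(tr) for tr in trials])[:-1]
    return np.split(s, bounds)
```

By construction, the pooled rescaled times are uniformly spaced on (0, T_sth], which is exactly the flat spike-time histogram of Figure 3D.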
Here, k(j) was the index of the jth spike time $t_j = t_n^i$ in the sorted set, i was the trial index, and n was the spike index (see Figure 3C). For the experimental data set, the rasterplot of transformed spike times $s_n^i = g(t_n^i)$ consisted of dots that were uniformly distributed in the plane, and the STH was constant (see Figures 4Aa and 4Ab).

3.2.3 Random Shufflings of the Interspike Intervals. The interspike intervals in transformed time were defined as $\nu_n^i = s_n^i - s_{n-1}^i$. The 0th spike time was defined as $s_0^i = 0$, so that there are as many intervals as there are spikes; however, $s_0^i$ was not included in the analysis (and was not shown in the graphs). The intervals $\nu_n^i$ were randomly shuffled across trials, and the new spike times were then obtained from the shuffled intervals $\hat\nu_n^i$ as $\hat s_n^i = \sum_{j=1}^{n} \hat\nu_j^i$ (j is a dummy index). This procedure was repeated to obtain
Figure 3: Procedure to obtain a homogeneous renewal process. (A) Spike trains obtained on different trials. $t_n^i$ was the nth spike time in the ith trial. (B) Spike times across all trials were combined into one set and sorted from low to high values (the index is indicated below the ticks). (C) The index k in the ordered set as a function of the spike time. The new spike time was $s_n^i = k T_{STH}/M$ and took values between 0 and $T_{STH}$. (D) The resulting spike-time histogram was time independent.
$N_s$ different independent realizations of the corresponding renewal process. These realizations will be referred to as surrogate spike trains. One realization generated from the experimental data set is shown in Figure 4B. As mentioned before, the STH (see Figure 4Bb) is reduced compared with the STH of the original spike trains (see Figure 4Ab) near the beginning and end of the stimulus presentation. The structure of the transformed interspike intervals, $\nu_n^i$ (see Figure 4Ac), is different from the one obtained from the surrogates (see Figure 4Bc). That difference will be quantified next.

3.2.4 Test Statistics for Renewal Processes. For a temporally modulated renewal process, all the time dependence can be removed by making the transformation $t \to s$. However, if there was nonrenewal structure present in the original spike trains, there could still be a time dependence. Here, we focus on the time dependence of interspike intervals. The time axis was divided in $N_\nu$ discrete bins of width $\Delta s$. The mean $\nu(m)$ was calculated of
Figure 4: Spike trains of cortical neurons did not form a renewal process. Time was rescaled according to $t \to s = g(t)$, such that the spike-time probability distribution (the spike-time histogram) is independent of time s (see text). In A, rescaled spike trains from Figure 2, and in B, renewal spike trains obtained by randomly shuffling the interspike intervals from A across different trials are shown. In each row are plotted (a) the rastergram of rescaled spike times $s_n^i = g(t_n^i)$ ($t_n^i$ is the nth spike time in the ith trial), (b) the spike-time histogram of rescaled spike times, and (c) the rescaled interspike intervals, $\nu_n^i = s_n^i - s_{n-1}^i$, versus $s_{n-1}^i$. (C) The starting time of each interval was binned (bin width was 1.5 ms), and the average interval $\nu(s)$ was determined for each bin (solid line with filled circles). Confidence intervals, $\hat\nu \pm \hat\sigma$ (dashed lines), were determined for a renewal process with the same interval distribution using 20 random shufflings (one example is shown in B). The experimental spike trains were significantly different from renewal spike trains, $\chi^2 = 2.8$ for the bins between 700 and 1600 ms at significance $p < 10^{-11}$.
all $\nu_n^i$ with starting points $s_{n-1}^i$ that fell in the mth bin, $(m-1)\Delta s \le s_{n-1}^i < m\Delta s$. The same procedure was performed on all of the $N_s$ surrogate spike trains, $l = 1, \ldots, N_s$, yielding $\hat\nu_l(m)$. The mean and standard deviation were subsequently determined:

$$ \hat\nu(m) = \frac{1}{N_s}\sum_{l=1}^{N_s} \hat\nu_l(m), $$

$$ \hat\sigma(m) = \sqrt{\frac{1}{N_s - 1}\sum_{l=1}^{N_s} \left(\hat\nu_l(m) - \hat\nu(m)\right)^2}. $$

The original spike trains are nonrenewal when $\nu(m)$ lies outside the confidence interval given by $\hat\nu(m) - \hat\sigma(m)$ and $\hat\nu(m) + \hat\sigma(m)$ of the equivalent renewal process. The test statistic was

$$ \chi^2 = \frac{1}{N_\nu}\sum_{m=1}^{N_\nu} \left(\frac{\nu(m) - \hat\nu(m)}{\hat\sigma(m)}\right)^2. \tag{3.3} $$
The hypothesis that the spike train was generated by a renewal process was rejected when $p = 1 - C(N_\nu \chi^2, N_\nu)$ was smaller than a prescribed critical value. Here, $C(N_\nu \chi^2, N_\nu)$ is the cumulative $\chi^2$ probability distribution with $N_\nu$ degrees of freedom (Abramowitz & Stegun, 1974; Larsen & Marx, 1986; Press et al., 1992). In the following, the continuous notation $\nu(s)$ will be used instead of $\nu(m)$. The confidence intervals for the experimental recordings are shown together with $\nu(s)$ in Figure 4C. For the original process, the interval distribution depended on time, whereas for the renewal process, it did not depend on time. The $\chi^2$ statistic based on $N_s = 20$ surrogates was $\chi^2 = 2.8$ for 60 degrees of freedom. Hence, the difference was highly significant, $p < 10^{-11}$, and the discharge produced by the neuron was not a renewal process.

3.3 Attractor Reliability. The attractor reliability was calculated in three steps. First, events were found (see Figure 5). Second, the binary representation for each trial was determined (see Figure 7). Third, the entropy of the distribution of spike trains was calculated (see Figure 8). The procedure was applied to the experimental data from Figure 2 and surrogate spike trains.
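The three steps of the renewal test (rescale, shuffle intervals across trials, compare binned interval means via equation 3.3) can be sketched in Python. This is an illustrative reading of the procedure, not the authors' code: all names are ours, empty bins are simply masked out, and the interval shuffle is implemented as one global permutation of the pooled intervals.

```python
import numpy as np

def binned_interval_means(trials_s, T, n_bins):
    """Mean rescaled interval nu(m), binned by the start s_{n-1}."""
    sums, counts = np.zeros(n_bins), np.zeros(n_bins)
    for s in trials_s:
        s = np.sort(np.asarray(s, dtype=float))
        starts = np.concatenate(([0.0], s[:-1]))   # s_0 = 0
        intervals = s - starts                      # nu_n = s_n - s_{n-1}
        idx = np.minimum((starts / T * n_bins).astype(int), n_bins - 1)
        np.add.at(sums, idx, intervals)
        np.add.at(counts, idx, 1)
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

def renewal_chi2(trials_s, T, n_bins=10, n_surr=20, seed=0):
    """Chi-square statistic of eq. 3.3 against shuffled surrogates."""
    rng = np.random.default_rng(seed)
    nu = binned_interval_means(trials_s, T, n_bins)
    # Pool all intervals, then rebuild surrogate trials from a
    # random permutation of the pool (same interval count per trial).
    intervals = [np.diff(np.concatenate(([0.0], np.sort(s))))
                 for s in trials_s]
    cuts = np.cumsum([len(v) for v in intervals])[:-1]
    pool = np.concatenate(intervals)
    surr = np.array([
        binned_interval_means(
            [np.cumsum(p) for p in np.split(rng.permutation(pool), cuts)],
            T, n_bins)
        for _ in range(n_surr)])
    mean = np.nanmean(surr, axis=0)
    sd = np.nanstd(surr, axis=0, ddof=1)
    ok = np.isfinite(nu) & np.isfinite(mean) & (sd > 0)
    return np.mean(((nu[ok] - mean[ok]) / sd[ok]) ** 2)
```

For genuinely renewal data the statistic should hover near 1; nonrenewal structure, as in the recordings here, drives it well above that.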
3.3.1 Determination of Events. An event was defined as a spike time that occurred across multiple trials. The following algorithm can be used to determine the spike times that are part of a given event. All spike times were combined in one set and ordered from low to high values, yielding $\{t_1, \ldots, t_k, \ldots, t_M\}$. Here, k denotes the index of the spike time in the ordered sequence. The spike time versus k curve had a steplike structure
Figure 5: Procedure for determining events in the rastergram. Data from the cortical neuron in Figure 2. (A) All the spike times were combined into one set and were ordered increasing from left to right. (B) The first difference of the sorted spike times was thresholded (threshold was 40 ms, dashed line). Spike times between two consecutive upward threshold crossings constituted an event. (C) The standard deviation of the spike times in an event and (D) the number of spikes in an event plotted as a function of event index.
(see Figure 5A). A step was formed by spike times with similar values and corresponded to an event. Between events and steps, spike times changed quickly. The first difference of the ordered spike times, $\Delta t_k = t_{k+1} - t_k$, was small within an event (roughly, the jitter divided by the number of trials) and large between different events, when k was part of one event and k + 1 was part of the other event. Hence, the time series $\Delta t_k$ consisted of large values separated by many small values (see Figure 5B). The set of k values, $k_1, \ldots, k_E$, where $\Delta t_{k-1}$ crossed a threshold from below was determined. E is the number of events that were detected. For $e = 1, \ldots, E$, event e consisted of $\{t_{k_{e-1}}, \ldots, t_{k_e - 1}\}$ with the definition $k_0 = 1$. The number of spikes in an event was $N_e = k_e - k_{e-1}$. The spike-time jitter (called "precision" in Mainen & Sejnowski, 1995) was the standard deviation of all the spike times in a given event e,

$$ \sigma_e = \sqrt{\frac{1}{N_e} \sum_{j=k_{e-1}}^{k_e - 1} t_j^2 \;-\; \bar t_e^{\,2}}, \tag{3.4} $$

$$ \bar t_e = \frac{1}{N_e} \sum_{j=k_{e-1}}^{k_e - 1} t_j. $$
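The event-detection step (sort the pooled spikes, threshold the first differences, cut at the upward crossings) is a few lines of Python. This is a sketch with hypothetical names; the 40 ms threshold below mirrors the value used for the experimental data.

```python
import numpy as np

def find_events(spike_times_per_trial, threshold=40.0):
    """Split pooled, sorted spike times into events (cf. Figure 5).

    A cut is placed wherever the first difference of the sorted
    spike times exceeds `threshold`. Returns one array per event.
    """
    t = np.sort(np.concatenate(spike_times_per_trial))
    dt = np.diff(t)                       # Delta t_k = t_{k+1} - t_k
    cuts = np.flatnonzero(dt > threshold) + 1
    return np.split(t, cuts)
```

Each returned array gives one event; its size is $N_e$ and its standard deviation is the jitter $\sigma_e$ of equation 3.4 (e.g., `event.std()`).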
The threshold should be chosen such that the number of spikes per event is large (but smaller than the number of trials) and the spike-time jitter is small. For the experimental data, we used a threshold of 40 ms, and 17 events were obtained. The spike-time jitter was approximately 5 ms for all events except the first (see Figure 5C). The number of spikes in an event decreased from a maximum of 220 (all trials) at the second event to approximately 110 (half the number of trials) in the last event (see Figure 5D). The STH-based reliability was

$$ R_{STH} = \frac{1}{E N_{tr}} \sum_{e=1}^{E} N_e. \tag{3.5} $$

Here, $N_{tr}$ was the number of trials and E was the number of events, as before. For the experimental data, $R_{STH} \approx 0.64$.

3.3.2 Determination of the Binary Representation. Each trial i was then characterized by a binary value,

$$ X^i = \sum_{e=1}^{E} n_e^i \, 2^{E-e}. \tag{3.6} $$

Here, $n_e^i = 1$ when there was a spike during event e on the ith trial and $n_e^i = 0$ otherwise. Binary numbers $X_{bL}^i$ were also associated with subsets of the spike trains, for the events $e = b, \ldots, b + L - 1$ of trial i:

$$ X_{bL}^i = \sum_{j=1}^{L} n_{b+j-1}^i \, 2^{L-j}. \tag{3.7} $$
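Equations 3.6 and 3.7 reduce to a weighted sum over an occupancy matrix. A minimal Python sketch (the matrix layout and names are our assumptions):

```python
import numpy as np

def binary_words(occupancy, b=1, L=None):
    """Binary representations X^i_{bL} of each trial (eqs. 3.6-3.7).

    occupancy : (N_tr, E) array of 0/1; n^i_e = 1 if trial i had a
                spike during event e.
    b, L      : first event (1-based) and word length; by default
                the whole trial, so X^i = X^i_{1E}.
    """
    n = np.asarray(occupancy)
    E = n.shape[1]
    if L is None:
        L = E - b + 1
    sub = n[:, b - 1 : b - 1 + L]
    weights = 2 ** np.arange(L - 1, -1, -1)   # 2^{L-j}, j = 1..L
    return sub @ weights
```

For example, a trial that spiked in events 1 and 3 of E = 3 gets the word 101₂ = 5.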
Here, j was a dummy index, and L was the word length. It follows that $X^i = X^i_{1E}$.

3.3.3 Surrogate Spike Trains. Surrogate spike trains were constructed to compare the experimental spike trains with the equivalent Poisson process. Spike times could be randomly shuffled across trials, as before. However, in that case, there could be more than one spike time during an event on a given trial. This resulted in an ambiguous binary representation that could be resolved by, for instance, using ternary numbers. However, a different Poisson-like surrogate was constructed instead by randomly permuting spike times across trials for each event separately. Let $\{t_{n_i}^i\}$ be all the spike times during event e, and denote the absence of a spike during a given trial i by $-$ ($n_i$ is the index of the spike in the ith trial that is part of event e). Then the spike times during an event could be, for instance, $\{t_{n_1}^1, -, -, t_{n_4}^4\}$ (see Figure 6A). A surrogate obtained by a random permutation would be, for instance, $\{t_{n_4}^4, t_{n_1}^1, -, -\}$ (see Figure 6B). Consider the case that the original
Figure 6: Procedure to generate Poisson-like surrogate spike trains. (A) Original set of spike trains: on trials 1 and 4, attractor 1 (At1) and on trials 2 and 3, attractor 2 (At2) was reached. (B) The spike times in each event e were randomly shuffled across trials. The resulting spike trains no longer resembled the attractors.
spike trains had deterministic structure and were either one of two attractor spike trains (see Figure 6A). The above procedure then breaks up the attractor spike trains and removes the non-Poisson structure (see Figure 6B). Hence, by comparing, for instance, the binary representations of the original and surrogate spike trains, the amount of non-Poisson structure can be assessed.

3.3.4 Binary Representation of Experimental Spike Trains. A transient was discarded, and binary representations were calculated based on 10 events starting from the eighth spike, yielding $X^i_{8,10}$ (see Figure 7Aa). The trials were sorted according to the binary representation starting from the lowest value. There were two plateaus in the $X_{8,10}$ versus index graph (indicated by * in Figure 7Aa). Each of these $X_{8,10}$ values was obtained on 6% of the trials (* in Figure 7Ab). These spike sequences corresponded to the two 1:2 mode-locking attractors: the neuron produced spikes on either only the odd cycles or only the even cycles. On a larger number of trials, neurons were on these attractors for a shorter duration, which led to a triangle-like structure in the rastergram (see Figure 7Ac). The same analysis was performed on surrogate spike trains, and the results are shown in Figure 7B. There were no plateaus in the $X_{8,10}$ versus index graph (see Figure 7Ba). Each $X_{8,10}$ occurred with approximately equal
Figure 7: Cortical neuron spike trains had a deterministic structure not present in Poisson-like spike trains. Each trial was assigned a binary representation as described in the text. In A, the original data from Figure 2, and in B, spike trains obtained by randomly shuffling spike times of each event across trials were used. In each row, (a) binary representation $X_{8,10}$, (b) distribution of $X_{8,10}$ values across 220 trials, and (c) rastergram with trials ordered according to the value of $X_{8,10}$, increasing from bottom to top. The stars (*) and the arrows (←) are described in the text.
probability (see Figure 7Bb). The length of time that a neuron spent on an attractor on a given trial was reduced compared to the original spike trains. In particular, the triangle-like structure in the ordered rastergram was not present (see Figure 7Bc). The difference between the two sets of spike trains is quantified next using the entropy.

3.3.5 Entropy of Spike Trains. Let $P_{bL}(X_{bL})$ be the probability of obtaining a trial with binary representation $X_{bL}$. It was estimated using a finite number of trials by determining all the distinct words $X_{bL}$ and counting how often each of them occurred across trials. The count was then normalized by the number of trials to obtain a probability. An example was shown in
Figure 7Ab. The entropy of this distribution was

$$ S_{bL} = -\sum_{X_{bL}} P(X_{bL}) \log_2 P(X_{bL}). \tag{3.8} $$

The entropy was then averaged over all allowed b values,

$$ S_L = \frac{1}{E - L + 1} \sum_{b=1}^{E-L+1} S_{bL}. \tag{3.9} $$

The entropy of the distribution of the binary representation for the whole trial length was $S = S_{1E}$,

$$ S = -\sum_{X} P(X) \log_2 P(X). \tag{3.10} $$

The attractor reliability was defined as

$$ R_a = 2^{-S}. \tag{3.11} $$
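Equations 3.10 and 3.11 amount to the entropy of the empirical word distribution. A short Python sketch (names are illustrative):

```python
from collections import Counter
import numpy as np

def attractor_reliability(words):
    """Entropy S of the word distribution (eq. 3.10) and
    attractor reliability R_a = 2^{-S} (eq. 3.11)."""
    counts = np.array(list(Counter(words).values()), dtype=float)
    p = counts / counts.sum()
    S = -np.sum(p * np.log2(p))
    return 2.0 ** (-S), S
```

If every trial lands on the same attractor, S = 0 and $R_a = 1$; with K equally likely distinct spike trains, $R_a = 1/K$, matching the interpretation in the text.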
The attractor reliability can be interpreted as the inverse of the number of different $X^i$ (this would be exact if each X value occurred with equal probability).

3.3.6 Entropy of Surrogate Spike Trains. The entropy of the surrogates was estimated as the mean of S and $S_L$ over independent surrogates. An analytical expression for the entropy of the surrogate spike trains was also calculated. Let the probability of obtaining a spike during event e on a given trial be $p_e$. The probability for a trial X with event occupation numbers $\{n_e\} = \{n_1, \ldots, n_E\}$ was

$$ P(X) = \prod_{e=1}^{E} p_e^{n_e} (1 - p_e)^{1 - n_e}, \tag{3.12} $$

and the entropy of this distribution was

$$ S = -\sum_{\{n_e\}} P(X) \log_2 P(X) = -\sum_{e=1}^{E} \left[ p_e \log_2 p_e + (1 - p_e) \log_2 (1 - p_e) \right]. \tag{3.13} $$

Here, $\sum_{\{n_e\}} = \sum_{n_1=0}^{1} \cdots \sum_{n_E=0}^{1}$ is the sum over all possible combinations of event occupation numbers. The final result followed since the entropy of a
Figure 8: The binary representations of the experimental spike trains were less variable than those of Poisson-like processes. Binary representations with word length L were determined for each trial, and the entropy was calculated as described in the text, starting from (A) the first spike time and (B) the eighth spike time. The entropy of experimental spike trains (dashed line), the entropy of surrogate spike trains averaged over 10 random shufflings (filled circles), their difference (dot-dashed line), and the analytical result for the entropy of surrogate spike trains (solid line) are shown. The standard deviation of the surrogate spike-train entropy was of the order of the circle size.
product of probability density functions is the sum of the entropies of the individual distributions. The analytical entropy of the surrogate spike trains increased linearly with E. However, for $N_{tr}$ trials, the maximum entropy was $\log_2 N_{tr}$; hence, the analytical limit may not be reached if too few trials are available for analysis. The entropy of the experimental spike trains was determined as a function of the word length L (see Figure 8). The entropy of the surrogate spike trains was calculated in two ways: first by determining the mean entropy of 10 surrogate spike trains and then analytically by using equation 3.13. The probability $p_e$ of a spike during event e was estimated as the number of spikes during that event in the original data divided by the number of trials. The entropy of the original spike trains was always lower than that of the surrogate spike trains. This implies that there was additional structure present in the original spike trains that was not present in surrogate spike trains. The entropy of surrogate spike trains started to differ from the analytical result at L = 8 (see Figure 8A). This indicated that the number of trials was not large enough to sample the probability distribution of the binary representations adequately. Initially, the difference between the entropy of the original and surrogate spike trains increased with L, but when the sampling effects occurred, it decreased. As a result, the difference had a maximum that was a sampling artifact (dot-dashed line in Figure 8A). Similar results were obtained for spike trains starting from the eighth event (see Figure 8B).
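The analytical surrogate entropy of equation 3.13 is a sum of binary entropies, one per event. A minimal sketch in Python (the 0·log₂0 := 0 convention is made explicit; names are ours):

```python
import numpy as np

def surrogate_entropy(p_e):
    """Analytical entropy of the Poisson-like surrogate (eq. 3.13).

    p_e : per-event spike probabilities, e.g. the number of spikes in
          event e divided by the number of trials N_tr.
    """
    p = np.asarray(p_e, dtype=float)
    h = np.zeros_like(p)
    # Binary entropy per event; p = 0 or 1 contributes zero bits.
    m = (p > 0) & (p < 1)
    h[m] = -(p[m] * np.log2(p[m]) + (1 - p[m]) * np.log2(1 - p[m]))
    return h.sum()
```

Each event contributes at most one bit (at $p_e = 0.5$), which is why the analytical entropy grows linearly with the number of events E.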
3.4 Simulated Spike Trains. The statistical test for renewal behavior and the procedure to determine the attractor reliability were also applied to simulated spike trains. As an example, the spike trains of a noisy integrate-and-fire neuron that was 1:2 entrained to a sinusoidal drive were examined (see Figure 9A). The entropy of the original spike trains was lower than that of the surrogate spike trains (see Figure 9B). The spike trains were nonrenewal; the probability of obtaining the same $\nu(s)$ (see Figure 9C) from a renewal spike train was zero. The attractor reliability was $R_a = 2^{-3.74} \approx 0.075$.

4 Discussion
Other statistical tests to determine whether neural spike trains form a renewal process have been proposed previously, such as the power ratio (Reich et al., 1998). The critical value of the power ratio depended on the interval distribution of the (rescaled) spike train. The test introduced here can be applied to all interval distributions, and its significance value was determined using the $\chi^2$ distribution. Oram, Wiener, Lestienne, and Richmond (1999) proposed a procedure to determine whether certain patterns of spikes in multiple-unit recordings were present above chance levels (see also Abeles & Gat, 2001). The attractor reliability introduced here is related to this procedure since it assesses whether certain patterns, the attractors, occur more often than expected for Poisson processes with the same spike-time histogram. The number of distinct spike trains was determined using a simple procedure. This procedure succeeds if (1) the spike times within an event are sufficiently precise and (2) the distance between different events is larger than the spike-time jitter within an event. The latter condition is not always satisfied; for instance, in Poisson spike trains, spikes can occur with arbitrarily small interspike intervals. In real spike trains, there is always a minimum interspike interval equal to the refractory period. When either of the two conditions is not satisfied, a different method is required to separate spike trains into groups. Two alternatives are the k-means method for clustering (Gershenfeld, 1999) or an algorithm based on spike metrics such as in Victor and Purpura (1996). To calculate the entropy of spike trains and compare it to equivalent renewal processes, a sufficient number of trials was needed. We found 40 to 1000 trials to be sufficient for most analyses.
Recent experimental studies show that the amplitude of a postsynaptic conductance in response to a presynaptic action potential depends on the previous presynaptic spike times (Markram & Tsodyks, 1996; Abbott, Varela, Sen, & Nelson, 1997; Markram, Wang, & Tsodyks, 1998). As a result, synapses are sensitive to temporal correlations in input spike trains (Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Eguia, Rabinovich, & Abarbanel, 2000; Tiesinga, 2001). For instance, a Poisson spike train and a periodic spike train with the same average rate will yield different postsynaptic amplitudes. When there are more distinct spike trains
Figure 9: Spike trains obtained from a noisy integrate-and-fire neuron that was 1:2 entrained to a sinusoidal drive. The spike trains had deterministic structure and did not form a renewal process. (A) The rastergrams (a) in original order and (b) ordered on binary representation. (B) Spike-train entropy; curves were annotated as in Figure 8A. (C) Average rescaled interspike interval as a function of time with confidence intervals (notation as in Figure 4C). Model parameters were $I = 1.0$, $A = 0.17$, $D = 10^{-4}$, $T = 2$. To facilitate comparison to experimental data, time was rescaled by a factor of 40 ms.
across trials, the synaptic drive gets more variable. The attractor reliability is a measure for this type of synaptic variability. If the task of a neuron is to transmit information about its input into its output spike train, then this variability would be considered noise since it is not related to the input. The impact on a postsynaptic neuron of the unreliability-induced synaptic variability is not clear. Cortical neurons receive a large number of synaptic inputs from different cells (reviewed in Shadlen & Newsome, 1998), and synapses themselves are unreliable (Bekkers, Richerson, & Stevens, 1990; Allen & Stevens, 1994; Zador, 1998). This issue remains for further study. A more immediate issue is how reliability depends on the characteristics of the driving input and the intrinsic neuronal dynamics. In previous experimental and theoretical studies (Mainen & Sejnowski, 1995; Nowak, Sanchez-Vives, & McCormick, 1997; Tang, Bartels, & Sejnowski, 1997; Hunter, Milton, Thomas, & Cowan, 1998; Warzecha, Kretzberg, & Egelhaaf, 1998, 2000; Cecchi et al., 2000; Kretzberg, Egelhaaf, & Warzecha, 2001; Fellous et al., 2001), it was shown, using a different reliability measure, that neurons fire unreliably in response to constant depolarizing current but fire reliably when driven by inputs containing high-frequency components. The reliability measure introduced here forms part of a theoretical framework that allows for the systematic study of neuronal reliability. Elsewhere, we will investigate how the attractor reliability depends on the type and number of attractors and their bifurcation structure.

Acknowledgments
We thank Jack Cowan, Jorge José, Bruce Knight, Susanne Schreiber, and Peter Thomas for discussions and suggestions and Greg Horwitz, Arnaud Delorme, and the anonymous referees for comments that helped improve the presentation of the article. Some of the numerical calculations were performed at the High Performance Computer Center at Northeastern University. This work was partially funded by the Sloan-Swartz Center for Theoretical Neurobiology (P.T.) and the Howard Hughes Medical Institute (J.M.F., T.J.S.).

References

Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Abeles, M., & Gat, I. (2001). Detecting precise firing sequences in experimental data. J. Neurosci. Methods, 107, 141–154.
Abramowitz, M., & Stegun, I. (1974). Handbook of mathematical functions. New York: Dover.
Allen, C., & Stevens, C. (1994). An evaluation of causes for unreliability of synaptic transmission. Proc. Natl. Acad. Sci., 91, 10380–10383.
Bekkers, J., Richerson, G., & Stevens, C. (1990). Origin of variability in quantal size in cultured hippocampal neurons and hippocampal slices. Proc. Natl. Acad. Sci., 87, 5359–5362.
Brenner, N., Strong, S., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy in a neural code. Neural Comput., 12, 1531–1552.
Cecchi, G., Sigman, M., Alonso, J., Martinez, L., Chialvo, D., & Magnasco, M. (2000). Noise in neurons is message dependent. Proc. Natl. Acad. Sci., 97, 5557–5561.
Eguia, M., Rabinovich, M., & Abarbanel, H. (2000). Information transmission and recovery in neural communications channels. Phys. Rev. E, 62, 7111–7122.
Fellous, J.-M., Houweling, A., Modi, R., Rao, R., Tiesinga, P., & Sejnowski, T. (2001). The frequency dependence of spike timing reliability in cortical pyramidal cells and interneurons. J. Neurophys., 85, 1782–1787.
Gershenfeld, N. (1999). The nature of mathematical modeling. Cambridge: Cambridge University Press.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10, 467–483.
Hunter, J., Milton, J., Thomas, P., & Cowan, J. (1998). Resonance effect for neural spike time reliability. J. Neurophysiol., 80, 1427–1438.
Jensen, R. (1998). Synchronization of randomly driven nonlinear oscillators. Phys. Rev. E, 58, 6907–6910.
Kretzberg, J., Egelhaaf, M., & Warzecha, A. (2001). Membrane potential fluctuations determine the precision of spike timing and synchronous activity: A model study. J. Comput. Neurosci., 10, 79–97.
Larsen, R., & Marx, M. (1986). An introduction to mathematical statistics and its applications. Englewood Cliffs, NJ: Prentice Hall.
Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., Wang, Y., & Tsodyks, M. (1998).
Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci., 95, 5323–5328.
Nowak, L., Sanchez-Vives, M., & McCormick, D. (1997). Influence of low and high frequency inputs on spike timing in visual cortical neurons. Cereb. Cortex, 7, 487–501.
Oram, M., Wiener, M., Lestienne, R., & Richmond, B. (1999). Stochastic nature of precisely timed spike patterns in visual system neuronal responses. J. Neurophysiol., 81, 3021–3033.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes. Cambridge: Cambridge University Press.
Reich, D., Victor, J., & Knight, B. (1998). The power ratio and the interval map: Spiking models and extracellular recordings. J. Neurosci., 18, 10090–10104.
Shadlen, M., & Newsome, W. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Strogatz, S. (1994). Nonlinear dynamics and chaos. Reading, MA: Addison-Wesley.
Tang, A., Bartels, A., & Sejnowski, T. (1997). Effects of cholinergic modulation on responses of neocortical neurons to fluctuating input. Cereb. Cortex, 7, 502–509.
Tiesinga, P. (2001). Information transmission and recovery in neural communications channels revisited. Phys. Rev. E, 64, 012901: 1–4.
Victor, J., & Purpura, K. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. J. Neurophysiol., 76, 1310–1326.
Warzecha, A., Kretzberg, J., & Egelhaaf, M. (1998). Temporal precision of the encoding of motion information by visual interneurons. Curr. Biol., 8, 359–368.
Warzecha, A., Kretzberg, J., & Egelhaaf, M. (2000). Reliability of a fly motion-sensitive neuron depends on stimulus parameters. J. Neurosci., 20, 8886–8896.
Zador, A. (1998). Impact of synaptic unreliability on the information transmitted by spiking neurons. J. Neurophys., 79, 1219–1229.

Received July 25, 2001; accepted January 4, 2002.
LETTER
Communicated by Paul Bressloff
Traveling Waves of Excitation in Neural Field Models: Equivalence of Rate Descriptions and Integrate-and-Fire Dynamics

Daniel Cremers
[email protected]
Department of Mathematics and Computer Science, University of Mannheim, 68131 Mannheim, Germany

Andreas V. M. Herz
[email protected]
Innovationskolleg Theoretische Biologie, Humboldt-Universität zu Berlin, 10115 Berlin, Germany

Field models provide an elegant mathematical framework to analyze large-scale patterns of neural activity. On the microscopic level, these models are usually based on either a firing-rate picture or integrate-and-fire dynamics. This article shows that in spite of the large conceptual differences between the two types of dynamics, both generate closely related plane-wave solutions. Furthermore, for a large group of models, estimates about the network connectivity derived from the speed of these plane waves only marginally depend on the assumed class of microscopic dynamics. We derive quantitative results about this phenomenon and discuss consequences for the interpretation of experimental data.

1 Introduction
Traveling waves of synchronized neural excitation occur in various brain regions and play an important role both during development and for information processing in the adult. Such waves have been observed in various systems, including the retina (Meister, Wong, Denis, & Shatz, 1991), olfactory bulb (Gelperin, 1999), visual cortex (Bringuier, Chavane, Glaeser, & Fregnac, 1999), and motor cortex (Georgopoulos, Kettner, & Schwartz, 1988). In vitro, excitation waves have been successfully induced to probe single-neuron dynamics as well as large-scale connectivity patterns (Traub, Jefferys, & Miles, 1993; Wadman & Gutnick, 1993; Golomb & Amitai, 1997). The omnipresence of traveling waves in neural tissues is reflected by a wealth of models, ranging from detailed biophysical approaches (Traub et al., 1993; Destexhe, Bal, McCormick, & Sejnowski, 1996; Golomb & Amitai, 1997; Rinzel, Terman, Wang, & Ermentrout, 1998) to highly simplified mathematical formulations (Beurle, 1956; von Seelen, 1968; Wilson & Cowan,

© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1651–1667 (2002)
1652
Daniel Cremers and Andreas V. M. Herz
1973; Amari, 1977; Ermentrout & Cowan, 1979; Idiart & Abbott, 1993; Ben-Yishai, Hansel, & Sompolinsky, 1997; Horn & Opher, 1997; Ermentrout, 1998; Kistler, Seitz, & van Hemmen, 1998; Bressloff, 1999; Golomb & Ermentrout, 1999; Bressloff, 2000; Kistler, 2000; Golomb & Ermentrout, 2001). Models of the latter type have to sacrifice biological realism to a certain extent, but they often admit a quantitative analysis of the relation between the accessible macroscopic wave phenomena and otherwise hidden aspects of the neural dynamics and connectivity patterns. In some mathematical models, for example, the speed of the emergent traveling wave can be expressed in closed form in terms of the parameters describing single-neuron behavior and network circuitry (Idiart & Abbott, 1993; Ben-Yishai et al., 1997; Horn & Opher, 1997; Ermentrout, 1998; Kistler et al., 1998; Bressloff, 1999, 2000; Golomb & Ermentrout, 1999, 2001; Kistler, 2000). Given one of these models, results from large-scale neurophysiological measurements may be directly interpreted in terms of microscopic neural parameters. Most of the mathematical models describing spatiotemporal activity patterns are formulated as field models in continuous space. Within these models, two broad classes can be distinguished: firing-rate models and models with spiking neurons. The former, more traditional, field models are based on a locally averaged neuronal firing rate. The latter, more recent, class of field models incorporates the existence of action potentials. Here, the local dynamics involve the generation of action potentials, for example, through an integrate-and-fire mechanism. A traveling wave in a firing-rate model typically corresponds to the spread of a region with a high firing rate (Wilson & Cowan, 1973; Amari, 1977; Idiart & Abbott, 1993).
As depicted in Figure 1, this is in sharp contrast to the simplest traveling wave in a system with spiking neurons, where both before and directly behind the wave front, activity is low, so that the resulting neural activation pattern is highly localized in space and time. Depending on the refractory period, synaptic timescales, and heterogeneity of intrinsic and coupling properties, the activation pattern in a model with spiking neurons may also exhibit complicated spatiotemporal structures, including multiple re-excitations behind the wave front. Numerical simulations show, however, that these phenomena have almost no effect on the speed of the wave front (Ermentrout, 1998). Independent of the precise structure of traveling waves in firing-rate models and systems with spiking neurons, there is no doubt that the two types of wave correspond to entirely different dynamical scenarios. Are these two scenarios nevertheless related to each other on a mathematical level, even if they describe two highly distinct biological situations? If so, can results obtained within one framework be applied to phenomena observed in the other setting? This article addresses these questions and shows that, surprisingly, there is a close connection between the two model classes. In particular, it is demonstrated that for a large group of commonly used systems, there exists
Traveling Waves of Excitation
Figure 1: Schematic space-time plots of the simplest traveling wave in a one-dimensional firing-rate model (left) and an integrate-and-fire model (right). The spatial activity at two different times is highlighted by thick lines. In the firing-rate model, the wave front connects a region with a low firing rate with a region with a high firing rate. In the integrate-and-fire model, on the other hand, activity is low both before and directly behind the wave front. Re-excitation may lead to more complicated activity profiles but has almost no effect on the speed of the wave front. This justifies the study of simplified integrate-and-fire models with a single narrow activity peak as depicted on the right side.
a one-to-one mapping between the two classes. This mapping involves a nontrivial geometric transformation of the neural connectivity pattern. As a consequence, biological interpretations of experimental measurements in terms of the underlying neural circuitry may depend strongly on the assumed class of model dynamics. We derive quantitative results about this effect, discuss implications for data analysis, and close with some remarks on extensions and limitations of this approach.

2 Field Models Based on Mean-Firing-Rate Descriptions
In traditional field models such as those proposed by Wilson and Cowan (1973) or Amari (1977), neural activity is treated as a phenomenological variable u(x, t) that represents the local short-time averaged membrane potential. In this continuum approximation, the time evolution of neural activity is described by a partial differential equation, for example, the frequently used prototype

$$\tau \frac{\partial u(x,t)}{\partial t} = -u(x,t) + \int_{-\infty}^{t} dt'\, a(t-t') \int dy\, J(x-y)\, g\!\left[u\!\left(y,\; t'-\frac{|x-y|}{c}\right)\right]. \qquad (2.1)$$
It describes the dynamics of one homogeneous population of neurons. The parameter τ denotes an effective membrane time constant, and J(z) models the distance dependence of synaptic coupling strengths, often in the
form of a Mexican hat type of interaction with strong short-range excitation surrounded by weak inhibition. A prominent example for the functional form of such distance-dependent coupling strengths is the difference of two gaussians. The function g(u) characterizes the nonlinear dependence of the firing rate on the mean local membrane potential. Such activation functions are usually modeled by sigmoid functions, that is, monotone increasing and s-shaped functions that approach zero for u → −∞ and saturate for u → +∞. Note that within this class of model, firing rates and mean membrane potentials are treated on the same footing, simply related by the static nonlinearity g. The kernel a captures the dynamical effects of signal delays and (post)synaptic integration processes. For example, a(s) = δ(s − t_axon) describes a discrete uniform delay, and

$$a(s) = \tau_{\rm psp}^{-1}\, H(s)\, e^{-s/\tau_{\rm psp}} \qquad (2.2)$$

or

$$a(s) = s\,\tau_{\rm psp}^{-2}\, H(s)\, e^{-s/\tau_{\rm psp}} \qquad (2.3)$$
mimic two commonly used time courses of postsynaptic potentials. Distance-dependent axonal propagation delays are taken care of by the term |x − y|/c in the argument of u in equation 2.1, where c denotes the signal propagation velocity. In the mathematical analysis, we will neglect the effects of finite c and set 1/c = 0 for simplicity. From a neurobiological point of view, this is justified if characteristic velocities of the large-scale activity patterns are small compared to the axonal signal velocity. All our results do, however, generalize to finite propagation speed. In what follows, we assume that, on average, excitatory synaptic interactions are stronger than inhibitory interactions. This means that the temporal and spatial kernels can be taken to be normalized to plus one,

$$\int_{0}^{\infty} ds\, a(s) = 1 \qquad (2.4)$$

and

$$\int_{-\infty}^{\infty} dz\, J(z) = 1. \qquad (2.5)$$
t ¡1
Z dt0 2 ( t ¡ t0 )
dy J ( x ¡ y ) g[u ( y, t0 ) ]
(2.6)
with

$$\varepsilon(s) = \tau^{-1} \int_{0}^{s} ds'\, e^{-(s-s')/\tau}\, a(s'). \qquad (2.7)$$

As a consequence of equation 2.4, the kernel ε is also normalized,

$$\int_{0}^{\infty} ds\, \varepsilon(s) = 1. \qquad (2.8)$$
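The convolution 2.7 can be checked numerically. In the sketch below (parameter values are our own choices), a(s) is the exponential kernel of equation 2.2; the resulting ε should match the closed form that reappears later as equation 6.3 and integrate to one as in equation 2.8.

```python
import numpy as np

# Numerical evaluation of equation 2.7 for the exponential kernel of
# equation 2.2; tau and tau_psp are hypothetical values for this sketch.
tau, tau_psp = 1.0, 0.4
ds = 1e-3
s = np.arange(0, 30, ds)
a = np.exp(-s / tau_psp) / tau_psp

# eps(t) = tau^{-1} int_0^t ds' exp(-(t-s')/tau) a(s'), via a cumulative sum
eps_num = (ds / tau) * np.exp(-s / tau) * np.cumsum(np.exp(s / tau) * a)
# closed form: eps(t) = (exp(-t/tau_psp) - exp(-t/tau)) / (tau_psp - tau)
eps_closed = (np.exp(-s / tau_psp) - np.exp(-s / tau)) / (tau_psp - tau)

max_err = np.max(np.abs(eps_num - eps_closed))
eps_norm = np.trapz(eps_num, s)   # equation 2.8
print(max_err, eps_norm)
```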
As will become apparent in the following sections, the integral formulation, equation 2.6, is most suitable for discussing wave phenomena.

3 Wave Fronts in One-Dimensional Firing-Rate Models
Because of the normalizations 2.5 and 2.8, all homogeneous and stationary solutions u(x, t) = ū of equation 2.6 satisfy a simple fixed-point equation,

$$\bar{u} = g(\bar{u}). \qquad (3.1)$$
Depending on the shape of g, equation 3.1 allows a single or multiple solutions. Of particular interest are sigmoid functions with three solutions u_r < u_u < u_e. Here, u_r corresponds to the stable rest state of a neuron (low firing rate), u_e to the stable excited state (high firing rate), and u_u to the intermediate unstable equilibrium. In this case of a bistable network, a traveling wave joining the rest state u_r and the excited state u_e can be triggered for appropriate initial conditions. For example, if lim_{x→−∞} u(x, 0) = u_e and lim_{x→∞} u(x, 0) = u_r, a wave of excitation may propagate in the positive x-direction through the system. If g vanishes for u less than a fixed threshold θ > 0,

$$g(u) = 0 \quad \text{for} \quad u < \theta, \qquad (3.2)$$
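For a concrete picture of the bistable case, the sketch below locates the three fixed points of u = g(u) for a sigmoid with a hard threshold as in equation 3.2. The nonlinearity and its parameters are our own illustrative choices, not ones used in the article.

```python
import math

# Fixed points of u = g(u) for a thresholded sigmoid (equation 3.2).
# theta and beta are hypothetical parameter choices for this sketch.
theta, beta = 0.2, 8.0

def g(u):
    if u < theta:
        return 0.0                 # hard threshold, equation 3.2
    return 1.0 / (1.0 + math.exp(-beta * (u - 0.5)))

def fixed_point(lo, hi, tol=1e-10):
    # bisection on f(u) = u - g(u); assumes a sign change on [lo, hi]
    f = lambda u: u - g(u)
    assert f(lo) * f(hi) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

u_r = 0.0                          # rest state: g vanishes below theta
u_u = fixed_point(0.45, 0.55)      # unstable intermediate equilibrium
u_e = fixed_point(0.55, 1.5)       # stable excited state
print(u_r, u_u, u_e)
```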
the rest state is given by u_r = 0, and finding traveling wave fronts of equation 2.6 simplifies significantly, as shown by Idiart and Abbott (1993). Consider again a wave propagating in the positive x-direction that reaches the point x₀ at time t₀, that is, u(x₀, t₀) = θ. Since the wave is approaching from the left, g[u(y, t')] = 0 for all y > x₀ as long as t' < t₀. Thus, although in general the activity u(y, t') itself is nonzero in front of the wave, due to the condition 3.2, this has no effect on the dynamics for u < θ. (See Figure 2.) Evaluated at x = x₀ and t = t₀, equation 2.6 therefore reads

$$\theta = u(x_0, t_0) = \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, J(x_0-y)\, g[u(y,t')]. \qquad (3.3)$$
Figure 2: Snapshot of the activity u(x, t) at time t = t₀ and of the corresponding output g[u(x, t₀)]. The output vanishes ahead of the wave front (x > x₀) because a threshold nonlinearity with g(u) = 0 for u < θ was chosen.
If one inserts the traveling-wave ansatz,

$$u(x,t) = \tilde{u}(t - x/v), \qquad (3.4)$$

into equation 3.3, influences of the structure of J and the shape of g on the propagation velocity v may readily be analyzed (Idiart & Abbott, 1993).

4 Field Models Based on Integrate-and-Fire Neurons
The models described above are based on a firing-rate description of neural activity and neglect the existence of action potentials. More recently, a different class of field models has been introduced that explicitly incorporates the spiking nature of neural activity (Ermentrout, 1998; Kistler et al., 1998; Bressloff, 1999, 2000; Golomb & Ermentrout, 1999, 2001; Kistler, 2000). On the single-cell level, the most salient features of spiking neurons are captured by integrate-and-fire models. Here, each neuron integrates synaptic inputs and generates a uniform action potential whenever the membrane potential crosses a fixed threshold from below. Simulations with networks of regularly spaced integrate-and-fire neurons reveal a large variety of macroscopic activity patterns such as plane waves, rotating spirals, and expanding rings (see, for example, Fohlmeister, Gerstner, Ritz, & van Hemmen, 1995; Ermentrout, 1998; Horn & Opher, 1997; Kistler et al., 1998).
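A minimal simulation in this spirit is sketched below: a discretized chain of integrate-and-fire units with gaussian couplings and exponential synapses, in which each neuron fires at most once so that only the bare wave front appears. The setup and all parameter values are our own illustrative choices, not a model taken from the article.

```python
import numpy as np

# Chain of integrate-and-fire neurons with gaussian couplings; each neuron
# fires at most once, so only the bare wave front is simulated.
# All parameters below are hypothetical choices for this sketch.
N, dx = 300, 0.05            # number of neurons and spacing on the line
tau, tau_psp = 1.0, 0.5      # membrane and synaptic time constants
theta = 0.12                 # firing threshold
dt, T = 0.01, 60.0           # Euler step and total simulated time
sigma, strength = 0.3, 3.0   # coupling width and synaptic gain

x = np.arange(N) * dx
C = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
C *= dx / (np.sqrt(2 * np.pi) * sigma)   # rows integrate to roughly one

V = np.zeros(N)
syn = np.zeros(N)                        # exponential synaptic traces
fired = np.zeros(N, dtype=bool)
t_spike = np.full(N, np.nan)
V[:10] = 2 * theta                       # kick the left edge to launch a wave

for step in range(int(T / dt)):
    new = (~fired) & (V >= theta)        # threshold crossings fire once
    fired[new] = True
    t_spike[new] = step * dt
    syn[new] += 1.0                      # delta input into the trace
    syn -= dt * syn / tau_psp            # exponential kernel, as in eq. 2.2
    V += dt * (-V + strength * (C @ syn)) / tau
    V[fired] = 0.0                       # no re-excitation behind the front

speed = (x[250] - x[50]) / (t_spike[250] - t_spike[50])
print(fired.all(), speed)
```

The firing times grow roughly linearly with distance from the stimulated edge, which is the spiking analogue of the front depicted in Figure 1 (right).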
The length of the refractory period influences the profile of traveling waves in systems with spiking neurons. Depending on the model parameters, shapes range from narrow soliton-like excitations, where each neuron fires only once, to complex activity profiles due to multiple re-excitations behind the wave front. However, extensive numerical investigations provide strong evidence that the speed of a traveling wave front is largely independent of subsequent spike activity (Ermentrout, 1998). This result justifies the study of models without re-excitation, such as one-dimensional systems of the type

$$\tau \frac{\partial V(x,t)}{\partial t} = -V(x,t) + \int_{-\infty}^{\infty} dy\, C(x-y)\, \tilde{a}[t - t^{*}(y)] \qquad (4.1)$$
where V(x, t) describes the local membrane potential, C(z) represents synaptic coupling strengths, and ã captures transmission phenomena as in section 2. The time t*(y) denotes the firing time of a neuron at location y, determined by the conditions V(y, t*) = θ and ∂V(y, t)/∂t > 0 for t = t* (Ermentrout, 1998). With the definition

$$\rho(s) = \tau^{-1} \int_{0}^{s} ds'\, e^{-(s-s')/\tau}\, \tilde{a}(s'), \qquad (4.2)$$
integration of equation 4.1 results in the integral equation

$$V(x,t) = \int_{-\infty}^{t} dt'\, \rho(t-t') \int_{-\infty}^{\infty} dy\, C(x-y)\, \delta[V(y,t')-\theta]\, \frac{\partial V(y,t')}{\partial t'}\, H\!\left(\frac{\partial V(y,t')}{\partial t'}\right), \qquad (4.3)$$
where H(·) denotes the Heaviside step function. A dynamical description that is closely related to equation 4.3 was obtained by Kistler (2000), who also provided a mathematical proof that the dynamics of a spatially discretized network approaches the dynamics of a neural field model if the characteristic length scale of neural interactions is much larger than the grid size. In this limit, the membrane potential V at location x and time t becomes

$$V(x,t) = \int_{-\infty}^{t} dt'\, \rho(t-t') \int_{-\infty}^{\infty} dy\, C(x-y)\, S(y,t') + \int_{-\infty}^{t} dt'\, \eta(t-t')\, S(x,t'), \qquad (4.4)$$
where spike activity at location y and time t' is now denoted by S(y, t'), and ρ and C model the temporal and spatial aspects of signal transmission and
integration similar to equation 2.1. Refractoriness following local spike activity results from hyperpolarization, whose time course is described by η(s). Spike activity is triggered whenever the local field crosses a threshold θ from below,

$$S(x,t) = \delta[V(x,t)-\theta]\, \frac{\partial V(x,t)}{\partial t}\, H\!\left(\frac{\partial V(x,t)}{\partial t}\right). \qquad (4.5)$$

According to the derivation given by Kistler, the variables V(x, t) and S(x, t) can be interpreted as interpolated versions of the corresponding variables V(x_i, t) and S(x_i, t) of the original spatially discretized network, where neuron i, with 1 ≤ i ≤ N, is located at site x_i. From a neurophysiological point of view, this implies that V(x, t) and S(x, t) mimic different components of the local field potential recorded at location x, with V(x, t) representing averaged postsynaptic potentials and S(x, t) describing the effects of spike activity. Equation 4.5 implies that elementary spike activity is described by a δ-function whose size, when integrated over time, is normalized to unity (Ermentrout, 1998; Bressloff, 1999, 2000; Golomb & Ermentrout, 1999; Kistler, 2000). This normalization reflects the unitary character of action potentials. In an alternative approach by Kistler et al. (1998), the factor |∂V(x, t)/∂t| is replaced by |∂V(x, t)/∂x| so that spike activity is normalized when integrated in the spatial domain. However, as pointed out by Bressloff (1999), and acknowledged in Kistler (2000), only the first description, equation 4.5, is biophysically justified. Focusing on the wave front of a traveling wave, we may neglect the possible occurrence of re-excitations and set the second term on the right-hand side of equation 4.4 to zero without loss of generality. Inserting equation 4.5 into 4.4 then yields 4.3 as desired.

5 Mapping the Firing-Rate Model onto the Integrate-and-Fire Model
Although at first glance equation 4.3 is somewhat reminiscent of 2.6, the two classes of neural field models are quite different from a conceptual point of view. The first assumes that the response properties of a neuron are fully described by its firing rate. As a consequence, the local field u(x, t) in equation 2.1 represents a postsynaptic membrane potential that has been averaged in both space and time. Within the integrate-and-fire field model, in contrast, no temporal average is carried out; the variable S(x, t) in equation 4.5 represents spike activity whose time evolution is given by an integrate-and-fire mechanism. Despite these differences, both types of model show similar macroscopic patterns such as traveling waves of excitation, but the details of these waves differ significantly, as illustrated in Figure 1.
Figure 3: Corresponding synaptic connectivities in firing-rate and integrate-and-fire models. Each of the two panels, A and B, shows one example of a J coupling (firing-rate model, left side) and a T coupling (integrate-and-fire model, right side). In the upper panel, gaussian couplings are assumed for the firing-rate model and lead to a strongly peaked coupling distribution for the integrate-and-fire model. In the lower panel, gaussian couplings are assumed for the integrate-and-fire model and imply a coupling distribution for the firing-rate model that is peaked away from zero.
What is the relation between the two approaches? To investigate this question, we will now map the dynamics of the firing-rate model onto the dynamics of the integrate-and-fire model. We focus on the case of a propagating wave front as described in section 3 and denote its propagation velocity by v. Furthermore, we assume symmetric J couplings, J(z) = J(−z), that are continuous and integrable. We introduce auxiliary couplings T(z) by

$$T(z) := \begin{cases} v^{-1} \displaystyle\int_{-\infty}^{z} dz'\, J(z'), & z < 0, \\[6pt] v^{-1} \displaystyle\int_{z}^{\infty} dz'\, J(z'), & z \ge 0. \end{cases} \qquad (5.1)$$

(See Figure 3.) Due to the assumptions for J(z), these new couplings are also symmetric, T(z) = T(−z), and continuous at z = 0 with

$$T(0) = \frac{1}{2v}. \qquad (5.2)$$
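The transformation 5.1 is easy to check numerically. The sketch below (with our own illustrative values for v and σ_J) builds T from a gaussian J by one-sided integration, verifies the symmetry and the value T(0) = 1/(2v) of equation 5.2, and recovers J from T by differentiation (the inverse relation 5.3).

```python
import numpy as np

# Numerical version of transformation 5.1 for a gaussian J(z);
# v and sigma_J are hypothetical values for this sketch.
v, sigma_J = 2.0, 1.0
dz = 1e-3
z = np.arange(-8, 8 + dz, dz)
Jz = np.exp(-z**2 / (2 * sigma_J**2)) / (np.sqrt(2 * np.pi) * sigma_J)

cum = np.cumsum(Jz) * dz                      # running integral of J
Tz = np.where(z < 0, cum, cum[-1] - cum) / v  # equation 5.1

i0 = np.argmin(np.abs(z))                     # grid point closest to z = 0
T0_err = abs(Tz[i0] - 1 / (2 * v))            # equation 5.2
sym_err = np.max(np.abs(Tz - Tz[::-1]))       # T(z) = T(-z)

# inverse relation 5.3 on the left half-line, away from the kink at z = 0
left = z < -2 * dz
rec_err = np.max(np.abs(v * np.gradient(Tz, dz)[left] - Jz[left]))
print(T0_err, sym_err, rec_err)
```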
In terms of the T couplings, the original J couplings are given by

$$J(z) = \begin{cases} +v\, \dfrac{d}{dz} T(z), & z < 0, \\[6pt] -v\, \dfrac{d}{dz} T(z), & z \ge 0. \end{cases} \qquad (5.3)$$
By definition, the T couplings vanish for large positive and negative argument. Note that apart from the special case of exponential couplings, J(z) ∝ exp(−|z|/σ), the couplings J(z) and T(z) do not have the same shape. Using integration by parts, we obtain for the second integral in equation 3.3,

$$\int_{-\infty}^{x_0} dy\, J(x_0-y)\, g[u(y,t')] = v \int_{-\infty}^{x_0} dy\, \left(\frac{\partial}{\partial y} T(x_0-y)\right) g[u(y,t')]$$
$$= v\, T(0)\, g[u(x_0,t')] - v \int_{-\infty}^{x_0} dy\, T(x_0-y)\, h[u(y,t')]\, \frac{\partial u(y,t')}{\partial y} \qquad (5.4)$$
with

$$h(u) := \frac{dg(u)}{du}. \qquad (5.5)$$
Because T(0) is finite and g[u(x₀, t')] vanishes for all t' < t₀, inserting equation 5.4 into equation 3.3 results in

$$\theta = u(x_0,t_0) = -v \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, T(x_0-y)\, h[u(y,t')]\, \frac{\partial u(y,t')}{\partial y}$$
$$= \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, T(x_0-y)\, h[u(y,t')]\, \frac{\partial u(y,t')}{\partial t'}, \qquad (5.6)$$
where we have employed the identity v·|∂u(y, t')/∂y| = |∂u(y, t')/∂t'| and the fact that the wave is traveling in the positive x-direction so that ∂u/∂y ≤ 0. Due to the particular definition of the T couplings (see equation 5.1), equation 5.6 also holds for waves traveling in the negative x-direction. In this case, ∂u/∂y ≥ 0 and J(x₀ − y) = −v (d/dy) T(x₀ − y) in the relevant region of space (y > x₀). The above calculations hold for any kind of threshold nonlinearity (see equation 3.2). For the commonly considered case where the nonlinearity g(u) is a step function (see, for example, Amari, 1977),

$$g(u) = H(u - \theta), \qquad (5.7)$$
the function h becomes a δ-function,

$$h(u) = \delta(u - \theta), \qquad (5.8)$$
and equation 5.6 reduces to

$$\theta = u(x_0,t_0) = \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, T(x_0-y)\, \delta[u(y,t')-\theta]\, \frac{\partial u(y,t')}{\partial t'}$$
$$= \int_{-\infty}^{t_0} dt'\, \varepsilon(t_0-t') \int_{-\infty}^{x_0} dy\, T(x_0-y)\, \delta[u(y,t')-\theta]\, \frac{\partial u(y,t')}{\partial t'}\, H\!\left(\frac{\partial u(y,t')}{\partial t'}\right). \qquad (5.9)$$
In the last step, we used the fact that ∂u/∂t > 0 at threshold crossing. The comparison of equation 5.9 with equation 4.3, evaluated for x = x₀ and t = t₀, reveals that equation 5.9 is identical with the threshold condition for a wave front in the integrate-and-fire field model if we identify u(x, t) = V(x, t), ε(t) = ρ(t), and C(z) = T(z). This finding implies that the wave front in a one-dimensional field model with step nonlinearity is equivalent to the wave front in a field model with integrate-and-fire neurons. Note, however, that the spatial couplings in both models are not the same but related via the transformation 5.1. More generally, even firing-rate models with smooth sigmoid nonlinearities g(u) can be mapped onto integrate-and-fire models, as shown by equation 5.6. In that case, the shape of an action potential in the associated integrate-and-fire model is no longer given by a δ-function as for step nonlinearities but rather by the expression

$$S(x,t) = \frac{dg[u(x,t)]}{du}\, \frac{\partial u(x,t)}{\partial t}. \qquad (5.10)$$
Depending on the shape of g, different action-potential shapes may thus be modeled.

6 From Macroscopic Wave Phenomena to Microscopic Dynamics
Let us investigate the interpretation of an electrophysiological experiment of the type performed by Traub et al. (1993), Wadman and Gutnick (1993), or Golomb and Amitai (1997), in which a wave velocity v has been measured in some neural substrate. What are the differences between the inferred system parameters depending on whether we base our calculations on a rate model, as in Idiart and Abbott (1993), or an integrate-and-fire mechanism
as in Ermentrout (1998)? This question is an extreme case of a situation often encountered in a model-based analysis of neurophysiological data: How do assumptions about the underlying dynamics influence the system parameters derived from experimental measurements? In the present case, we can answer this question analytically and gain further insight into the scope and limitations of deriving network characteristics from the properties of waves of neural excitation. To do so, we assume some functional form for the true synaptic couplings C(z). For concreteness, we take the couplings to be gaussians with standard deviation σ. Within the firing-rate model, we therefore set J(z) = C(z) and denote σ by σ_J. Within the integrate-and-fire model, we have to derive J(z) through equation 5.3 from gaussian couplings T(z) with σ = σ_T. The T couplings thus have the same shape as C(z) but are normalized according to equation 5.2 so that in both scenarios, the J couplings satisfy the normalization 2.5 to guarantee an unbiased comparison. Keeping all other parameters fixed, we then derive analytical expressions for σ_J and σ_T, compare their values, and may thus judge how strongly the inferred network parameters depend on the assumptions about the single-cell dynamics. We present three different examples of this approach. In the first example, we investigate the general model, equation 2.1, with finite signal propagation speed c < ∞. Inserting the traveling wave ansatz 3.4 into equation 3.3 and integrating by parts with respect to t', we can extend the approach of Idiart and Abbott (1993), who focused on the specific case a(s) = δ(s). We obtain for step nonlinearities, g(u) = H(u − θ), the following implicit equation for v,
$$\theta = \int_{-\infty}^{0} dz\, J(z) \int_{0}^{-\gamma z} \varepsilon(s)\, ds, \qquad (6.1)$$

with

$$\gamma = v^{-1} - c^{-1}, \qquad (6.2)$$
where the wave is, as before, traveling in the positive x-direction. Assuming that the shape a(s) of a postsynaptic potential is exponentially decaying (see equation 2.2), we get

$$\varepsilon(t) = \frac{1}{\tau_{\rm psp} - \tau}\left(e^{-t/\tau_{\rm psp}} - e^{-t/\tau}\right) H(t), \qquad (6.3)$$
and equation 6.1 becomes

$$\theta = \int_{-\infty}^{0} dz\, J(z) \left[1 - \frac{1}{\tau_{\rm psp} - \tau}\left(\tau_{\rm psp}\, e^{-\gamma |z| / \tau_{\rm psp}} - \tau\, e^{-\gamma |z| / \tau}\right)\right]. \qquad (6.4)$$
To simplify the mathematics, we consider couplings whose spatial range is small compared to the distance the wave travels in the time interval τ̃ = min{τ, τ_psp} relevant for the microscopic dynamics. In other words, we focus on the case where

$$\int_{-\infty}^{0} dz\, |z|^{n}\, J(z) \ll (v \tilde{\tau})^{n} \quad \forall\, n \in \mathbb{N}. \qquad (6.5)$$
In this regime, we can neglect higher-order terms in the expansion of the exponential functions in equation 6.4 and obtain

$$\theta \approx \frac{\gamma^{2}}{2\,\tau\, \tau_{\rm psp}} \int_{-\infty}^{0} dz\, J(z)\, z^{2}. \qquad (6.6)$$
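The expansion can be spot-checked numerically. In the sketch below, all parameter values are our own, chosen so that the coupling width is small in the sense of equation 6.5; the exact threshold from equation 6.4 and the approximation 6.6 then agree to within a couple of percent.

```python
import numpy as np

# Exact expression 6.4 versus the narrow-coupling approximation 6.6
# for a gaussian J; all parameter values are hypothetical for this sketch.
tau, tau_psp, gamma = 1.0, 0.5, 0.4   # gamma = 1/v - 1/c, equation 6.2
sigma_J = 0.02                        # narrow coupling
dz = 1e-5
z = np.arange(-0.5, 0.0, dz)
Jz = np.exp(-z**2 / (2 * sigma_J**2)) / (np.sqrt(2 * np.pi) * sigma_J)

bracket = 1 - (tau_psp * np.exp(-gamma * np.abs(z) / tau_psp)
               - tau * np.exp(-gamma * np.abs(z) / tau)) / (tau_psp - tau)
theta_exact = np.trapz(Jz * bracket, z)                                 # eq. 6.4
theta_approx = gamma**2 / (2 * tau * tau_psp) * np.trapz(Jz * z**2, z)  # eq. 6.6
rel_err = abs(theta_exact - theta_approx) / theta_exact
print(theta_exact, theta_approx, rel_err)
```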
To answer the question posed in the beginning of this section, we now consider on the one hand gaussian J couplings and arrive at ∫_{−∞}^{0} dz J(z) z² = σ_J²/2. If, on the other hand, we use gaussian T couplings with proper normalization (see equation 5.2), we obtain ∫_{−∞}^{0} dz J(z) z² = σ_T². Independent of values for τ, τ_psp, c, and v, the estimates for σ_J and σ_T are therefore always related by

$$\sigma_J = \sqrt{2}\, \sigma_T. \qquad (6.7)$$

We now turn to a second example and assume instantaneous interactions without any signal delays, a(s) = δ(s), so that ε(s) = τ^{-1} e^{-s/τ}. For gaussian J couplings, we can solve equation 6.1 and obtain

$$\theta = \frac{1}{2} - \frac{1}{2} \exp\!\left(\frac{\gamma^{2} \sigma_J^{2}}{2 \tau^{2}}\right)\left[1 + \operatorname{erf}\!\left(\frac{-\gamma \sigma_J}{\sqrt{2}\, \tau}\right)\right], \qquad (6.8)$$

where erf(x) denotes the error function. If, on the other hand, we take T couplings with gaussian shape, we get

$$\theta = \frac{1}{2} \sqrt{\frac{\pi}{2}}\, \frac{\gamma \sigma_T}{\tau} \exp\!\left(\frac{\gamma^{2} \sigma_T^{2}}{2 \tau^{2}}\right)\left[1 + \operatorname{erf}\!\left(\frac{-\gamma \sigma_T}{\sqrt{2}\, \tau}\right)\right]. \qquad (6.9)$$

Equating the right-hand sides of equations 6.8 and 6.9 for γσ ≪ τ, we obtain

$$\sigma_J = \frac{\pi}{2}\, \sigma_T. \qquad (6.10)$$

Finally, if instead of gaussian couplings, we use block-shaped couplings, J(z) = 1/(2a_J) for −a_J ≤ z ≤ a_J and zero elsewhere, a calculation similar to equations 6.8 through 6.10 leads to

$$\sigma_J = 2\, \sigma_T. \qquad (6.11)$$
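The second moments behind relation 6.7 can be verified directly (v and σ below are our own illustrative values): for a gaussian J the one-sided second moment is σ²/2, while the J derived from a gaussian T via equation 5.3 yields σ², so equating the thresholds of equation 6.6 forces σ_J = √2 σ_T.

```python
import numpy as np

# One-sided second moments entering equation 6.6; parameter values
# are hypothetical choices for this sketch.
v, sigma = 2.0, 1.0
dz = 1e-3
z = np.arange(-10, 0, dz)

J_gauss = np.exp(-z**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
m_J = np.trapz(J_gauss * z**2, z)        # gaussian J: sigma^2 / 2

# gaussian T normalized so that T(0) = 1/(2v); J = v dT/dz for z < 0 (eq. 5.3)
Tz = np.exp(-z**2 / (2 * sigma**2)) / (2 * v)
J_from_T = v * np.gradient(Tz, dz)
m_T = np.trapz(J_from_T * z**2, z)       # T-derived J: sigma^2
print(m_J, m_T, np.sqrt(m_T / m_J))
```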
Thus, in these three cases investigated, the width of the same type of synaptic coupling inferred from the measured wave velocity differs by a factor of 1.4 to 2 between the firing-rate and integrate-and-fire picture. Furthermore, we obtain σ_J > σ_T in all three cases. For gaussian couplings C, Figure 3 offers a heuristic explanation of this result: the right-hand side of equation 6.1 depends on the higher-order moments of J(z). For gaussian T couplings (lower right in the figure) with given standard deviation σ_T, the higher-order moments of the derived J couplings (lower left in the figure) are smaller than those of gaussian J couplings (upper left) with the same standard deviation, σ_J = σ_T. To compensate for this difference, we have to choose σ_J > σ_T, as verified by the analytical calculation.

7 Discussion
As shown in this article, wave fronts of traveling waves in firing-rate models and models with integrate-and-fire dynamics are intimately related on a formal level. Taking into account that these descriptions are extreme and opposite caricatures of the true neural dynamics, the differences in the inferred model parameters are surprisingly small and suggest that more elaborate biophysical models might show intermediate results. We may thus conclude that macroscopic wave phenomena can be used reliably to estimate the characteristics of neural dynamics and network architecture that are not directly observable. However, it should be noted that our calculations are based on simple homogeneous single-layer models. Neural tissues with multiple layers, more complicated feedback structures, or additional slow dynamical components might support traveling waves of a rather different nature, as also suggested by results of Rinzel et al. (1998) and Golomb and Ermentrout (1999, 2001). The emergent properties of such neural systems might depend sensitively on the biophysical details of the underlying dynamics. Our studies have been restricted to integrate-and-fire models that do not exhibit re-excitation. This approach is justified by the observation of Ermentrout (1998) that the speed of a traveling wave front is largely independent of subsequent spiking activity. More elaborate modeling frameworks could also be used to describe nonvanishing asynchronous activity ahead of and behind the traveling wave front and to include heterogeneity of neural and synaptic properties. We believe that these extensions offer interesting topics for further research but that they will not change the overall picture about the relation of wave phenomena in firing-rate and integrate-and-fire models. In addition, traveling wave solutions in both types of model are likely to share the same stability properties, but we have not been able to prove this analytically.
However, even if the stability of two corresponding solutions differed under certain circumstances, our analysis would still be helpful in that it allows finding unstable waves in one modeling framework by searching for all stable solutions in the
other framework, which, from a numerical point of view, is a much simpler task. The equivalence of firing-rate and integrate-and-fire models also holds for plane waves in two- and higher-dimensional systems. For concreteness, let us assume that the wave in a three-dimensional firing-rate model is propagating in the x-direction so that u(x, y, z, t) depends on only x and t. We may then reduce the original system to an effective one-dimensional system by defining effective J_eff couplings through J_eff(x) = ∫_{−∞}^{∞} dy ∫_{−∞}^{∞} dz J(x, y, z) and compute the corresponding T_eff couplings as in equation 5.1. We can therefore describe the original wave within either a one-dimensional firing-rate or a one-dimensional integrate-and-fire picture. In a further step, we may also wish to know which three-dimensional integrate-and-fire models are compatible with the one-dimensional model. To answer this question, we have to search for couplings T(x, y, z) that satisfy ∫_{−∞}^{∞} dy ∫_{−∞}^{∞} dz T(x, y, z) = T_eff(x). This problem is underdetermined, so that we may set additional constraints on T(x, y, z). For example, we can ask which isotropic and homogeneous couplings T̃, with T(x, y, z) = T̃(x² + y² + z²), correspond to a given T_eff. Although successful for plane wave solutions, our approach is not useful for circular or spiral waves. These solutions break the translation symmetry of the underlying network and distinguish one specific location: the origin of the expanding wave. If we nevertheless apply our methods and start from homogeneous couplings in the firing-rate or integrate-and-fire framework, we obtain inhomogeneous couplings in the other framework. Furthermore, the coupling strengths explicitly reflect the location of the origin of the specific wave solution. With these limitations in mind, the current approach may help to unify different concepts of neural field dynamics. Let us close with a final observation about the relation between firing-rate and integrate-and-fire neural network models.
Firing-rate dynamics are usually derived from models with spiking neurons by taking short-time averages. In field models with spatially continuous neural activity, a conceptually different method becomes possible: spatial integration of the dynamical variables, together with spatial differentiation of the synaptic couplings. For traveling waves, no information is lost by this transformation. This shows that under certain circumstances, microscopic dynamical descriptions as different as firing-rate and integrate-and-fire models may exhibit equivalent patterns of large-scale activity.
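The dimensional reduction used in the discussion above — collapsing a three-dimensional coupling to an effective one-dimensional J_eff by integrating over the transverse coordinates — can be illustrated numerically (σ below is our own choice). For an isotropic gaussian, the marginal is again a gaussian of the same width, which can then be fed into transformation 5.1.

```python
import numpy as np

# Effective 1D coupling J_eff(x) obtained by integrating an isotropic 3D
# gaussian coupling over y and z; sigma is a hypothetical value.
sigma, d = 1.0, 0.05
y = np.arange(-6, 6, d)
Y, Z = np.meshgrid(y, y, indexing="ij")

def J3(x):
    # isotropic, normalized 3D gaussian evaluated on the (y, z) grid
    r2 = x**2 + Y**2 + Z**2
    return np.exp(-r2 / (2 * sigma**2)) / ((2 * np.pi) ** 1.5 * sigma**3)

xs = np.array([0.0, 0.5, 1.0, 2.0])
J_eff = np.array([np.sum(J3(x)) * d * d for x in xs])    # marginal over y, z
J_1d = np.exp(-xs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
marg_err = np.max(np.abs(J_eff - J_1d))
print(J_eff, marg_err)
```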
Acknowledgments
We thank Katrin Suder and Florentin Wörgötter for stimulating discussions and Wulfram Gerstner, Werner Kistler, Martin Stemmler, Laurenz Wiskott, and two referees for most helpful comments on the manuscript. This work has been supported by the Deutsche Forschungsgemeinschaft and the Human Frontier Science Program.
References

Amari, S. I. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Ben-Yishai, R., Hansel, D., & Sompolinsky, H. (1997). Traveling waves and the processing of weakly tuned inputs in a cortical network module. Journal of Computational Neuroscience, 4, 57–77.
Beurle, R. L. (1956). Properties of a mass of cells capable of regenerating pulses. Philosophical Transactions of the Royal Society B, 240, 55–94.
Bressloff, P. C. (1999). Synaptically generated wave propagation in excitable neural media. Physical Review Letters, 82, 2979–2982.
Bressloff, P. C. (2000). Traveling waves and pulses in a one-dimensional network of excitable integrate-and-fire neurons. Journal of Mathematical Biology, 40, 169–198.
Bringuier, V., Chavane, F., Glaeser, L., & Fregnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699.
Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. Journal of Neurophysiology, 76, 2049–2070.
Ermentrout, G. B. (1998). The analysis of synaptically generated traveling waves. Journal of Computational Neuroscience, 5, 191–208.
Ermentrout, B., & Cowan, J. (1979). A mathematical theory of visual hallucination patterns. Biological Cybernetics, 34, 137–150.
Fohlmeister, C., Gerstner, W., Ritz, R., & van Hemmen, J. L. (1995). Spontaneous excitations in the visual cortex: Stripes, spirals, rings, and collective bursts. Neural Computation, 7, 1046–1055.
Gelperin, A. (1999). Oscillatory dynamics and information processing in olfactory systems. Journal of Experimental Biology, 202, 1855–1864.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. Journal of Neuroscience, 8, 2928–2937.
Golomb, D., & Amitai, Y. (1997). Propagating neuronal discharges in neocortical slices: Computational and experimental study. Journal of Neurophysiology, 78, 1199–1211.
Golomb, D., & Ermentrout, G. B. (1999). Continuous and lurching traveling pulses in neuronal networks with delay and spatially decaying connectivity. Proc. Natl. Acad. Sci. USA, 96, 13480–13485.
Golomb, D., & Ermentrout, G. B. (2001). Bistability in pulse propagation in networks of excitatory and inhibitory populations. Phys. Rev. Lett., 86, 4179–4182.
Horn, D., & Opher, I. (1997). Solitary waves of integrate-and-fire neural fields. Neural Computation, 9, 1677–1690.
Idiart, M. A. P., & Abbott, L. F. (1993). Propagation of excitation in neural network models. Network, 4, 285–294.
Kistler, W. M. (2000). Stability properties of solitary waves and periodic wave trains in a two-dimensional network of spiking neurons. Phys. Rev. E, 62(6), 8834–8837.
Kistler, W. M., Seitz, R., & van Hemmen, J. L. (1998). Modeling collective excitations in cortical tissue. Physica D, 114, 273–295.
Meister, M., Wong, R. O. L., Denis, A. B., & Shatz, C. J. (1991). Synchronous bursts of action potentials in ganglion cells of the developing mammalian retina. Science, 252, 939–943.
Rinzel, J., Terman, D., Wang, X.-J., & Ermentrout, B. (1998). Propagating activity patterns in large-scale inhibitory neuronal networks. Science, 279, 1351–1355.
Traub, R. D., Jefferys, J. G. R., & Miles, R. (1993). Analysis of propagation of disinhibition-induced after-discharges along the guinea-pig hippocampal slice in vitro. Journal of Physiology, 472, 267–287.
von Seelen, W. (1968). Informationsverarbeitung in homogenen Netzen von Neuronenmodellen I. Kybernetik, 5, 133–148.
Wadman, W. J., & Gutnick, M. J. (1993). Non-uniform propagation of epileptiform discharge in brain slices of rat neocortex. Neuroscience, 52, 255–262.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.

Received January 18, 2000; accepted January 4, 2002.
LETTER
Communicated by Alexandre Pouget
Attentional Recruitment of Inter-Areal Recurrent Networks for Selective Gain Control

Richard H. R. Hahnloser
[email protected]
Howard Hughes Medical Institute, Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, U.S.A.

Rodney J. Douglas
[email protected]
Institute of Neuroinformatics ETHZ/UNIZ, CH-8057 Zürich, Switzerland

Klaus Hepp
[email protected]
Institute for Theoretical Physics, ETHZ, Hönggerberg, CH-8093 Zürich, Switzerland

There is strong anatomical and physiological evidence that neurons with large receptive fields located in higher visual areas are recurrently connected to neurons with smaller receptive fields in lower areas. We have previously described a minimal neuronal network architecture in which top-down attentional signals to large receptive field neurons can bias and selectively read out the bottom-up sensory information to small receptive field neurons (Hahnloser, Douglas, Mahowald, & Hepp, 1999). Here we study an enhanced model, where the role of attention is to recruit specific inter-areal feedback loops (e.g., drive neurons above firing threshold). We first illustrate the operation of recruitment on a simple example of visual stimulus selection. In the subsequent analysis, we find that attentional recruitment operates by dynamical modulation of signal amplification and response multistability. In particular, we find that attentional stimulus selection necessitates increased recruitment when the stimulus to be selected is of small contrast and of small distance away from distractor stimuli. The selectability of a low-contrast stimulus is dependent on the gain of attentional effects; for example, low-contrast stimuli can be selected only when attention enhances neural responses. However, the dependence of attentional selection on stimulus-distractor distance is not contingent on whether attention enhances or suppresses responses. The computational implications of attentional recruitment are that cortical circuits can behave as winner-take-all mechanisms of variable strength and can achieve close to optimal signal discrimination in the presence of external noise.
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1669–1689 (2002)
1 Introduction
The primate visual cortex is divided into many distinct areas that are organized hierarchically by recurrent inter-areal connections. The various areas represent visual space nearly topographically, but the sensory receptive fields of their neurons are quite different. It appears that the role of inter-areal ascending projections is to form progressively higher representations of the visual world, for example, from neurons tuned to moving edges in V1 (Hubel & Wiesel, 1962) up to neurons tuned to the view of particular objects in area IT (Logothetis, Pauls, Bülthoff, & Poggio, 1994; Riesenhuber & Poggio, 1999). The size of neuronal receptive fields grows with hierarchical level. For example, in the macaque monkey, classical receptive field diameters grow from about 1 degree in V1 to 5 degrees in MT, 30 degrees in MST, and in IT they can be even larger than the entire contralateral visual field. There is evidence that the descending inter-areal connections affect attentional selection and memorization of behaviorally relevant information (Motter, 1993; Tomita, Ohbayashi, Nakahara, & Miyashita, 1999). Attention-related neural activity has been observed in many cortical visual areas of the macaque monkey. The strength of attention decreases with level from MST, MT, V4, V2 down to V1 (Motter, 1993, 1994; Moran & Desimone, 1985; Luck, Chelazzi, Hillyard, & Desimone, 1997; Colby, Duhamel, & Goldberg, 1996; Treue & Maunsell, 1996, 1999; Fuster, 1990; Desimone, 1996). There is evidence that the origin of attentional signals is in prefrontal cortex (Tomita et al., 1999). The generally low firing rates in higher areas suggest that the attentional control circuitry recruits or derecruits feedback networks with neurons in lower areas by driving neurons above or below firing threshold. Several experiments provide evidence of the involvement of inter-areal feedback in the attentional selection of low-contrast stimuli.
In V4, responses of neurons to very low-contrast stimuli are not increased when they are attended in the presence of high-contrast distractors: Response enhancement is possible only above a critical contrast of about 5% (Reynolds, Pasternak, & Desimone, 2000). Also, in the experiments of De Weerd, Peralta, Desimone, and Ungerleider (1999), it was found that restricted lesions of areas V4 and TEO result in monkeys being unable to report the orientation of low-contrast gratings in the presence of high-contrast distractors. However, the monkeys’ perceptual performance for low-contrast stimuli was almost unchanged when the distractors were not present. This suggests that the circuits in TEO–V4 are highly involved in attentional selection of low-contrast stimuli rather than in reading out of stimulus orientation. Previously, we have noted that if attentional stimulus selection is induced by an excitatory top-down bias, then this bias necessarily also leads to a bias in the readout of the selected stimulus (Hahnloser et al., 1999). Here, we resolve this selection / readout dilemma by considering attentional inputs that are just strong enough to recruit neurons in the higher area and their
feedback loops with neurons in the lower area, but without providing for an additional input bias. Stimulus selection is achieved by recruiting more feedback loops for stimuli that are to be attended than for distractor stimuli. We cast our network as a simple model of two recurrently connected areas, such as MST–MT, V5–V2, or TEO–V4. We explore the computational principles that could underlie the recruitment of feedback between large and small receptive field neurons by simulating physiological experiments in which multiple stimuli appear inside the receptive field of a large-field neuron. We study the limits within which attentional selection is possible when the attended stimulus is of low contrast and at a small distance from distractor stimuli. By varying the amount of recruitment and the size of the attended stimulus, we explore the accuracy of readout in a noisy environment.

2 Network Equations
The firing rates of the E excitatory neurons in the lower area are denoted by M_x (x = 1, . . . , E), those of the N excitatory neurons in the higher area by P_i (i = 1, . . . , N), and those of the I inhibitory neurons in the lower area by I_y (y = 1, . . . , I) (see Figure 1a). The indices x and y stand for a one-dimensional topography of the lower area, and the index i stands for a not necessarily topographic labeling of neurons in the higher area. The equations describing the evolution of firing rates are given by:

    τ Ṗ_i = −P_i + [ p_i + α_F Σ_{x=1}^{E} M_x cos(δ_x − χ_i)_+ − t ]_+        (2.1)

    τ Ṁ_x = −M_x + [ m_x + α_B Σ_{i=1}^{N} P_i cos(δ_x − χ_i)_+ − β Σ_{y=1}^{I} I_y ]_+        (2.2)

    τ İ_y = −I_y + [ α_I Σ_{i=1}^{N} P_i cos(ψ_y − χ_i)_+ − β_I Σ_{y'=1}^{I} I_{y'} ]_+        (2.3)

Here [f]_+ = max(0, f) denotes rectification and ensures positivity of firing rates. The receptive field centers δ_x of neurons M_x are regularly spaced, δ_x = δ_max (x − 1)/(E − 1). Similarly, the receptive field centers of neurons I_y are given by ψ_y = ψ_max (y − 1)/(I − 1). The inter-areal connections are purely excitatory. Their strength decays as a function of receptive field separation z, according to cos(z)_+ = [cos(z)]_+. There is uniform inhibition in the lower area (the assumption of uniformity is a simplification that could be relaxed; see Wersing, Beyn, & Ritter, 2001). The parameters α_F, α_B, and α_I determine the strength of excitation, and β and β_I the strength of inhibition.
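As a concrete illustration, the rate dynamics of equations 2.1–2.3 can be integrated numerically. The Python sketch below uses forward Euler together with the localized stimulus m_x defined later in this section; all parameter values are illustrative placeholders chosen for stability of the sketch, not values taken from the paper's figures:

```python
import numpy as np

# Forward-Euler sketch of equations 2.1-2.3. All parameter values below are
# illustrative placeholders (chosen weak enough to be stable), not the paper's.
E, N, I = 320, 3, 32                        # lower excitatory, pointer, inhibitory counts
delta_max, psi_max = np.pi, np.pi
delta = delta_max * np.arange(E) / (E - 1)  # receptive-field centers delta_x
psi = psi_max * np.arange(I) / (I - 1)      # receptive-field centers psi_y
chi = np.array([0.0, np.pi / 2, np.pi])     # pointer centers chi_i

aF, aB, aI = 0.01, 0.2, 1.0                 # alpha_F, alpha_B, alpha_I (illustrative)
beta, beta_I = 1.0, 1.0                     # beta, beta_I (illustrative)
t_thr, tau, dt = 1.0, 1.0, 0.01             # firing threshold t, time constant, Euler step

def pos(u):
    """Rectification [u]_+ = max(0, u)."""
    return np.maximum(u, 0.0)

W_MP = pos(np.cos(delta[:, None] - chi[None, :]))  # cos(delta_x - chi_i)_+, shape (E, N)
W_IP = pos(np.cos(psi[:, None] - chi[None, :]))    # cos(psi_y - chi_i)_+,  shape (I, N)

def stimulus(h, a, r):
    """Localized input m_x = h cos((pi/a)(delta_x - r)) of width a, centered at r."""
    d = delta - r
    return np.where(np.abs(d) <= a / 2, h * np.cos(np.pi / a * d), 0.0)

def step(P, M, Iy, m, p):
    """One Euler step of the three rate equations."""
    dP = (-P + pos(p + aF * (W_MP.T @ M) - t_thr)) / tau
    dM = (-M + pos(m + aB * (W_MP @ P) - beta * Iy.sum())) / tau
    dI = (-Iy + pos(aI * (W_IP @ P) - beta_I * Iy.sum())) / tau
    return P + dt * dP, M + dt * dM, Iy + dt * dI

# Relax to a steady state with one stimulus and pointer 1 recruited (p_1 = t).
P, M, Iy = np.zeros(N), np.zeros(E), np.zeros(I)
p = np.array([t_thr, 0.0, 0.0])             # attentional input recruits pointer 1 only
m = stimulus(h=1.0, a=np.pi / 6, r=np.pi / 4)
for _ in range(5000):
    P, M, Iy = step(P, M, Iy, m, p)
```

With this weak coupling, the recruited pointer whose receptive field overlaps the stimulus becomes active, while the non-recruited pointers stay below threshold and remain silent.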
Figure 1: Attending to one of two stimuli moving in antiphase. (a) Schematic of network architecture of excitatory neurons in the higher area (P) and excitatory and inhibitory neurons in the lower area (M and I). The Greek letters denote the synaptic coupling strengths. (b) Two moving visual stimuli are placed in the receptive field of neuron P_2 in area MST (thick circle, schematic). Synaptic weights of the three MST neurons with MT neurons are shown on top. (c) The attentional selection of the left stimulus (indicated by the rectangle) is modeled by recruiting pointer neurons P_1 and P_2. Their responses are shown by the solid and the dash-dotted lines, correlating with the upward movement of the left stimulus. p_1 = p_2 = 1, p_3 = 0. (d) This time, pointer neurons P_2 and P_3 are recruited. Their responses (solid and dashed lines) correlate with the upward movement of the right stimulus. E = 320, δ_max = π, N = 3, I = 32, ψ_max = π, χ_1 = 0, χ_2 = π/2, χ_3 = π, α_F = .5, α_B = 2, α_I = 2, β = 60.07, β_I = 60, a = 6, t = 1, h_up = 1, h_down = .1.
For the simulations, the visual input m_x contains either one or two localized stimuli of the form m_x = h cos((180°/a)(δ_x − r))_+ if |δ_x − r| ≤ a/2 and m_x = 0 otherwise. Here, h corresponds to the contrast of a stimulus (or its luminosity), a to the stimulus width (in degrees), and r to its retinal location, 0° < r < δ_max. Neurons P_i have nonzero firing thresholds t > 0 that express the difficulty of driving neurons in higher areas by visual stimulation alone. The attentional inputs p_i to neurons in the higher area are set to either zero (no recruitment) or t (recruitment). Appendix A describes a simple method for selecting appropriate values for the five coupling parameters α_F, α_B, α_I, β, and β_I.

3 Example of Recruitment
Typically, neurons in various higher visual areas are able to respond to just the attended stimulus in their receptive field, filtering out nonattended stimuli (Moran & Desimone, 1985; Desimone, 1998; Reynolds, Chelazzi, & Desimone, 1999). For example, Treue and Maunsell recorded from neurons in areas MT and MST of alert monkeys during a visual attention task involving two moving stimuli (Treue & Maunsell, 1996, 1999). The monkeys were instructed to attend to one of them and to respond quickly to a change of speed. Both stimuli fell inside the receptive field of a recorded neuron, and alternately, one stimulus moved in the neuron's preferred direction while the other moved in the antipreferred direction. They found that most of the time, the neuronal response correlated strongly with the direction of motion of the attended stimulus, not the distractor stimulus. And when monkeys attended to a stimulus outside the receptive field, the neuronal response was suppressed, not correlating with the preferred movement direction of either of the two stimuli in the receptive field. We chose these experiments by Treue and Maunsell to illustrate the operation of recruitment (we do not provide a complete model of the MT–MST interactions). We simulated the response behavior of N = 3 motion-selective neurons P_1, P_2, and P_3 in area MST (the higher area) and E = 320 motion-direction selective neurons M_x in area MT (the lower area). Receptive field centers in MST are χ_1 = 0°, χ_2 = δ_max/2, and χ_3 = δ_max = 180°. There are two vertically moving dots. Neuron P_2 sees both dots in its receptive field, whereas neurons P_1 and P_3 each see only one dot. The two dots oscillate in antiparallel directions to each other (see Figure 1b). Because we assume that the one-dimensional map in MT encodes only the horizontal dimension, "vertical movement" was simulated by contrast changes.
We assume that all MT and MST neurons have an equal upward direction preference, simply modeled by setting the contrast of the "downward-moving dot" to one-tenth of the contrast of the "upward-moving dot" (in other words, the contrasts of the two dots flipped back and forth between two values). The common threshold t of the neurons in MST is large. In the model, MST neurons are activated by visual stimulation only when they are also
recruited by attentional input, p_i = t. But only those neurons are recruited whose receptive fields overlap with the current focus of attention. For example, when the monkey attends to the left stimulus, neurons P_1 and P_2 are recruited, in which case neuron P_2 responds mainly during the upward movement of the dot on the left (see Figure 1c). Similarly, the response of P_2 is bound to the upward movement of the right dot when the monkey attends to the right and neurons P_2 and P_3 are recruited (see Figure 1d). Neurons P_1 and P_3 are active only when the attended stimulus is the one in their receptive field, in which case they respond to its upward movement. When the monkey attends to the other stimulus, outside their receptive field, they remain silent. In analogy with the Treue and Maunsell experiments, attention causes stimulus competition beyond receptive field boundaries (neurons P_1 and P_3), as well as within receptive field boundaries (neuron P_2). The explanation of why in Figure 1 the activity of neuron P_2 correlates with the movement direction of the attended stimulus is quite simple: the recruitment of either neuron P_1 or P_3 contributes feedback amplification, enhancing responses in MT to the left or right stimulus. And because responses are enhanced in MT, responses will be enhanced in MST as well. This explains why neuron P_2 (receiving the same attentional input p_2 = t in both Figures 1c and 1d) can have a response that is biased according to which fellow MST neuron is recruited.

4 Loss of Attentional Selection for Low-Contrast and Nearby Stimuli
In this section, we analyze the conditions under which attentional recruitment enables persistent selection of a behaviorally relevant stimulus in the presence of distractor stimuli. By "persistent selection," we mean that neural responses are locked to the selected stimulus and persist when distractor stimuli change (in other words, neural responses are as if the distractors were not present). We explore the sensitivity of persistent selection to various parameters such as stimulus contrast (relative to distractor contrast) and stimulus location (relative to distractor location). We will consider only a one-dimensional network and nearby stimuli that fall between r = 0° and r = 90°. Consequently, we restrict the map in the lower area to δ_max = 90° and ψ_max = 90°. Because we are mainly interested in the effects of recruitment, half of the neurons in the higher area have receptive field centers at χ_i^1 = 0° and the other half at χ_i^2 = 90° (i = 1, . . . , N/2). By discretizing the receptive field centers to just two values separated by 90 degrees, the population activity in the higher area gets a simple interpretation: the pairs P_i = (P_i^1, P_i^2) of neurons form activity vectors whose direction indicates the center of activity in the lower area (and thus the location of the selected stimulus). This population vector property stems from the geometrical fact that there are sine and cosine synaptic connection profiles between the two areas (which in turn is based
on the equality cos(a − 90°) = sin(a)). As in our previous work, the activity vectors P_i in the higher area shall be referred to as pointers (Hahnloser et al., 1999). Recruiting pointers that share receptive field centers is mathematically equivalent to changing the synaptic weights α_F, α_B, and α_I made by a single pointer. In Figure 2a, a stimulus and a distractor of equal contrast are presented to the network at three different separations. The steady response in the higher area is read out as the pointer angle,

    c = arctan( Σ_i P_i^2 / Σ_i P_i^1 ),

and is plotted as a function of the number N_+ of recruited pointers (N_+ is defined by p_k^1 = p_k^2 = t for k ≤ N_+ and p_k^1 = p_k^2 = 0 for k > N_+). All recruited pointers are initialized so as to express a preattentive bias to the left, P_i(0) = (2, 0). This initialization tends to induce attentional selection of the stimulus on the left. When stimulus and distractor are directly adjacent to each other, persistent selection arises only for about N_+ > 20. Persistent selection requires fewer pointers the farther apart the stimuli are. In a similar way as for stimulus separation, the selection of a low-contrast stimulus is possible only within limits. Figure 2b shows a diagram in which we plot (as a function of N_+) the minimal relative contrast of a stimulus that still permits its selection by attentional recruitment. Relative contrast is defined as the contrast of the stimulus divided by the contrast of the distractor. When the stimulus is close to the distractor and 32 pointers are recruited, persistent selection is possible only if its contrast is at least 20% of the distractor contrast. However, when the stimulus is farther away, many fewer pointers are required for persistent selection at the same relative contrast. In other words, for a fixed number of recruited pointers, the minimal stimulus contrast allowing for persistent selection increases as the stimulus and the distractor move closer together.
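The pointer-angle readout is a short computation; a minimal sketch (the variable names are ours):

```python
import numpy as np

# Sketch of the pointer-angle readout c = arctan(sum_i P_i^2 / sum_i P_i^1).
# P1 and P2 hold the rates of the pointer pools with receptive-field centers
# at 0 and 90 degrees, respectively.
def pointer_angle(P1, P2):
    """Readout angle c in degrees; rates are nonnegative, so c lies in [0, 90]."""
    return np.degrees(np.arctan2(np.sum(P2), np.sum(P1)))
```

With all activity in the 0° pool the readout is 0°; equal activity in both pools points at the middle of the map, about 45°.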
We have analyzed the sensitivity of these results to the strength of synaptic weights. Interestingly, we found that altering the strength of the excitatory feedback (e.g., decreasing α_B by 4%) does not have a noticeable influence on the distance sensitivity in Figure 2a. However, this decrease of excitatory feedback has a dramatic effect on the contrast sensitivity in Figure 2b. The reduced strength of excitatory feedback for the intermediate stimulus-distractor separation results in a highly reduced performance for selecting low-contrast stimuli. In the next section, we show that the reason for the decreased selectability is that a small reduction in feedback can cause the net effect of recruitment on neurons in the lower area to be inhibitory rather than excitatory, without affecting the strength of competition.
Figure 2: Distance and contrast dependence of attentional selection. (a) The effective pointer angle c is plotted as a function of the number of recruited pointers. Two identical stimuli are presented at three different interstimulus separations (insets above, 17, 34, and 51 degrees). The initial conditions of pointer neurons P_i(0) = (2, 0) tend to cause selection of the left stimulus, the right stimulus representing a distractor. The closer the two stimuli are, the more pointers are required for persistent selection. For the first case, where the stimuli are directly adjacent to each other (solid line), three snapshots of steady map activity are shown, corresponding to interpolation (left), partial selection (middle), and persistent selection (right). (b) The minimal relative contrast allowing for persistent selection is shown as a function of the number N_+ of recruited pointers. The curves were determined by slowly decreasing the contrast of the selected stimulus until c starts to deviate from the center of the selected stimulus. Again, curves are plotted for three stimulus-distractor separations. The solid, dashed, and dash-dotted curves correspond to α_B = 0.625. The dashed curve labeled II corresponds to α_B = 0.6, to be compared with the dashed curve labeled I. β = 3.755, β_I = 60, α_F = 0.1, α_I = 10, I = 32, N = 64, E = 320.

5 Winner-Take-All and Attentional Enhancement of Responses
Here we compare inter-areal feedback and winner-take-all (WTA) mechanisms. We show that recruitment has the effect of changing the strength of the WTA mechanism. A uniform input to the lower area can be viewed as a setting in which each neuron has the same chance of being activated at a steady state (this is due to the translational invariance of feedback, sin²a + cos²a = 1; see also Hahnloser et al., 1999). The winning neurons (the ones that are activated) are determined by the initial conditions of the dynamics. In Figure 3a, for fixed initial conditions, we see that a localized response to uniform stimulation emerges. The recruitment of many pointers leads to a substantial
narrowing of the activity profile. In Figure 3b, the response width w of the steady response profile is plotted as a function of the number of recruited pointers N_+. The circles are simulation results, and the full line corresponds to an analytical calculation done in appendix B. It can be seen that w is a monotonically decreasing function of N_+. This behavior can be interpreted as a WTA mechanism whose softness or hardness is modulated by the number of recruited pointers: the more of them are recruited, the harder the WTA mechanism becomes. As an interesting limit to this recruitment-induced strengthening of WTA mechanisms, in appendix C, we calculate the hard WTA limit, in which only one neuron M_x can be active at a steady state. This limit has similarities to a maximum operation that has been suggested to be of relevance for object recognition (Riesenhuber & Poggio, 1999). We find that for a hard WTA, the number N_+^hard of pointers that have to be recruited grows quadratically in E. This scaling law suggests that there are not enough neurons to achieve an exact maximum operation by recruitment, but that at best, an approximation to the maximum operation is possible. As in the previous section, we have analyzed the effect of reducing the strength of excitatory feedback. In Figure 3c, we reduced α_B by 4%. In this case, recruiting pointers does not enhance responses as it did in Figure 3a, but suppresses them. In appendix A, we show that the polarity of signal gain depends on the balance between excitatory and inhibitory feedback gain. Signal enhancement occurs if the gain of excitatory feedback set by α_F and α_B is larger than the gain of inhibitory feedback set by α_F, α_I, β_I, and β. Interestingly, although the signal gains in Figures 3a and 3c are different, the response widths are not. Hence, the hardness of the WTA is insensitive to the exact tuning of synaptic strength, as was the stimulus-separation sensitivity of attentional selection in Figure 2a.
6 Attentionally Controlled Noise Suppression
Psychophysical studies and electrophysiological recordings show that visual attention can enhance the discriminative performance of macaque monkeys (Spitzer, Desimone, & Moran, 1988; Lu & Dosher, 1998) and the discriminative responses of single neurons (Spitzer et al., 1988; McAdams & Maunsell, 1999). Signal discrimination is in many ways equivalent to signal estimation in the presence of external noise. Population vectors (low-dimensional representations of the activity of many neurons) are possible signal estimators. They can achieve an unbiased readout of sensory input signals if the input noise is uncorrelated between neurons (Seung & Sompolinsky, 1993). In other words, the mean readout of a population vector over many stimulus repetitions is equal to the value of the sensory signal. This is a highly desirable feature of any readout method. However, population vectors do not always have a good performance in averaging out noise. For example, if the neurons supporting the population
Figure 3: Transition from soft to hard winner-take-all. (a) Steady map response (full and dashed line) to uniform input (dash-dotted line). Attentional recruitment (from 1 to 32 pointers) leads to a sharpened response profile of similar total activity, but with enhanced peak response. The gain of excitatory feedback is approximately equal to that of the inhibitory feedback, α_B = 0.625 (see appendix A). (c) The gain of excitatory feedback is smaller than in a, α_B = 0.6. Recruitment leads to a strong down modulation of the population response. (b) The response width decays monotonically with the number of recruited pointers (i.e., the strength of the WTA competition increases). For a given number of recruited pointers, the response width is invariant to small changes in α_B (not shown). The circles represent simulations of equations 2.1 to 2.3, and the full line corresponds to a plot of equation B.2. E = 320, I = 32, N = 64, β = 3.755, β_I = 60, α_F = 0.1, α_I = 10.
vector have very narrow tuning curves, then the mean squared error of the readout tends to become very large. In general, any unbiased estimator should have the smallest possible variance, because the variance determines how well two similar stimuli can be discriminated. Pouget, Zhang, Deneve, and Latham (1998) have examined the advantage of recurrence in the problem of large variability of population vectors.
They found that lateral excitatory connections in cortex can substantially reduce the uncorrelated noise between neurons. In this way, the noisy input to a map of recurrently connected neurons is restored to a steady-state activity that can then be read out by a population vector with near-optimal accuracy. However, the near-optimal readout is achieved only if the stimulus width closely matches the intrinsic tuning width of synaptic connections. Here we show that a much broader range of near-optimality can be achieved when inter-areal feedback loops are recruited according to some prior knowledge of stimulus width. In our network, the inter-areal feedback combines desirable features of both of the above readout methods. That is, pointers can read out the activity of a map by an unbiased population vector and provide the necessary feedback to cancel uncorrelated noise. Thus, it is possible to have the best of both worlds. We set the task of the network to extract the location of a stimulus f_x of variable width a, where f_x(r) = cos((π/a)(δ_x − r)) if |δ_x − r| ≤ a/2 and f_x = 0 otherwise (r is the location of the stimulus). We assume that there is prior knowledge available about the stimulus width a. The noise is modeled by adding to f_x a random number g taken from a gaussian distribution with zero mean and fixed variance σ². In this way, the input probability density m_x is given by

    P(m_x | r) = (1/√(2πσ²)) exp( −(m_x − f_x(r))² / (2σ²) ).        (6.1)
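One sample from this noise model can be generated as follows (a minimal sketch; the value of E and the restriction of the map to 90 degrees are illustrative assumptions, and angles are handled in radians):

```python
import numpy as np

# One draw from the noise model of equation 6.1: the observed input m_x is the
# tuning profile f_x(r) plus independent gaussian noise on each neuron.
E = 320
delta = (np.pi / 2) * np.arange(E) / (E - 1)    # map restricted to [0, 90] degrees

def f(r, a):
    """Noise-free profile f_x(r) = cos((pi/a)(delta_x - r)) on its support, else 0."""
    d = delta - r
    return np.where(np.abs(d) <= a / 2, np.cos(np.pi / a * d), 0.0)

def noisy_input(r, a, sigma, rng):
    """Draw m_x = f_x(r) + g with g ~ N(0, sigma^2), one sample per neuron."""
    return f(r, a) + rng.normal(0.0, sigma, size=E)
```

Repeated calls with the same r and a give the stimulus ensemble over which the readout statistics of Figure 4 are computed.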
Figure 4a shows the response of the network to this noisy stimulus under conditions of both few and many recruited pointers. Similar results hold if the noise is Poissonian rather than gaussian. We have computed the mean and standard deviation of the readout c for 5000 presentations of the same stimulus at r = 45°, but with different noise g. The mean of c converged toward r, in agreement with unbiased estimation. In Figure 4b, we have plotted the standard deviation S(c) as a function of the number of recruited pointers. For a broad stimulus (dashed line), the standard deviation is minimal when the number of recruited pointers falls between 3 and 5. On the other hand, when the stimulus is narrow (full line), the readout is optimal when about 7 to 14 pointers are recruited. Thus, we find that the number of pointers that should be recruited depends critically on the stimulus width. Figure 4c shows the dependence of the standard deviation S(c) on stimulus width, for 1, 4, and 32 recruited pointers. The stimulus location was held fixed at r = 45°, and its width a was varied in steps of 4.5 degrees from 4.5 to 81 degrees. The standard deviation was computed using n = 1000 presentations of the stimulus for each width. To get a sense of how large these standard deviations of the pointer estimates are in absolute terms, we have compared them to the minimal
Figure 4: Recruiting feedback for suppressing noise. (a) A noisy stimulus of width a = 54° is indicated by the dotted line. The steady activity in the lower area is shown for the case of 32 recruited pointers (solid line) and for the case of one recruited pointer (dashed line). σ² = .04, E = 90. (b) Standard deviations of the pointer readout as a function of the number of recruited pointers. A relatively broad stimulus, a = 45°, results in a minimal standard deviation for about three to five recruited pointers (dashed line). The lower bound as given by equation 6.3 is shown by the fine dashed line. For a slightly narrower stimulus, a = 34°, the standard deviation is minimal for 6 to 15 recruited pointers. The fine full line shows the lower bound for this stimulus width. For both stimuli, the optimal pointer estimates deviate by about 10% from the theoretical minimum. σ² = .04. (c) Standard deviations as a function of stimulus width a. Narrow stimuli (in region A) require strong feedback (32 pointers). Broader stimuli (in regions B and C) require weaker feedback (4 and 1 pointer, respectively). The fine dashed line represents the performance of the population vector estimate, equation 6.4. Its performance is similar to the 1-pointer case. σ² = .04. (d) As suggested in b, for every number of recruited pointers, there is a different stimulus width a_best for which the standard deviation of the readout is minimal (thick full line: σ² = .04; fine full line: σ² = .01). The dashed line shows the response width w to uniform input. E = 80, N = 40, I = 20, α_F = 0.4, α_B = 0.1, α_I = 2.5, β_I = 24, β = 0.9656.
achievable standard deviation S(c_opt) for any readout method. For a large network (E ≫ 1), this minimum is given by the Cramér–Rao bound (Cover & Thomas, 1991), defined by the inverse of the square root of the Fisher information:

    S(c_opt) = 1 / √( Σ_x ⟨ −(d²/dr²) ln P(m_x | r) ⟩ ).        (6.2)

A calculation shows that for large E,

    S(c_opt) = σ √( a / (πE) ),        (6.3)
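The step from equation 6.2 to equation 6.3 can be checked numerically: for the gaussian model, the per-neuron Fisher information ⟨−(d²/dr²) ln P(m_x | r)⟩ reduces to (df_x/dr)²/σ². A sketch with illustrative values (not the paper's):

```python
import numpy as np

# Numerical check that equation 6.2 reduces to equation 6.3 for gaussian noise:
# per neuron, <-(d^2/dr^2) ln P(m_x|r)> = (df_x/dr)^2 / sigma^2.
# E, a, sigma, r are illustrative; the map spans 90 degrees (pi/2 radians).
E, a, sigma, r = 320, np.pi / 4, 0.2, np.pi / 4
delta = (np.pi / 2) * np.arange(E) / (E - 1)

d = delta - r
# Derivative of f_x(r) = cos((pi/a)(delta_x - r)) with respect to r, on its support
dfdr = np.where(np.abs(d) <= a / 2, (np.pi / a) * np.sin(np.pi / a * d), 0.0)
fisher = np.sum(dfdr ** 2) / sigma ** 2          # total Fisher information, equation 6.2
S_opt_numeric = 1.0 / np.sqrt(fisher)
S_opt_closed = sigma * np.sqrt(a / (np.pi * E))  # equation 6.3, large-E limit
```

For these values, both evaluate to about 0.0056, agreeing to well under one percent; the small residual is the discretization error of the sum over E neurons.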
which corresponds to the fine line in Figure 4c. Hence, an optimal readout has an error that is proportional to the square root of the stimulus width and inversely proportional to the square root of the number of neurons. It is illustrative to compare the pointer readout to the readout achieved by a population vector defined by v = (v_1, v_2) = Σ_x m_x (cos δ_x, sin δ_x). We have calculated the standard deviation for large E, under the approximation that fluctuations are small, in which case only the component z = (1/√2)(v_2 − v_1) of the population vector orthogonal to the stimulus direction r matters:

    S(c_pop) ≃ S(z) / |⟨v⟩| = σ ((π² − a²) / (4a cos(a/2))) √( (π − 2) / (2πE) ).        (6.4)
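Equation 6.4 can be verified by a small Monte Carlo experiment on the population-vector readout (a sketch with illustrative values of E, a, and σ; angles are in radians on a 90-degree map):

```python
import numpy as np

# Monte Carlo check of equation 6.4 for the population-vector readout.
# E, a, sigma, r are illustrative values, not the paper's.
rng = np.random.default_rng(1)
E, a, sigma, r = 320, np.pi / 4, 0.2, np.pi / 4
delta = (np.pi / 2) * np.arange(E) / (E - 1)

def readout(m):
    """Population-vector angle arctan(v2/v1), v = sum_x m_x (cos delta_x, sin delta_x)."""
    return np.arctan2(np.sum(m * np.sin(delta)), np.sum(m * np.cos(delta)))

d = delta - r
f = np.where(np.abs(d) <= a / 2, np.cos(np.pi / a * d), 0.0)   # tuning profile f_x(r)
est = np.array([readout(f + rng.normal(0.0, sigma, E)) for _ in range(4000)])

S_mc = est.std()                                 # Monte Carlo standard deviation
S_theory = sigma * (np.pi ** 2 - a ** 2) / (4 * a * np.cos(a / 2)) \
    * np.sqrt((np.pi - 2) / (2 * np.pi * E))     # equation 6.4, small-fluctuation limit
```

For these values, S_mc and S_theory agree to within a few percent (both near 0.015 radians), and the mean of the estimates stays at r, consistent with the unbiasedness of the population vector.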
This standard deviation is shown as a fine dashed line in Figure 4c. It tends to diverge for narrow stimuli but approaches the optimal standard deviation as the stimulus width approaches 90 degrees (at this angle, there is a mathematical equivalence between the population vector and the maximum likelihood estimates). We have found that for any stimulus width, there is a number of recruited pointers N_+ for which the standard deviation of the readout is surprisingly close to the Cramér–Rao bound (see Figure 4c). For small N_+ (dash-dotted line), the readout has a large standard deviation for narrow stimuli, which decreases as stimuli become broader. This behavior is similar to that of the population vector and confirms the intuitive notion that when feedback is weak, pointers are nothing but population vectors. As more pointers are recruited (dashed line), the standard deviation is nonmonotonic and has a minimum at a stimulus width of about a = 45°. Finally, when many pointers are recruited (full line), the minimal standard deviation is achieved for even narrower stimuli, at about a = 25°. In Figure 4d, the full line corresponds to the best stimulus width a_best(N_+) as a function of the number of recruited pointers (a_best is the width at which the standard deviation of the readout is smallest). Again, strong recruitment
is better for narrow stimuli, and weak recruitment is better for broad stimuli. As can be seen, a_best(N_+) is not very sensitive to the variance of the noise. Furthermore, its dependence on N_+ is similar to that of the response width w to a uniform input without noise, shown by the dashed line (see also Figure 3a). w is larger than a_best by about 30 degrees but decreases in a similar way. Hence, the strength of feedback (N_+) is optimal when its implicit response width (defined by uniform input) is slightly larger than the stimulus width to be encoded. To summarize, Figure 4 makes the point that attentional recruitment (based on prior information about stimulus width) can yield a substantial improvement in signal estimation in comparison to locally recurrent networks (corresponding to the case where the number of recruited pointers is fixed). Increased recruitment is needed for small stimuli. Also, increased recruitment is needed for very noisy environments. However, optimal recruitment depends less sensitively on noise variance than on stimulus size. We expect that a similar improvement also holds for two-dimensional (2D) spatial receptive fields (however, unlike in the one-dimensional case, narrow 2D stimuli are equally discriminable as broad 2D stimuli; there, the Cramér–Rao bound is independent of stimulus width; Zhang & Sejnowski, 1999).

7 Discussion
Recruitment of neurons and their feedback loops has been studied previously in a different context. In a model of the oculomotor integrator (Seung, Lee, Reis, & Tank, 2000), recruitment was postulated to maintain a precise tuning of feedback amplification, compensating for saturation nonlinearity. Here, recruitment is postulated in the context of inter-areal networks to accentuate multistability. By increasing both the excitatory and inhibitory gain of inter-areal feedback, persistent attentional selection of low-contrast and nearby stimuli becomes possible. In agreement with experiments, we have found a distractor-dependent limit on the contrast of a stimulus, below which attentional selection is impossible (see Figure 2b). Our results predict that besides the contrast dependence of attentional selection, there should be an additional dependence on stimulus separation (see Figure 2a), for which there is only limited experimental evidence so far (De Weerd et al., 1999). In most electrophysiological experiments, selection of a stimulus enhances visual responsiveness at attended locations (only a few experiments have shown suppressed responses at attended locations; Motter, 1993). The results in Figure 2 suggest that this response enhancement is due to the general dominance of excitatory inter-areal feedback gain over the inhibitory gain. If inter-areal circuits were wired to reduce responses at attended locations, then the selectability of low-contrast stimuli would be impaired in comparison to circuits wired to enhance responses at attended locations.
Recruitment of Inter-Areal Recurrent Networks
1683
Thus, in order to be able to attend to low-contrast stimuli, attentional effects should be enhancing rather than suppressing. Support for a dominance of excitatory gain over inhibitory gain comes from anatomical studies (Johnson & Burkhalter, 1997) and a more recent finding that after cooling of area MT, 33% of the neurons in area V2 showed a significant decrease in response to visual stimulation, whereas only 6% showed an increase (Hupé et al., 1998). We have shown that attentional recruitment can give cortical processing the ability to adjust signal processing to achieve near-optimal noise reduction for a broad range of stimulus sizes (see Figure 4). This ability generalizes previous results, where recurrent connections have been shown to be near-optimal for only a limited range of stimulus sizes (Deneve, Latham, & Pouget, 1999). However, we point out that the question remains how cortex would be able to recruit the appropriate amount of feedback, given some noisy environment and stimulus to be expected. We do not provide a solution for this problem, but we imagine that the appropriate computation, involving the use of some prior information to deduce the recruitment level, is done in prefrontal cortex. Our results on noise reduction are comparable to psychophysical studies where attention has been suggested to activate WTA competition between visual filters (Lee, Itti, Koch, & Braun, 1999). Lee et al. have fitted their psychophysical data from orientation discrimination tasks to a model in which there is a divisive normalization between simple cortical visual filters. They found best agreement between model and data when the effect of attention was to change the exponents of the divisive normalization between filters rather than any other parameter of the filter interactions. Because the exponents determine the strength of competition between cortical filters, their result is consistent with our finding that attentional recruitment has the effect of hardening WTA competition.
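The effect of the normalization exponent on competition can be illustrated with a small sketch. The functional form and all parameter values below are a common textbook variant of divisive normalization, chosen for illustration only, not necessarily the exact model fitted by Lee et al.:

```python
# Divisive normalization between filter responses: raising the exponent gamma
# sharpens the competition toward winner-take-all behavior.
def normalize(activities, gamma, sigma=1.0):
    powered = [a ** gamma for a in activities]
    denom = sigma ** gamma + sum(powered)
    return [p / denom for p in powered]

acts = [1.0, 0.8, 0.6, 0.4]          # hypothetical filter activities
soft = normalize(acts, gamma=2.0)    # weak competition
hard = normalize(acts, gamma=8.0)    # strong competition

# fraction of the total response captured by the strongest filter
share = lambda r: max(r) / sum(r)
print(share(soft), share(hard))
```

With the small exponent, the strongest filter takes under half of the total response; with the large exponent, it takes most of it, which is the "hardening" of competition referred to above.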
Appendix A: Parameter Selection
In the model equations 2.1 to 2.3, there are five coupling constants: a_F, a_B, b, a_I, and b_I. Here we show how to choose these constants as a function of the feedback gain they produce. In the following, excitatory gain means the loop gain of the excitatory feedback to the map, via the neurons P_i, and inhibitory gain means the loop gain of the effective inhibitory feedback to the map, via the neurons I. The total gain of feedback is the sum of these two loop gains. The relative gain of feedback is important for stability, because in our network, the eigenvectors of the positive and negative feedback loops corresponding to the largest and smallest eigenvalues, respectively, are similar to each other (the excitatory connections are broad, comparable to global inhibitory connections). First, choose the strength of localizing inhibition b_I according to the number n_I of inhibitory neurons to be active at a steady state (it can be shown
that n_I is independent of the visual input to the map). Assume that the different pointers P_i get the same attentional input and are thus parallel to each other, forming a single "effective pointer." Denote the common steady pointer angle by c = arctan(P_i2 / P_i1). In this case, the inhibitory neurons I_x receive a feedforward input profile a_I P cos(c − y_x), where P = Σ_i ||P_i|| is the length of the effective pointer. n_I is determined by the steady state of equation 2.3 (e.g., a symmetric profile centered at c). The cut-off relation I_z = 0 determines the border y_z of this profile. In terms of the angular separation ŷ = π / (2(I − 1)) between inhibitory neurons, c − y_z = n_I ŷ / 2, and we get

a_I P cos(n_I ŷ / 2) = b_I S,  (A.1)
where S = Σ_x I_x is the total activity of the inhibitory neurons (angles are now measured in radians). In analogy to equation B.2, by integrating the steady state of equation 2.3 over x in the large N (continuum) limit, we can calculate n_I to third-order approximation, using the cut-off relation, equation A.1:

n_I = 2 (3 / (2 b_I ŷ^2))^(1/3).  (A.2)
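As a numerical check on the reconstructed equation A.2, the sketch below plugs in the values I = 32 and b_I = 60 used in the text and recovers roughly four active inhibitory neurons:

```python
import math

I_neurons = 32   # size of the inhibitory map (value from the text)
b_I = 60.0       # strength of localizing inhibition (value from the text)

# angular separation between inhibitory neurons
y_hat = math.pi / (2 * (I_neurons - 1))

# equation A.2: number of active inhibitory neurons at steady state
n_I = 2.0 * (3.0 / (2.0 * b_I * y_hat ** 2)) ** (1 / 3)

print(round(n_I, 2))   # close to 4, as stated in the text
```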
Notice that the width n_I depends only on the strength of recurrent inhibition b_I and not on a_I or the length P of the effective pointer. For an inhibitory map of I = 32 neurons, if we choose b_I = 60 in equation A.2, we obtain n_I = 4 active inhibitory neurons at a steady state. If smaller values for b_I are chosen, then the inhibitory activity profile tends to be broader. But a broad inhibitory activity profile is not very desirable for our simulations, since it can result in unwanted boundary effects. The values of the other parameters are determined by separating the inhibitory from the excitatory feedback loops. First, choose some values for a_F and a_B such that their product is small. In this case, the excitatory feedback loop mediated by a single pointer is weak (having weak pointers means that the increments in feedback strength that arise by recruiting pointers are small and can be precisely controlled). The activities of excitatory neurons in the lower and the higher area are typically of equal magnitude if a_F is smaller than a_B. By choosing an inversely proportional connection strength from pointers onto inhibitory neurons, a_I = 1 / a_F, the amplification from map onto pointers cancels the reduction from pointers onto inhibitory neurons, just as if the map fed onto the inhibitory neurons directly and with unitary weights. Because both excitatory and inhibitory feedback are mediated by pointers, it is convenient to express their gains in terms of the length P of the effective pointer. In this view, the excitatory gain G_E onto the map is

G_E = a_B P.  (A.3)
The inhibitory gain G_I onto the map is G_I = −b S. Using S from equation A.1, we find

G_I = −(a_I b / b_I) P cos(n_I ŷ / 2).  (A.4)
In our simulation, we have chosen b such that the sum of the excitatory and inhibitory gains is equal to zero, G_E + G_I = 0, from which we get

b = a_F a_B b_I / cos(n_I ŷ / 2).  (A.5)
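The gain balance implied by equations A.3 to A.5 can be verified directly: with b chosen as in equation A.5 and a_I = 1 / a_F, the two loop gains cancel for any pointer length P. The coupling values below are hypothetical placeholders, not the paper's simulation settings:

```python
import math

# Hypothetical coupling constants; only the relations from appendix A matter.
a_F, a_B = 0.1, 0.2      # feedforward/feedback strengths (product kept small)
b_I, I_n = 60.0, 32      # localizing inhibition and inhibitory map size
a_I = 1.0 / a_F          # pointer-to-inhibition weight, as in appendix A

y_hat = math.pi / (2 * (I_n - 1))
n_I = 2.0 * (3.0 / (2.0 * b_I * y_hat ** 2)) ** (1 / 3)   # equation A.2
b = a_F * a_B * b_I / math.cos(n_I * y_hat / 2)           # equation A.5

P = 3.7                  # arbitrary effective-pointer length
G_E = a_B * P                                              # equation A.3
G_I = -a_I * b * P * math.cos(n_I * y_hat / 2) / b_I       # equation A.4

print(G_E + G_I)   # zero by construction of b
```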
In order to obtain numerical values for b, we substitute n_I from equation A.2. As expected, with this choice of parameters, the gain of cortical amplification is about unity in Figures 2 to 4 and is independent of the number of pointers recruited. Notice that the previous calculation of feedback gain is valid only for large pointer-map networks, because it is based on a continuum approximation. Nevertheless, these parameters yield stable attractor dynamics for all inputs. But the network is very close to the limit beyond which stability breaks down. For example, for the parameter settings of Figure 3, decreasing b in equation A.5 by only 2% results in unstable dynamics with unbounded amplification. Hence, although we want the excitatory gain to be at least as large as the inhibitory gain, we are limited in the amount by which they can differ from each other. This result is comparable to a previous calculation, where inhibition was instantaneous and we were able to construct a Lyapunov function if the excitatory gain was not larger than the inhibitory gain by more than 1/E (Hahnloser et al., 1999). The fact that the recurrent inhibition in equation 2.2 is not instantaneous but mediated by separate inhibitory neurons can sometimes lead to oscillations (Li & Dayan, 1999). However, it does not lead to oscillations if the time constant of equation 2.3 is small (paradoxically, the time constant is small if b_I is large, that is, if only a small number of inhibitory neurons are active).

Appendix B: Soft Winner-Take-All
As an approximation to the limit in which the number of neurons E in the lower area becomes infinite, the steady state of equations 2.2 and 2.3 can be transformed into the second-order differential equation M″_x = −M_x + c (where ′ denotes differentiation with respect to x). For uniform input, the solution is a cosine-shaped profile that can be centered at an (almost) arbitrary angle Â: M_x = H cos(Â − d_x) + c, where Â − w/2 ≤ d_x ≤ Â + w/2. The amplitude H, the width w, and the offset c of the profile are unknowns that can be inferred from M_{Â ± w/2} = 0, M(Â) = H, and ∫_{Â−w/2}^{Â+w/2} M_x dx = 2H sin(w/2) + cw. This leads to

w − sin w = π / (N_C a_F a_B (E − 1)),  (B.1)
where N_C is the number of recruited pointers. To third-order approximation in w, we find

w ≈ (6π / (N_C a_F a_B (E − 1)))^(1/3).  (B.2)
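A quick way to gauge the quality of the third-order approximation is to solve equation B.1 numerically (bisection works, since w − sin w is increasing on [0, π]) and compare with equation B.2. The parameter values below are hypothetical:

```python
import math

def solve_width(rhs, lo=1e-6, hi=math.pi):
    """Bisection on g(w) = w - sin(w) - rhs; g is increasing on [0, pi]."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mid - math.sin(mid) < rhs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical parameter values for illustration.
N_C, a_F, a_B, E = 10, 0.5, 0.5, 64
rhs = math.pi / (N_C * a_F * a_B * (E - 1))

w_exact = solve_width(rhs)                                      # equation B.1
w_approx = (6 * math.pi / (N_C * a_F * a_B * (E - 1))) ** (1/3) # equation B.2

print(w_exact, w_approx)
```

For small widths the two values agree to within a fraction of a percent, consistent with the match reported in Figure 3b.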
The excellent match of equation B.2 with the simulation results can be seen in Figure 3b. Surprisingly, based on equation B.1, the width w does not depend on the strength of inhibition given by b and b_I, nor does it depend on the number I of inhibitory neurons. In fact, the width depends only on the strengths of excitation a_F and a_B. Increasing the strength of inhibition does not change the width of the profile, only its amplitude H. The sharpening of activity with an increasing number of recruited pointers can be understood in terms of the mathematical principle of forbidden sets (Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). This principle was derived for symmetric linear threshold networks. The principle says that some sets of map neurons cannot be simultaneously active at a steady state, because their connectivity expresses "forbidden" or unstable differential modes (differential modes are eigenvectors with both negative and positive components). Because all unstable modes are differential, at least one map neuron will eventually fall below the rectification nonlinearity, and so the largest eigenvalue of the feedback will decrease. In this way, map neurons are progressively inactivated by the network dynamics. The process halts when the largest eigenvalue becomes smaller than one, where stability is achieved and a stable activity pattern can be formed (the set of active map neurons becomes "permitted"). By recognizing that the number N_C of recruited pointers has a multiplicative influence on the eigenvalues of inter-areal feedback (recruiting twice as many pointers results in a doubling of feedback gain), we arrive at a simple understanding of what causes the WTA mechanism. The more pointers are recruited, the more map neurons have to be inactivated by the network dynamics in order for the active neurons to form a permitted set.

Appendix C: Hard Winner-Take-All
If the number of attentionally recruited neurons in the higher area increases progressively and all of these neurons participate in similar feedback loops with neurons in the lower area, then at a certain point, the feedback will be so strong that only a single neuron in the lower area can be active at a steady state. Under these conditions, the network implements a hard winner-take-all mechanism. Here we calculate exactly how many pointers are required to achieve this regime. Assume that the parameters are selected according to appendix A. We study the steady states of equations 2.1 to 2.3 to establish that no neuron other than neuron M_s can be active at a steady state (the choice of s is arbitrary). In other words, denoting steady states by underlining,
we require that M_{s′} = 0 for all s′ ≠ s and for all stationary inputs m_x. We proceed by assuming that there are N_C recruited pointers and that the neurons are at a steady state in which only a single neuron M_s is active in the lower area. In this case, the length of the effective pointer is P = N_C a_I M_s. Using this expression with the relationships between parameters given in appendix A, in particular equation A.1, we find that
M_s = m_s.  (C.1)
There is an amplification gain of exactly one (this intermediate result is surprisingly consistent with the continuum assumption made in appendix A). In order for M_s to be the only activated neuron, the neurons M_{s±1} adjacent to M_s should not be activated, even in the most extreme case where their input is equally large, m_{s+1} = m_s (notice that if m_{s+1} were larger than m_s, then we might as well assume that M_{s+1} is the single active neuron at the steady state, which leads us back to the same argument). Thus, neurons that are not nearest neighbors of M_s do not need to be considered; they have a smaller probability of being activated than nearest neighbors, because the excitatory feedback loops via pointers decay with distance in the lower area (Hahnloser et al., 1999). A simple calculation of the steady state M_{s+1} yields

M_{s+1} = m_{s+1} + N_C a_F a_B M_s (cos(d̂) − 1),  (C.2)
where d̂ = π / (2(E − 1)). We determine the number of recruited pointers N_C^hard beyond which the WTA is hard by using the constraint M_{s+1} = 0. We find that for

N_C ≥ N_C^hard = 1 / (a_F a_B (1 − cos(d̂))) ≈ (E − 1)^2 / (a_F a_B),  (C.3)
no neuron other than M_s is active at a steady state. We see that the number N_C^hard of neurons in the higher area to be recruited increases quadratically with the number of neurons in the lower area, which suggests that this hard WTA limit is an interesting computational limit rather than a possible tool that can be used by inter-areal circuits as studied here.

Acknowledgments
We acknowledge comments on the manuscript by Martin Giese and the support of the Swiss National Science Foundation and the Körber Foundation.
References

Colby, C., Duhamel, J.-R., & Goldberg, M. (1996). Visual, presaccadic, and cognitive activation of single neurons in monkey lateral intraparietal area. J. Neurophysiol., 76(5), 2841–2852.
Cover, T., & Thomas, A. (1991). Information theory. New York: Wiley.
Deneve, S., Latham, P., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2(8), 740–745.
Desimone, R. (1996). Neural mechanisms for visual memory and their role in attention. Proc. Natl. Acad. Sci. USA, 93(24), 13494–13499.
Desimone, R. (1998). Visual attention mediated by biased competition in extrastriate visual cortex. Philosophical Transactions of the Royal Society (London), B Biological Sciences, 353, 1245–1255.
De Weerd, P., Peralta, M., Desimone, R., & Ungerleider, L. (1999). Loss of attentional stimulus selection after extrastriate cortical lesions in macaques. Nature Neuroscience, 2(8), 753–758.
Fuster, J. (1990). Inferotemporal units in selective visual attention and short-term memory. J. Neurophysiol., 64(3), 681–697.
Hahnloser, R. H., Douglas, R. J., Mahowald, M., & Hepp, K. (1999). Feedback interactions between neuronal pointers and maps for attentional processing. Nature Neuroscience, 2(8), 746–752.
Hahnloser, R. H., Sarpeshkar, R., Mahowald, M., Douglas, R. J., & Seung, S. (2000). Digital selection and analog amplification coexist in a silicon circuit inspired by cortex. Nature, 405, 947–951.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Hupé, J., James, A., Payne, B., Lomber, S., Girard, P., & Bullier, J. (1998). Cortical feedback improves discrimination between figure and background by V1, V2 and V3. Nature, 394, 784–787.
Johnson, R., & Burkhalter, A. (1997). A polysynaptic feedback circuit in rat visual cortex. J. Neurosci., 17(18), 7129–7140.
Lee, D., Itti, L., Koch, C., & Braun, J. (1999). Attention activates winner-take-all competition among visual filters. Nature Neuroscience, 2(4), 375–381.
Li, Z., & Dayan, P. (1999). Computational differences between asymmetrical and symmetrical networks. Network, 10, 59–77.
Logothetis, N., Pauls, J., Bulthoff, H., & Poggio, T. (1994). View dependent object recognition by monkeys. Curr. Biol., 4(5), 401–414.
Lu, Z.-L., & Dosher, B. (1998). External noise distinguishes attention mechanisms. Vision Research, 38(9), 1183–1198.
Luck, S. J., Chelazzi, L., Hillyard, S., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J. Neurophysiol., 77(1), 24–42.
McAdams, C. J., & Maunsell, J. H. (1999). Effects of attention on the reliability of individual neurons in monkey visual cortex. Neuron, 23, 765–773.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.
Motter, B. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. J. Neurophysiol., 70(3), 909–919.
Motter, B. (1994). Neural correlates of feature selective memory and pop-out in extrastriate area V4. J. Neurosci., 14(4), 2190–2199.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Reynolds, J., Pasternak, T., & Desimone, R. (2000). Attention increases sensitivity of V4 neurons. Neuron, 26, 703–714.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. Journal of Neuroscience, 19(5), 1736–1753.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Seung, H. S., Lee, D. D., Reis, B. Y., & Tank, D. D. (2000). Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 26, 259–271.
Seung, H., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90, 10749–10753.
Spitzer, H., Desimone, R., & Moran, J. (1988). Increased attention enhances both behavioral and neuronal performance. Science, 240, 338–340.
Tomita, H., Ohbayashi, M., Nakahara, K., & Miyashita, Y. (1999). Top-down signal from prefrontal cortex in executive control of memory retrieval. Nature, 401, 699–703.
Treue, S., & Maunsell, J. (1996). Attentional modulation of visual motion processing in cortical areas MT and MST. Nature, 382, 539–541.
Treue, S., & Maunsell, J. (1999). Effects of attention on the processing of motion in macaque middle temporal and medial superior temporal visual areas. Journal of Neuroscience, 19(17), 7591–7602.
Wersing, H., Beyn, W.-J., & Ritter, H. (2001). Dynamical stability conditions for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 13, 1811–1825.
Zhang, K., & Sejnowski, T. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.

Received November 30, 2000; accepted November 20, 2001.
LETTER
Communicated by Jonathan Yedidia
CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies: Convergent Alternatives to Belief Propagation

A. L. Yuille
[email protected]
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.

This article introduces a class of discrete iterative algorithms that are provably convergent alternatives to belief propagation (BP) and generalized belief propagation (GBP). Our work builds on recent results by Yedidia, Freeman, and Weiss (2000), who showed that the fixed points of BP and GBP algorithms correspond to extrema of the Bethe and Kikuchi free energies, respectively. We obtain two algorithms by applying CCCP to the Bethe and Kikuchi free energies, respectively (CCCP is a procedure, introduced here, for obtaining discrete iterative algorithms by decomposing a cost function into a concave and a convex part). We implement our CCCP algorithms on two- and three-dimensional spin glasses and compare their results to BP and GBP. Our simulations show that the CCCP algorithms are stable and converge very quickly (the speed of CCCP is similar to that of BP and GBP). Unlike CCCP, BP will often not converge for these problems (GBP usually, but not always, converges). The results found by CCCP applied to the Bethe or Kikuchi free energies are equivalent to, or slightly better than, those found by BP or GBP, respectively (when BP and GBP converge). Note that for these and other problems, BP and GBP give very accurate results (see Yedidia et al., 2000), and failure to converge is their major error mode. Finally, we point out that our algorithms have a large range of inference and learning applications.

1 Introduction
Recent work by Yedidia, Freeman, and Weiss (2000) unified two approaches to statistical inference. They related the highly successful belief propagation (BP) algorithms (Pearl, 1988) to variational methods from statistical physics and, in particular, to the Bethe and Kikuchi free energies (Domb & Green, 1972). These BP algorithms typically give highly accurate results (Yedidia et al., 2000). But BP algorithms do not always converge, and, indeed, failure to converge is their major error mode. This article develops new algorithms that are guaranteed to converge to extrema of the Bethe or Kikuchi free energies and hence are alternatives to belief propagation. Belief propagation (Pearl, 1988) is equivalent to the sum-product algorithm developed by Gallager (1963) to decode low-density parity check codes

Neural Computation 14, 1691–1722 (2002) © 2002 Massachusetts Institute of Technology
(LDPC). In recent years (see Forney, 2001, for a review), the coding community has shown great interest in sum-product algorithms and LDPCs. It is predicted that this combination will enable the coding community to design practical codes that approach the Shannon performance limit (Cover & Thomas, 1991) while requiring only limited computation. In particular, it has been shown that the highly successful turbo codes (Berrou, Glavieux, & Thitimajshima, 1993) can also be interpreted in terms of BP (McEliece, Mackay, & Cheng, 1998). Although BP has been proven to converge only for tree-like graphical models (Pearl, 1988), it has been amazingly successful when applied to inference problems with closed loops (Freeman & Pasztor, 1999; Frey, 1998; Murphy, Weiss, & Jordan, 1999), including these particularly important coding applications (Forney, 2001). When BP converges, it (empirically) usually seems to converge to a good approximation to the true answer. Statistical physics has long been a fruitful source of ideas for statistical inference (Hertz, Krogh, & Palmer, 1991). The mean-field approximation, which can be formulated as minimizing a (factorized) mean-field free energy, has been used to motivate algorithms for optimization and learning (see Arbib, 1995). The Bethe and Kikuchi free energies (Domb & Green, 1972) contain higher-order terms than the (factorized) mean-field free energies commonly used. It is therefore hoped that algorithms that minimize the Bethe or Kikuchi free energies will outperform standard mean-field theory and be useful for optimization and learning applications. Overall, there is a hierarchy of variational approximations in statistical physics that starts with mean field, proceeds to Bethe, and finishes with Kikuchi (which subsumes Bethe as a special case). Yedidia et al.'s result (2000) proved that the fixed points of BP correspond to the extrema of the Bethe free energy.
They also developed a generalized belief propagation (GBP) algorithm whose fixed points correspond to the extrema of the Kikuchi free energy. In practice, when BP and GBP converge, they go to low-energy minima of the Bethe and Kikuchi free energies. In general, we expect GBP to give more accurate results than BP, since the Kikuchi approximation is better than the Bethe approximation. Indeed, empirically, GBP outperformed BP on two-dimensional (2D) spin glasses and converged very close to the true solution (Yedidia et al., 2000). But these results say little about the failure of BP (or GBP) to converge, which is the dominant error mode of these algorithms; if BP or GBP converges, then it typically converges to a close approximation to the true solution. This motivates the search for other algorithms to minimize the Bethe and Kikuchi free energies. This article develops new algorithms that are guaranteed to converge to extrema of the Bethe and Kikuchi free energies. (In computer simulations so far, they have always converged to minima.) The algorithms have some similarities to BP and GBP algorithms, which estimate "beliefs" by propagating "messages." The new algorithms also propagate messages (formally Lagrange multipliers), but unlike BP and GBP, the propagation depends
on current estimates of the beliefs, which must be reestimated periodically. This similarity may help explain when BP and GBP converge. But in any case, these new algorithms offer alternatives to BP and GBP that may be of particular use in regimes where BP and GBP do not converge. The algorithms are developed using a concave-convex procedure (CCCP) that starts by decomposing the free energy into concave and convex parts. From this decomposition, it is possible to obtain a discrete update rule that decreases the free energy at each iteration step. This procedure is very general and is developed further by Yuille and Rangarajan (2001) with applications to many other optimization problems. It builds on results developed when studying mean-field theory (Kosowsky & Yuille, 1994; Yuille & Kosowsky, 1994; Rangarajan, Gold, & Mjolsness, 1996; Rangarajan, Yuille, Gold, & Mjolsness, 1996). The algorithms were implemented and tested on 2D and 3D spin-glass energy functions. These contain many closed loops (so BP is expected to have difficulties; Yedidia et al., 2000). Our results show that our algorithms are stable, converge very rapidly, and give equivalent or better results than BP and GBP (BP often fails to converge for these problems). We note that there has recently been a range of new algorithms that act as either variants or improvements of BP (this article was originally submitted before we learned of these alternative algorithms). These include Teh and Welling (2001), Wainwright, Jaakkola, and Willsky (2001), and Chiang and Forney (2001). Of these, the algorithm by Teh and Welling seems most similar to the CCCP algorithms. Their algorithm is also guaranteed to converge to an extremum of the Bethe free energy. Comparative simulations of the two algorithms have been performed (Teh and Welling, private communication, May 2001) and indicate that the differences between the two algorithms lie mainly in the convergence rates rather than in the quality of the solutions obtained.
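The decomposition idea behind CCCP can be illustrated on a one-dimensional toy problem (this is only an illustration of the concave-convex split, not the Bethe/Kikuchi algorithm of this article): minimize f(x) = x^4 − 2x^2 by writing it as the convex part x^4 plus the concave part −2x^2 and, at each step, minimizing the convex part plus a linearization of the concave part:

```python
# Toy CCCP: minimize f(x) = x^4 - 2x^2 via the split convex(x) = x^4,
# concave(x) = -2x^2.
def f(x):
    return x ** 4 - 2 * x ** 2

x = 0.2
history = [f(x)]
for _ in range(50):
    # Linearize the concave part at the current point and minimize the
    # convex surrogate: d/dx [x^4 - 4*x_t*x] = 0  =>  x = x_t**(1/3)
    x = x ** (1 / 3.0)
    history.append(f(x))

print(x, f(x))   # approaches the global minimum f(1) = -1
```

The surrogate touches f at the current point and upper-bounds it elsewhere, so f can never increase from one step to the next; this monotonicity is the same guarantee that the CCCP free-energy algorithms rely on.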
But these results are preliminary. The relative advantages of these algorithms are work for the future. The structure of this article is as follows. Section 2 describes the Bethe free energy and BP algorithms. In section 3, we describe the design principle, CCCP, of our algorithm. Section 4 applies CCCP to the Bethe free energy and gives a double-loop algorithm that is guaranteed to converge (we also discuss formal similarities to BP). In section 5, we apply CCCP to the Kikuchi free energy and obtain a provably convergent algorithm. Section 6 applies our CCCP algorithm to 2D and 3D spin-glass energy models using either the Bethe or Kikuchi approximation and compares their performance to BP and GBP. Finally, we discuss related issues in section 7.

2 The Bethe Free Energy and the BP Algorithms
The Bethe free energy (Domb & Green, 1972; Yedidia et al., 2000) is a variational technique from statistical physics. The idea is to replace an inference problem we cannot solve by an approximation that is solvable. For example, we may want to estimate the variables x*_1, . . . , x*_N, which are the
Figure 1: Grid for a 2D spin glass. Each node a, b, c, d represents a binary-valued spin variable. Our convention is that a is site i; b, c, d are the sites i + Δh, i + Δv, i + 2Δh.
most probable states of a distribution P(x_1, . . . , x_N | y). This, however, may be computationally expensive (e.g., NP-complete). Instead, we can apply variational methods, where we seek an approximate solution by minimizing a free energy function to obtain a mean-field-theory solution (see Arbib, 1995). The Bethe free energy is an approximation that uses joint distributions between variables. It can give exact solutions in situations where standard mean-field theory gives only poor solutions (Weiss, 2001). As we will discuss later, the Kikuchi free energy gives higher-order approximations (see the discussion at the end of this section). Consider a graph with nodes i = 1, . . . , N. The problem specification will determine connections between the nodes. We will list connections ij only for node pairs i, j that are connected. (Variables, such as y_ij, b_ij, which depend on two nodes, will be defined only for pairs of nodes that are connected in the graph.) The state of a node is denoted by x_i (each x_i has M possible states). Each unobserved node is connected to an observed node y_i. The joint probability function is given by

P(x_1, . . . , x_N | y) = (1/Z) ∏_{i,j: i>j} y_ij(x_i, x_j) ∏_i y_i(x_i, y_i),  (2.1)
where y_i(x_i, y_i) is the local "evidence" for node i, Z is a normalization constant, and y_ij(x_i, x_j) is the compatibility matrix between nodes i and j. We use the convention i > j to avoid double counting. To simplify the notation, we write y_i(x_i) as shorthand for y_i(x_i, y_i). (Recall that if nodes i and j are not connected, then we do not have a term y_ij.) For example, in the 2D spin-glass network (see Figure 1), the variable i labels nodes on a 2D grid (see section 6). It is convenient to represent these nodes by vector labels i, with the first and second components corresponding to the positions in the 2D lattice, and with Δh, Δv corresponding to shifts between lattice sites in the horizontal and vertical directions. The 2D spin glass involves nearest-neighbor interactions only. So we have potentials y_i(x_i) at each lattice site and potentials y_{i,i+Δh}(x_i, x_{i+Δh}) and y_{i,i+Δv}(x_i, x_{i+Δv}) linking neighboring lattice sites.

Figure 2: The quantities a, c, b, d flowing into the central pixel correspond to Lagrange multipliers or messages for CCCP or BP, respectively. They correspond to either l_{i±Δv,i}(x_i), l_{i±Δh,i}(x_i) or m_{i±Δh,i}(x_i), m_{i±Δv,i}(x_i).

The goal is to determine an estimate {b_i(x_i)} of the marginal distributions {P(x_i | y)}. It is convenient also to make estimates {b_ij(x_i, x_j)} of the joint distributions {P(x_i, x_j | y)} of nodes i, j that are connected in the graph. (Again we use the convention i > j to avoid double counting.) The BP algorithm introduces variables m_ij(x_j), which correspond to "messages" that node i sends to node j (later we will see how these messages correspond to Lagrange multipliers, which impose consistency constraints on the beliefs). The BP algorithm is given by:

m_ij(x_j; t+1) = c_ij Σ_{x_i} y_ij(x_i, x_j) y_i(x_i) ∏_{k≠j} m_ki(x_i; t),  (2.2)
where cij is a normalization constant (i.e., it is independent of xj ). Again, we have messages only between nodes that are connected. Nodes are not connected to themselves (i.e., mii (xi , t) D 1, 8i). For the 2D spin glass (see Figure 1), we can represent the messages (and later the l’s from CCCP) by the nodes into which they ow (see Figure 2). The messages determine additional variables bi ( xi ) , bij ( xi , xj ) corresponding to the approximate marginal probabilities at node i and the approximate joint probabilities at nodes i, j (with convention i > j). These are given in terms of the messages by: Y (2.3) bi ( xi I t) D cOi yi ( xi ) m ki (xi I t) , k
bij ( xi , xj I t) D cNij w ij ( xi , xj )
Y 6 j kD
m ki ( xi I t)
Y 6 i lD
mlj ( xj I t) ,
(2.4)
1696
A. L. Yuille
where w_{ij}(x_i, x_j) = \psi_{ij}(x_i, x_j)\, \psi_i(x_i)\, \psi_j(x_j) and \hat{c}_i, \bar{c}_{ij} are normalization constants. For a tree, the BP algorithm of equation 2.2 is guaranteed to converge, and the resulting \{b_i(x_i)\} will correspond to the posterior marginals (Pearl, 1988).

The Bethe free energy of this system is written as (Yedidia et al., 2000):

F_\beta(\{b_{ij}, b_i\}) = \sum_{i,j:\, i > j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \frac{b_{ij}(x_i, x_j)}{w_{ij}(x_i, x_j)} - \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)},   (2.5)
where n_i is the number of neighbors of node i. Because the \{b_i\} and \{b_{ij}\} correspond to marginal and joint probability distributions, they must satisfy linear consistency constraints:

\sum_{x_i} b_i(x_i) = 1, \forall i, \qquad \sum_{x_i, x_j} b_{ij}(x_i, x_j) = 1, \forall i, j:\, i > j,
\sum_{x_i} b_{ij}(x_i, x_j) = b_j(x_j), \forall j, x_j, \qquad \sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i), \forall i, x_i.   (2.6)
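To make equations 2.2 through 2.5 concrete, here is a minimal sketch in Python (the article's own experiments used numeric Python, but this toy setup, its names, and its potentials are our invention, not code from the article): BP is run on a three-node binary chain, and the Bethe free energy is evaluated at the resulting beliefs. On a tree, the converged beliefs equal the exact marginals, and the Bethe free energy at that point equals -log Z.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy model: a three-node binary chain with edges (1,0) and (2,1), convention i > j
psi = [rng.uniform(0.5, 1.5, 2) for _ in range(3)]          # unary potentials psi_i
psi_pair = {(1, 0): rng.uniform(0.5, 1.5, (2, 2)),          # psi_10(x_1, x_0)
            (2, 1): rng.uniform(0.5, 1.5, (2, 2))}          # psi_21(x_2, x_1)
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def pairpot(i, j):  # psi_ij(x_i, x_j) in either orientation
    return psi_pair[(i, j)] if (i, j) in psi_pair else psi_pair[(j, i)].T

m = {(i, j): np.ones(2) for i in range(3) for j in nbrs[i]}  # messages m_ij(x_j)

def prod_msgs(i, excl=None):  # product over k in N(i), k != excl, of m_ki(x_i)
    out = np.ones(2)
    for k in nbrs[i]:
        if k != excl:
            out = out * m[(k, i)]
    return out

for _ in range(20):  # equation 2.2, with the normalization constant c_ij
    new = {}
    for (i, j) in m:
        msg = pairpot(i, j).T @ (psi[i] * prod_msgs(i, excl=j))  # sum over x_i
        new[(i, j)] = msg / msg.sum()
    m = new

def belief(i):          # equation 2.3, normalized
    b = psi[i] * prod_msgs(i)
    return b / b.sum()

def belief_pair(i, j):  # equation 2.4, normalized (i > j)
    w = pairpot(i, j) * psi[i][:, None] * psi[j][None, :]
    b = w * prod_msgs(i, excl=j)[:, None] * prod_msgs(j, excl=i)[None, :]
    return b / b.sum()

def bethe_free_energy():  # equation 2.5
    f = 0.0
    for (i, j) in psi_pair:
        w = pairpot(i, j) * psi[i][:, None] * psi[j][None, :]
        bij = belief_pair(i, j)
        f += np.sum(bij * np.log(bij / w))
    for i in range(3):
        bi = belief(i)
        f -= (len(nbrs[i]) - 1) * np.sum(bi * np.log(bi / psi[i]))
    return f
```

Comparing the beliefs against brute-force marginals of the joint distribution confirms both the exactness of BP on this tree and the identity F_\beta = -\log Z at the fixed point.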
These constraints can be imposed by using Lagrange multipliers \{\gamma_{ij}: i > j\} and \{\lambda_{ij}(x_j): i \ne j\} and adding terms:

\sum_{ij:\, i > j} \gamma_{ij} \Big\{ \sum_{x_i, x_j} b_{ij}(x_i, x_j) - 1 \Big\}   (2.7)

+ \sum_{i,j:\, i > j} \sum_{x_j} \lambda_{ij}(x_j) \Big\{ \sum_{x_i} b_{ij}(x_i, x_j) - b_j(x_j) \Big\} + \sum_{i,j:\, i > j} \sum_{x_i} \lambda_{ji}(x_i) \Big\{ \sum_{x_j} b_{ij}(x_i, x_j) - b_i(x_i) \Big\}.   (2.8)
The Bethe free energy consists of two terms. The first is of the form of a Kullback-Leibler (K-L) divergence between \{b_{ij}\} and \{w_{ij}\} (but it is not actually a K-L divergence because \{w_{ij}\} is not a normalized distribution). The second is minus the form of the K-L divergence between \{b_i\} and \{\psi_i\} (again, \{\psi_i\} is not a normalized probability distribution). It follows that the first term is convex in \{b_{ij}\} and the second term is concave in \{b_i\}. This will be of importance for the derivation of our algorithm in section 3. Yedidia et al. (2000) proved that the fixed points of the BP algorithm correspond to extrema of the Bethe free energy (with the linear constraints
of equation 2.6). Their results can be obtained by differentiating the Bethe free energy and using the substitution m_{ji}(x_i) = e^{\lambda_{ji}(x_i) - \frac{1}{n_i - 1} \sum_k \lambda_{ki}(x_i)} and the inverse e^{-\lambda_{ij}(x_j)} = \prod_{k \ne i} m_{kj}(x_j).

Observe that the standard (factorized) mean-field free energy (see Arbib, 1995) can be obtained from the Bethe free energy by setting b_{ij}(x_i, x_j) = b_i(x_i)\, b_j(x_j), \forall i, j. This gives:

F_{mean-field} = -\sum_{i,j} \sum_{x_i, x_j} b_i(x_i)\, b_j(x_j) \log \psi_{ij}(x_i, x_j) - \sum_i \sum_{x_i} b_i(x_i) \log \psi_i(x_i) + \sum_i \sum_{x_i} b_i(x_i) \log b_i(x_i).   (2.9)

By comparing equation 2.5 to equation 2.9, we see that the Bethe free energy contains higher-order terms than the mean-field energy. It is therefore plausible that algorithms that minimize the Bethe free energy will perform better for optimization and learning than those that use the mean-field approximation.

In general, (factorized) mean field, Bethe, and Kikuchi give a hierarchy of approximations to the underlying distribution P(x_1, \ldots, x_N \mid y). They can be derived by minimizing the K-L divergence D(P_A \| P) = \sum_{x_1, \ldots, x_N} P_A(x_1, \ldots, x_N) \log \frac{P_A(x_1, \ldots, x_N)}{P(x_1, \ldots, x_N \mid y)} with respect to an approximating distribution P_A(x_1, \ldots, x_N). If the P_A are restricted to being factorized distributions, P_A(x_1, \ldots, x_N) = \prod_{i=1}^N b_i(x_i), then we obtain the mean-field free energy (see equation 2.9). We can obtain the Bethe and Kikuchi free energies by allowing P_A(.) to have different forms (though additional approximations are still required to get Bethe or Kikuchi unless the graph has no loops).

3 The Concave Convex Procedure
Our algorithms to minimize the Bethe and Kikuchi free energies are based on the concave convex procedure (CCCP) described in this section. This approach was developed (in this article) for the specific case of Bethe and Kikuchi but is far more general. Indeed, many existing discrete-time iterative dynamical systems can be interpreted in terms of CCCP (Yuille & Rangarajan, 2001). Our main results are given by theorems 1, 2, and 3 and show that we can obtain discrete iterative algorithms to minimize energy functions that are the sum of a convex and a concave term. These algorithms will typically, but not always (Yuille & Rangarajan, 2001), require an inner and an outer loop. We first consider the case where there are no constraints for the optimization. Then we generalize to the case where linear constraints are present.
Theorem 1. Consider an energy function E(\vec{z}) (bounded below) of the form E(\vec{z}) = E_{vex}(\vec{z}) + E_{cave}(\vec{z}), where E_{vex}(\vec{z}) and E_{cave}(\vec{z}) are convex and concave functions of \vec{z}, respectively. Then the discrete iterative algorithm \vec{z}^t \mapsto \vec{z}^{t+1} given by

\nabla E_{vex}(\vec{z}^{t+1}) = -\nabla E_{cave}(\vec{z}^t)   (3.1)

is guaranteed to decrease the energy E(\vec{z}) monotonically as a function of time and hence to converge to an extremum of E(\vec{z}).

Proof. The convexity and concavity of E_{vex}(.) and E_{cave}(.) mean that:

E_{vex}(\vec{z}_2) \ge E_{vex}(\vec{z}_1) + (\vec{z}_2 - \vec{z}_1) \cdot \nabla E_{vex}(\vec{z}_1),
E_{cave}(\vec{z}_4) \le E_{cave}(\vec{z}_3) + (\vec{z}_4 - \vec{z}_3) \cdot \nabla E_{cave}(\vec{z}_3),   (3.2)

for all \vec{z}_1, \vec{z}_2, \vec{z}_3, \vec{z}_4. Now set \vec{z}_1 = \vec{z}^{t+1}, \vec{z}_2 = \vec{z}^t, \vec{z}_3 = \vec{z}^t, \vec{z}_4 = \vec{z}^{t+1}. Using equation 3.2 and the algorithm definition (i.e., \nabla E_{vex}(\vec{z}^{t+1}) = -\nabla E_{cave}(\vec{z}^t)), we find that:

E_{vex}(\vec{z}^{t+1}) + E_{cave}(\vec{z}^{t+1}) \le E_{vex}(\vec{z}^t) + E_{cave}(\vec{z}^t),   (3.3)

which proves the claim.

Observe that the convexity of E_{vex}(\vec{z}) means that the function \nabla E_{vex}(\vec{z}^{t+1}) is invertible. In other words, the tangent to E_{vex} at \vec{z}^{t+1} determines \vec{z}^{t+1} uniquely. Theorem 1 generalizes previous results by Marcus, Waugh, and Westervelt (Marcus & Westervelt, 1989; Waugh & Westervelt, 1993) on the convergence of discrete iterated neural networks. (They discussed a special case where one function was convex and the other was implicitly, but not explicitly, concave.)

The algorithm can be illustrated geometrically by the reformulation shown in Figure 3. Think of decomposing the energy function E(\vec{z}) into E_1(\vec{z}) - E_2(\vec{z}), where both E_1(\vec{z}) and E_2(\vec{z}) are convex. (This is equivalent to decomposing E(\vec{z}) into a convex term E_1(\vec{z}) plus a concave term -E_2(\vec{z}).) The algorithm proceeds by matching points on the two terms that have the same tangents. For an input \vec{z}_0, we calculate the gradient \nabla E_2(\vec{z}_0) and find the point \vec{z}_1 such that \nabla E_1(\vec{z}_1) = \nabla E_2(\vec{z}_0). We next determine the point \vec{z}_2 such that \nabla E_1(\vec{z}_2) = \nabla E_2(\vec{z}_1), and repeat.

We can extend this result to allow for linear constraints on the variables \vec{z}. This can be given a geometrical intuition. First, properties such as convexity and concavity are preserved when linear constraints are imposed. Second, because the constraints are linear, they determine a hyperplane on which the constraints are satisfied. Theorem 1 can then be applied to the variables on this hyperplane.
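A one-dimensional toy illustration of theorem 1 (our own example, not from the article): take E(z) = z^4/4 - z^2/2 with E_{vex}(z) = z^4/4 and E_{cave}(z) = -z^2/2. Equation 3.1 then reads (z^{t+1})^3 = z^t, so each step is a cube root, and the energy decreases monotonically toward the minimizer at |z| = 1.

```python
def energy(z):
    # E(z) = E_vex(z) + E_cave(z) = z**4/4 - z**2/2
    return z ** 4 / 4.0 - z ** 2 / 2.0

def cccp_step(z):
    # equation 3.1: grad E_vex(z_new) = -grad E_cave(z_old)  =>  z_new**3 = z_old
    return z ** (1.0 / 3.0) if z >= 0 else -((-z) ** (1.0 / 3.0))

z = 0.1
energies = [energy(z)]
for _ in range(60):
    z = cccp_step(z)
    energies.append(energy(z))
# the iterates climb from 0.1 toward the minimizer z = 1, decreasing E at each step
```

Starting from a negative z would converge to the other minimizer at z = -1, as expected for an extremum-seeking (not global) method.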
Figure 3: A CCCP algorithm illustrated for convex minus convex. We want to minimize the function in the left panel. We decompose it (right panel) into a convex part (top curve) minus a convex term (bottom curve). The algorithm iterates by matching points on the two curves that have the same tangent vectors. See the text for more details. The algorithm rapidly converges to the solution at x = 5.0. (Figure concept courtesy of James M. Coughlan.)
Theorem 2. Consider a function E(\vec{z}) = E_{vex}(\vec{z}) + E_{cave}(\vec{z}) subject to k linear constraints \vec{w}_\mu \cdot \vec{z} = c_\mu, where \{c_\mu: \mu = 1, \ldots, k\} are constants. Then the algorithm \vec{z}^t \mapsto \vec{z}^{t+1} given by

\nabla E_{vex}(\vec{z}^{t+1}) = -\nabla E_{cave}(\vec{z}^t) - \sum_{\mu=1}^{k} \alpha_\mu \vec{w}_\mu,   (3.4)
where the parameters \{\alpha_\mu\} are chosen to ensure that \vec{z}^{t+1} \cdot \vec{w}_\mu = c_\mu for \mu = 1, \ldots, k, is guaranteed to decrease the energy E(\vec{z}) monotonically and hence to converge to an extremum of E(\vec{z}).

Proof. Intuitively, the update rule has to balance the gradients of E_{vex}, E_{cave} in the unconstrained directions of \vec{z}, and the term \sum_{\mu=1}^{k} \alpha_\mu \vec{w}_\mu is required to deal with the directions of the constraints. More formally, we define orthogonal unit vectors \{\hat{y}_\nu: \nu = 1, \ldots, n - k\}, which span the space orthogonal to the constraints \{\vec{w}_\mu: \mu = 1, \ldots, k\}. Let \vec{y}(\vec{z}) = \sum_{\nu=1}^{n-k} \hat{y}_\nu (\vec{z} \cdot \hat{y}_\nu) be the projection of \vec{z} onto this space. Define functions \hat{E}_{cave}(\vec{y}), \hat{E}_{vex}(\vec{y}) by:

\hat{E}_{cave}(\vec{y}(\vec{z})) = E_{cave}(\vec{z}), \qquad \hat{E}_{vex}(\vec{y}(\vec{z})) = E_{vex}(\vec{z}).   (3.5)

Then we can use the algorithm of theorem 1 on the unconstrained variables \vec{y} = (y_1, \ldots, y_{n-k}). By definition of \vec{y}(\vec{z}), we have \partial \vec{z} / \partial y_\nu = \hat{y}_\nu. Therefore, the algorithm reduces to:

\hat{y}_\nu \cdot \nabla_{\vec{z}} E_{vex}(\vec{z}^{t+1}) = -\hat{y}_\nu \cdot \nabla_{\vec{z}} E_{cave}(\vec{z}^t), \quad \nu = 1, \ldots, n - k.   (3.6)

This gives the result (recalling that \vec{w}_\mu \cdot \hat{y}_\nu = 0 for all \mu, \nu).

It follows from theorem 2 that we need to impose the constraints only on E_{vex}(\vec{z}^{t+1}) and not on E_{cave}(\vec{z}^t). In other words, we set

\bar{E}_{vex}(\vec{z}^{t+1}) = E_{vex}(\vec{z}^{t+1}) + \sum_\mu \alpha_\mu \{\vec{w}_\mu \cdot \vec{z}^{t+1} - c_\mu\},   (3.7)

and use update equations:

\frac{\partial \bar{E}_{vex}}{\partial \vec{z}}(\vec{z}^{t+1}) = -\frac{\partial E_{cave}}{\partial \vec{z}}(\vec{z}^t),   (3.8)

where the coefficients \{\alpha_\mu\} must be chosen to ensure that the constraints \vec{w}_\mu \cdot \vec{z}^{t+1} = c_\mu, \forall \mu are satisfied.

We need an additional result to determine how to solve for \vec{z}^{t+1}. Theorem 3 gives a procedure that will work for the Bethe and Kikuchi energy functions. (In general, solving for \vec{z}^{t+1} is, at worst, a convex minimization problem and can sometimes be done analytically; Yuille & Rangarajan, 2001.) We restrict ourselves to the specific case where E_{vex}(\vec{z}) = \sum_i z_i \log \frac{z_i}{f_i}. This form of E_{vex}(\vec{z}) will arise when we apply theorem 2 to the Bethe free energy (see section 4). We let \vec{h}(\vec{z}) = \nabla E_{cave}(\vec{z}). Then the update rules of theorem 2 (see equation 3.4) can be expressed as selecting \vec{z}^{t+1} to minimize a convex
cost function E^{t+1}(\vec{z}^{t+1}). Moreover, we can write the solution as an analytic function of Lagrange multipliers \{\alpha_\mu\}, which are chosen to maximize a concave (dual) energy function. More formally, we have the following result:

Theorem 3. Let E_{vex}(\vec{z}) = \sum_i z_i \log \frac{z_i}{f_i}. Then the update equation of theorem 2 can be expressed as minimizing the convex energy function:

E^{t+1}(\vec{z}^{t+1}) = \vec{z}^{t+1} \cdot \vec{h} + \sum_i z_i^{t+1} \log \frac{z_i^{t+1}}{f_i} + \sum_\mu \alpha_\mu \{\vec{w}_\mu \cdot \vec{z}^{t+1} - c_\mu\},   (3.9)

where \vec{h} = \nabla E_{cave}(\vec{z}^t). The solution is of the form

z_i^{t+1}(\alpha) = f_i\, e^{-h_i}\, e^{-1}\, e^{-\sum_\mu \alpha_\mu w_i^\mu},   (3.10)

where the Lagrange multipliers \{\alpha_\mu\} are constrained to maximize the (concave) dual energy:

\hat{E}^{t+1}(\alpha) = -\sum_i z_i^{t+1}(\alpha) - \sum_\mu \alpha_\mu c_\mu = -\sum_i f_i\, e^{-h_i}\, e^{-1}\, e^{-\sum_\nu \alpha_\nu w_i^\nu} - \sum_\mu \alpha_\mu c_\mu.   (3.11)
Moreover, maximizing \hat{E}^{t+1}(\alpha) with respect to a specific \alpha_\mu enables us to satisfy the corresponding constraint exactly.

Proof. This is given by straightforward calculations. Differentiating E^{t+1} with respect to z_i^{t+1} gives

1 + \log \frac{z_i}{f_i} = -h_i - \sum_\mu \alpha_\mu w_i^\mu,   (3.12)

which corresponds to the update equation 3.4 of theorem 2. Substituting \vec{z}^{t+1}(\{\alpha_\mu\}) into E^{t+1}(\vec{z}^{t+1}) gives the dual energy function \hat{E}^{t+1}(\alpha) = E^{t+1}(\vec{z}^{t+1}(\alpha_\mu)). Since E^{t+1}(\vec{z}) is convex, duality ensures that the dual \hat{E}^{t+1}(\alpha) is concave (Strang, 1986) and hence has a unique maximum that corresponds to the constraints being satisfied. Setting \partial \hat{E}^{t+1} / \partial \alpha_\mu = 0 ensures that \vec{z}^{t+1} \cdot \vec{w}_\mu = c_\mu, and hence satisfies the \mu{}th constraint.
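For a single normalization constraint (\vec{w} = (1, \ldots, 1), c = 1), theorem 3 reduces to a softmax-like closed form: z_i is proportional to f_i e^{-h_i}, and the lone multiplier \alpha is fixed analytically by the constraint. The following sketch (our own, with invented f and h) checks that this z indeed minimizes the convex energy of equation 3.9 over the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.uniform(0.5, 2.0, 5)   # the f_i in E_vex(z) = sum_i z_i log(z_i / f_i)
h = rng.normal(size=5)         # h = grad E_cave(z^t), held fixed during the update

# equation 3.10 with one constraint, w = (1, ..., 1) and c = 1:
# z_i(alpha) = f_i e^{-h_i} e^{-1} e^{-alpha}; alpha is set so that sum_i z_i = 1
unnorm = f * np.exp(-h - 1.0)
alpha = np.log(unnorm.sum())   # maximizing the dual = solving the constraint analytically
z = unnorm * np.exp(-alpha)

def objective(q):
    # the convex energy of equation 3.9, restricted to the simplex
    # (the constraint term vanishes there)
    return q @ h + np.sum(q * np.log(q / f))
```

Any other point on the simplex has a strictly larger objective, since the gap is a K-L divergence from that point to z.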
Theorem 3 specifies a double-loop algorithm where the outer loop is given by equation 3.10 and the inner loop determines the \{\alpha_\mu\} by maximizing \hat{E}^{t+1}(\alpha). For the Bethe free energy, solving for the \{\alpha_\mu\} can be done by a discrete iterative algorithm. The algorithm maximizes \hat{E}^{t+1}(\alpha) with respect to
each \alpha_\mu in turn. This maximization can be done analytically for each \alpha_\mu. This inner loop algorithm generalizes work by Kosowsky and Yuille (1994; Kosowsky, 1995), who used a result similar to theorem 3 to obtain an algorithm for solving the linear assignment problem. This relates to the Sinkhorn algorithm (Sinkhorn, 1964), which converts positive matrices into doubly stochastic ones. Rangarajan, Gold, and Mjolsness (1996) applied this result to obtain double-loop algorithms for a range of optimization problems subject to linear constraints.

In the next section, we apply theorems 2 and 3 to the Bethe free energy. In particular, we will show that the nature of the linear constraints for the Bethe free energy means that solving for the constraints in theorem 2 can be done efficiently.

4 A CCCP Algorithm for the Bethe Free Energy
In this section, we return to the Bethe free energy and describe how we can implement an algorithm of the form given by theorems 1, 2, and 3. This CCCP double-loop algorithm is designed by splitting the Bethe free energy into convex and concave parts. An inner loop is used to impose the linear constraints. (This design was influenced by the work of Rangarajan et al.; Rangarajan, Gold, & Mjolsness, 1996; Rangarajan, Yuille, et al., 1996.) First, we split the Bethe free energy into two parts:
E_{vex} = \sum_{i,j:\, i > j} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \frac{b_{ij}(x_i, x_j)}{w_{ij}(x_i, x_j)} + \sum_i \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)},

E_{cave} = -\sum_i n_i \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\psi_i(x_i)}.   (4.1)
This split enables us to get nonzero derivatives of E_{vex} with respect to both \{b_{ij}\} and \{b_i\}. (Other choices of split are possible.)

We now need to express the linear constraints (see equation 2.6) in the form used in theorems 2 and 3. To do this, we set \vec{z} = (b_{ij}(x_i, x_j), b_i(x_i)), so that the first components of \vec{z} correspond to the \{b_{ij}\} and the latter to the \{b_i\}. (There are NM components for the \{b_i\}, but the number of \{b_{ij}\} depends on the number of connections and is, at most, \frac{N(N-1)}{2} M^2.) The dot product of \vec{z} with a vector \vec{w} = (T_{ij}(x_i, x_j), U_i(x_i)) is given by \sum_{i,j:\, i > j} \sum_{x_i, x_j} b_{ij}(x_i, x_j)\, T_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i)\, U_i(x_i).

There are two types of constraints: (1) the normalization constraints \sum_{x_p, x_q} b_{pq}(x_p, x_q) = 1, \forall p, q: p > q, and (2) the consistency constraints \sum_{x_p} b_{pq}(x_p, x_q) = b_q(x_q), \forall p, q, x_q: p > q and \sum_{x_q} b_{pq}(x_p, x_q) = b_p(x_p), \forall q, p, x_p: p > q.

We index the normalization constraints by pq (with p > q). Then we can express the constraint vectors \{\vec{w}_{pq}\}, the constraint coefficients \{\alpha_{pq}\}, and the
constraint values \{c_{pq}\} by

\vec{w}_{pq} = (\delta_{pi}\delta_{qj},\, 0), \quad \alpha_{pq} = \gamma_{pq}, \quad c_{pq} = 1, \quad \forall p, q.   (4.2)

The consistency constraints are indexed by pq x_q (with p > q) and qp x_p (with p > q). The constraint vectors, the constraint coefficients, and the constraint values are given by:

\vec{w}_{pq x_q} = (\delta_{ip}\delta_{jq}\delta_{x_j, x_q},\, -\delta_{iq}\delta_{x_i, x_q}), \quad \alpha_{pq x_q} = \lambda_{pq}(x_q), \quad c_{pq x_q} = 0, \quad \forall p, q, x_q,
\vec{w}_{qp x_p} = (\delta_{ip}\delta_{jq}\delta_{x_i, x_p},\, -\delta_{ip}\delta_{x_i, x_p}), \quad \alpha_{qp x_p} = \lambda_{qp}(x_p), \quad c_{qp x_p} = 0, \quad \forall q, p, x_p.   (4.3)
We now apply theorem 3 to the Bethe free energy and obtain theorem 4, which specifies the outer loop of our algorithm:

Theorem 4 (CCCP Outer Loop for Bethe). The following update rule is guaranteed to reduce the Bethe free energy provided the constraint coefficients \{\gamma_{pq}\}, \{\lambda_{pq}\}, \{\lambda_{qp}\} can be chosen to ensure that \{b_{ij}(t+1)\}, \{b_i(t+1)\} satisfy the linear constraints of equation 2.6:

b_{ij}(x_i, x_j; t+1) = w_{ij}(x_i, x_j)\, e^{-1}\, e^{-\lambda_{ij}(x_j)}\, e^{-\lambda_{ji}(x_i)}\, e^{-\gamma_{ij}},

b_i(x_i; t+1) = \psi_i(x_i) \Big\{ \frac{b_i(x_i; t)}{\psi_i(x_i)} \Big\}^{n_i} e^{-1}\, e^{n_i}\, e^{\sum_k \lambda_{ki}(x_i)}.   (4.4)
Proof. These equations correspond to equation 3.10 in theorem 3, where we have substituted equation 4.1 into equation 3.4 and used equations 4.2 and 4.3 for the constraints. The constraint terms \sum_\mu \alpha_\mu \vec{w}_\mu simplify owing to the form of the constraints (e.g., \sum_{p,q:\, p > q} \gamma_{pq} \delta_{ip} \delta_{jq} = \gamma_{ij}).

We need an inner loop to determine the constraint coefficients \{\lambda_{ij}, \lambda_{ji}, \gamma_{ij}\}. This is obtained by using theorem 3. Recall that finding the constraint coefficients is equivalent to maximizing the dual energy \hat{E}^{t+1}(\alpha) and that performing the maximization with respect to the \mu{}th coefficient \alpha_\mu corresponds to solving the \mu{}th constraint equation. For the Bethe free energy, it is possible to solve the \mu{}th constraint equation to obtain an analytic expression for the \mu{}th coefficient in terms of the remaining coefficients. Therefore, we can maximize the dual energy \hat{E}^{t+1}(\alpha) with respect to any coefficient \alpha_\mu analytically. Hence, we have an algorithm that is guaranteed to converge to the maximum of \hat{E}^{t+1}(\alpha): select a constraint \mu, solve the equation \partial \hat{E}^{t+1} / \partial \alpha_\mu = 0 for \alpha_\mu analytically, and repeat. As we will see in section 4.1, this inner loop is very similar to BP (provided we equate the messages with the exponentials of the Lagrange multipliers).
More formally, we specify the inner loop by theorem 5:

Theorem 5 (CCCP Inner Loop for Bethe). The constraint coefficients \{\gamma_{pq}\}, \{\lambda_{pq}\}, \{\lambda_{qp}\} of theorem 4 can be solved for by a discrete iterative algorithm, indexed by \tau, guaranteed to converge to the unique solution. At each step, we select coefficients \gamma_{pq}, \lambda_{pq}(x_q), or \lambda_{qp}(x_p) and update them by:

e^{\gamma_{pq}(\tau+1)} = \sum_{x_p, x_q} w_{pq}(x_p, x_q)\, e^{-1}\, e^{-\lambda_{pq}(x_q; \tau)}\, e^{-\lambda_{qp}(x_p; \tau)},

e^{2\lambda_{pq}(x_q; \tau+1)} = \frac{\sum_{x_p} w_{pq}(x_p, x_q)\, e^{-\lambda_{qp}(x_p; \tau)}\, e^{-\gamma_{pq}(\tau)}}{\psi_q(x_q)\, e^{n_q} \big\{ \frac{b_q(x_q; t)}{\psi_q(x_q)} \big\}^{n_q}\, e^{\sum_{j \ne p} \lambda_{jq}(x_q; \tau)}},

e^{2\lambda_{qp}(x_p; \tau+1)} = \frac{\sum_{x_q} w_{pq}(x_p, x_q)\, e^{-\lambda_{pq}(x_q; \tau)}\, e^{-\gamma_{pq}(\tau)}}{\psi_p(x_p)\, e^{n_p} \big\{ \frac{b_p(x_p; t)}{\psi_p(x_p)} \big\}^{n_p}\, e^{\sum_{j \ne q} \lambda_{jp}(x_p; \tau)}},   (4.5)

which monotonically maximizes the dual energy:

\hat{E}^{t+1}(\alpha) = -\sum_{i,j:\, i > j} \sum_{x_i, x_j} w_{ij}(x_i, x_j)\, e^{-1}\, e^{-\lambda_{ij}(x_j)}\, e^{-\lambda_{ji}(x_i)}\, e^{-\gamma_{ij}} - \sum_i \sum_{x_i} \psi_i(x_i)\, e^{-1}\, e^{n_i} \Big\{ \frac{b_i(x_i; t)}{\psi_i(x_i)} \Big\}^{n_i} e^{\sum_k \lambda_{ki}(x_i)} - \sum_{i,j:\, i > j} \gamma_{ij}.   (4.6)
Proof. We use the update rule, equation 3.10, given by theorem 3 and calculate the constraint equations, \vec{z} \cdot \vec{w}_\mu = c_\mu, \forall \mu. For the Bethe free energy, we obtain equations 4.5, where the upper equation corresponds to the normalization constraints and the lower equations to the consistency constraints. Observe that we can solve each equation analytically for the corresponding constraint coefficient \gamma_{pq}, \lambda_{pq}(x_q), \lambda_{qp}(x_p). By theorem 3, this is equivalent to maximizing the dual energy (see equation 3.11) with respect to each coefficient. Since the dual energy is concave, solving for each coefficient \gamma_{pq}, \lambda_{pq}(x_q), \lambda_{qp}(x_p) (with the others fixed) is guaranteed to increase the dual energy. Hence, we can maximize the dual energy, and ensure the constraints are satisfied, by repeatedly selecting coefficients and solving equations 4.5. The form of the dual energy is given by substituting equation 4.4 into equation 3.11.

Observe that we can update all the coefficients \{\gamma_{pq}\} simultaneously because their update rule (the right-hand side of the top equation of 4.5) depends only on the \{\lambda_{pq}\}. Similarly, we can update many of the \{\lambda_{pq}\} simultaneously because their update rules (the right-hand sides of the middle and bottom equations of 4.5) depend on only a subset of the \{\lambda_{pq}\}. (For example,
when updating \lambda_{pq}(x_q), we can simultaneously update any \lambda_{ij}(x_j) provided i \ne p, j \ne q, i \ne q, and j \ne p.)

4.1 Connections Between CCCP and BP. As we will show, CCCP and BP are fairly similar. The main difference is that the CCCP update rules for the Lagrange multipliers \{\lambda, \gamma\} depend explicitly on the current estimates of the beliefs. The estimates of the beliefs must then be reestimated periodically (after each run of the inner loop). By contrast, for BP, the update rules for the messages are independent of the beliefs.

We now give an alternative formulation of the CCCP algorithm that simplifies the algorithm (for implementation) and makes the connections to belief propagation more apparent. Define new variables \{h_{pq}, h_p, g_{pq}, g_p\} by:

h_{pq}(x_p, x_q) = w_{pq}(x_p, x_q)\, e^{-1}, \qquad g_{pq}(\lambda, \gamma) = e^{-\gamma_{pq} - \lambda_{pq}(x_q) - \lambda_{qp}(x_p)},

h_q(x_q) = \psi_q(x_q)\, e^{-1}\, e^{n_q} \Big\{ \frac{b_q(x_q)}{\psi_q(x_q)} \Big\}^{n_q}, \qquad g_q(\lambda) = e^{\sum_j \lambda_{jq}(x_q)}.   (4.7)
The outer loop can be written as:

b_{ij}(x_i, x_j) = h_{ij}(x_i, x_j)\, g_{ij}(\lambda, \gamma), \qquad b_i(x_i) = h_i(x_i)\, g_i(\lambda).   (4.8)
The inner loop can be written as:

e^{\gamma_{pq}(\tau+1)} = e^{\gamma_{pq}(\tau)} \sum_{x_p, x_q} h_{pq}(x_p, x_q)\, g_{pq}(\lambda, \gamma),

e^{2\lambda_{pq}(x_q; \tau+1)} = e^{2\lambda_{pq}(x_q; \tau)}\, \frac{\sum_{x_p} h_{pq}(x_p, x_q)\, g_{pq}(\lambda, \gamma)}{h_q(x_q)\, g_q(\lambda)}.   (4.9)
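The double loop of equations 4.7 through 4.9 can be sketched directly. The following toy implementation (our own setup: a three-node binary chain with invented random potentials; not the article's code) writes the inner loop in incremental log form and the outer loop as equation 4.8. On a tree, the converged beliefs should match the exact marginals, since the Bethe free energy is then exact.

```python
import numpy as np

rng = np.random.default_rng(0)
# three-node binary chain, edges (1,0) and (2,1) with convention p > q
edges = [(1, 0), (2, 1)]
psi = [rng.uniform(0.5, 1.5, 2) for _ in range(3)]
psi_pair = {e: rng.uniform(0.5, 1.5, (2, 2)) for e in edges}  # psi_pq(x_p, x_q)
nbrs = {0: [1], 1: [0, 2], 2: [1]}
n = {i: len(nbrs[i]) for i in range(3)}

# w_pq(x_p, x_q) = psi_pq psi_p psi_q, as in equation 2.4
w = {(p, q): psi_pair[(p, q)] * psi[p][:, None] * psi[q][None, :]
     for (p, q) in edges}

lam = {(i, j): np.zeros(2) for (p, q) in edges for (i, j) in [(p, q), (q, p)]}
gam = {e: 0.0 for e in edges}
b = [np.full(2, 0.5) for _ in range(3)]       # outer-loop beliefs b_i(x_i; t)

def b_pair(p, q):
    # pairwise belief of equations 4.4 / 4.8 given the current multipliers
    return w[(p, q)] * np.exp(-1.0 - lam[(p, q)][None, :]
                              - lam[(q, p)][:, None] - gam[(p, q)])

def h(i):   # h_i of equation 4.7 (depends on the current outer-loop beliefs)
    return psi[i] * np.exp(n[i] - 1.0) * (b[i] / psi[i]) ** n[i]

def g(i):   # g_i of equation 4.7
    return np.exp(sum(lam[(j, i)] for j in nbrs[i]))

for outer in range(300):
    for inner in range(50):                   # inner loop: equation 4.9 in log form
        for (p, q) in edges:
            gam[(p, q)] += np.log(b_pair(p, q).sum())
            lam[(p, q)] += 0.5 * np.log(b_pair(p, q).sum(axis=0) / (h(q) * g(q)))
            lam[(q, p)] += 0.5 * np.log(b_pair(p, q).sum(axis=1) / (h(p) * g(p)))
    b = [h(i) * g(i) for i in range(3)]       # one outer step: equation 4.8
```

The iteration counts here are generous for such a small problem; in practice the inner loop would be run to a tolerance.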
This formulation of the algorithm is easier to program. As before, the inner loop is iterated until convergence, and then we perform one step of the outer loop and repeat.

We now show a formal similarity between CCCP and belief propagation. To do this, we collapse the outer loop of CCCP by solving for the \{b_{ij}, b_i\} as functions of the variables \{\lambda, \gamma\}. This is done by solving the fixed-point equation 4.8 using equation 4.7 to obtain \{b^*_i, b^*_{ij}\} given by:

b^*_i(x_i) = \psi_i(x_i)\, e^{-1} \{g_i(\lambda)\}^{-1/(n_i - 1)}, \qquad b^*_{ij}(x_i, x_j) = w_{ij}(x_i, x_j)\, e^{-1}\, g_{ij}(\lambda, \gamma).   (4.10)
Now substitute \{b^*_i, b^*_{ij}\} into the inner loop update equation 4.9. This collapsed CCCP algorithm reduces to:

e^{2\lambda_{pq}(x_q; \tau+1)} \times \Big\{ e^{-\lambda_{pq}(x_q; \tau)}\, e^{-\frac{1}{n_q - 1} \sum_j \lambda_{jq}(x_q; \tau)} \Big\} = \sum_{x_p} \psi_{pq}(x_p, x_q)\, \psi_p(x_p)\, e^{-\lambda_{qp}(x_p; \tau)}\, e^{-\gamma_{pq}}.   (4.11)
To relate this to belief propagation, we recall from section 2 that the messages can be related to the Lagrange multipliers by:

m_{ji}(x_i) = e^{\lambda_{ji}(x_i) - \frac{1}{n_i - 1} \sum_k \lambda_{ki}(x_i)}, \qquad e^{-\lambda_{ij}(x_j)} = \prod_{k \ne i} m_{kj}(x_j).   (4.12)
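The inverse relation in equation 4.12 is an algebraic identity: substituting the first formula into the product over k \ne i makes the 1/(n_j - 1) correction terms sum up to exactly the full multiplier total, leaving e^{-\lambda_{ij}}. A quick numerical check (our own sketch, with arbitrary multipliers):

```python
import numpy as np

rng = np.random.default_rng(0)
n_j = 4                             # suppose node j has four neighbors, labeled 0..3
lam = rng.normal(size=(n_j, 2))     # lam[k] stands for lambda_kj(x_j), x_j binary

def message(k):
    # m_kj(x_j) = exp( lambda_kj(x_j) - (1 / (n_j - 1)) * sum_l lambda_lj(x_j) )
    return np.exp(lam[k] - lam.sum(axis=0) / (n_j - 1))

# the inverse relation of equation 4.12: exp(-lambda_ij(x_j)) = prod_{k != i} m_kj(x_j)
i = 2
prod = np.prod([message(k) for k in range(n_j) if k != i], axis=0)
```

The identity holds for any values of the multipliers, not only at a fixed point.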
We can reexpress belief propagation as:

e^{\lambda_{ij}(x_j; \tau+1)}\, e^{-\frac{1}{n_j - 1} \sum_k \lambda_{kj}(x_j; \tau+1)} = c \sum_{x_i} \psi_{ij}(x_i, x_j)\, \psi_i(x_i)\, e^{-\lambda_{ji}(x_i; \tau)},   (4.13)

b_i(x_i; \tau) = \hat{c}\, \psi_i(x_i)\, e^{-\frac{1}{n_i - 1} \sum_l \lambda_{li}(x_i; \tau)}.   (4.14)

This shows that belief propagation is similar to the collapsed CCCP, with the only difference that the factor \{e^{-\lambda_{pq}(x_q; \tau)}\, e^{-\frac{1}{n_q - 1} \sum_j \lambda_{jq}(x_q; \tau)}\} is evaluated at time \tau for the collapsed double loop and at time \tau + 1 for belief propagation. (This derivation has ignored the "normalization dynamics," that is, updating the \{\gamma_{pq}\}, because it is straightforward and can be taken for granted. With this understanding, we have dropped the dependence of the \{\gamma_{pq}\} on \tau.)

5 The Kikuchi Approximation
The Kikuchi free energy (Domb & Green, 1972; Yedidia et al., 2000) is a generalization of the Bethe free energy that enables us to include higher-order interactions. In this section, we show that the results we obtained for the Bethe free energy can be extended to the Kikuchi free energy. Recall that Yedidia et al. (2000) derived a "generalized belief propagation" (GBP) algorithm whose fixed points correspond to extrema of the Kikuchi free energy. Their computer simulations showed that GBP outperforms BP on 2D spin glasses and obtains results close to the optimum (presumably because Kikuchi is a higher-order approximation than Bethe).

We now define the Kikuchi free energy (following the exposition in Yedidia et al., 2000). For a general graph, let R be a set of regions that include some basic clusters of nodes, their intersections, the intersections of the intersections, and so on. The Bethe approximation corresponds to the special case where the basic clusters consist of all linked pairs of nodes. For any region r, we define the superregions of r to be the set sup(r) of all regions in R that contain r. Similarly, we define the subregions of r to be the set sub(r) of all regions in R that lie completely within r. In addition, we define the direct subregions of r to be the set sub_d(r) of all subregions of r that have no superregions that are also subregions of r. Similarly, we define the direct superregions of r to be the set sup_d(r) of superregions of r that have no subregions that are superregions of r. If s is a direct subregion of r, then we define r \backslash s to be those nodes that are in r but not in s.
Figure 4: The regions for the Kikuchi implementation of the 2D spin glass. (Left) Top regions. (Center and right) Subregions.
We illustrate these definitions on the 2D spin glass (see Figure 1). The basic regions are shown in Figure 4 (other choices of basic regions are possible). The direct superregions and the direct subregions are shown in Figures 5 and 6.

Let x_r be the state of the nodes in region r and b_r(x_r) be the "belief" in x_r. x \in r \backslash s denotes the state of the nodes that are in r but not in s. Any region r has an energy E_r(x_r) associated with it (e.g., for a region with pairwise interactions, we have E_r(x_r) = -\log \prod_{i \in r,\, j \in r:\, i > j} \psi_{ij}(x_i, x_j) - \log \prod_{i \in r} \psi_i(x_i)).
Figure 5: The direct subregions. The top-level regions (top panel) have four direct subregions (two horizontal and two vertical). Each middle-level region has two direct subregions (bottom two panels).
Figure 6: The direct superregions. The middle-level regions have two direct superregions (top two panels). The bottom-level region has four direct superregions (bottom panel).
Then the Kikuchi free energy is

F_K = \sum_{r \in R} c_r \Big\{ \sum_{x_r} b_r(x_r)\, E_r(x_r) + \sum_{x_r} b_r(x_r) \log b_r(x_r) \Big\} + L_K,   (5.1)

where L_K gives the constraints (see equation 5.2 below) and c_r is an "overcounting" number of region r, defined by c_r = 1 - \sum_{s \in sup(r)} c_s, where sup(r) is the set of all regions in R that contain r. For the largest regions, we have c_r = 1. We let c_{max} = \max_{r \in R} c_r.

The beliefs b_r(x_r) must obey two types of constraints: (1) they must all sum to one, and (2) they must be consistent with the beliefs in regions that intersect with r. This consistency can be ensured by imposing consistency between all regions and their direct subregions (i.e., between all r \in R and their direct subregions s \in sub_d(r)). We can impose the constraints by
Lagrange multipliers:
LK D
X
Á cr
X xr
r2R
C
X
! br (xr ) ¡ 1
X
X
( lr,s ( xs )
r2R s2subd ( r) xs
X x2rns
) br ( xr ) ¡ bs ( xs ) .
(5.2)
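The overcounting numbers c_r = 1 - \sum_{s \in sup(r)} c_s can be computed by visiting regions from largest to smallest. A small sketch (our own; regions are plain Python frozensets, and the example reproduces the Bethe coefficients 1 - n_i for singleton regions of a two-edge chain):

```python
# regions as frozensets of node ids; Bethe-style example for a two-edge chain:
# pairwise basic clusters {0,1}, {1,2} plus the single-node intersections
regions = [frozenset(s) for s in ({0, 1}, {1, 2}, {0}, {1}, {2})]

def overcounting(regions):
    # c_r = 1 - sum of c_s over all regions s that strictly contain r,
    # so visit regions from largest to smallest
    c = {}
    for r in sorted(regions, key=len, reverse=True):
        c[r] = 1 - sum(c[s] for s in regions if r < s)
    return c

c = overcounting(regions)
```

Node 1 lies in two pair regions, so its singleton gets c = 1 - 2 = -1, matching the -(n_i - 1) weighting of the single-node terms in the Bethe free energy of equation 2.5.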
5.1 A CCCP Algorithm for the Kikuchi Free Energy. Now we obtain a CCCP algorithm guaranteed to converge to an extremum of the Kikuchi free energy. First, we observe that the Kikuchi free energy, like the Bethe free energy, can be split into a concave and a convex part. We choose the following split (many others are possible):
F_K = F_{K,vex} + F_{K,cave},   (5.3)

where

F_{K,vex} = \sum_{r \in R} c_{max} \Big\{ \sum_{x_r} b_r(x_r)\, E_r(x_r) + \sum_{x_r} b_r(x_r) \log b_r(x_r) \Big\} + \sum_{r \in R} \gamma_r \Big( \sum_{x_r} b_r(x_r) - 1 \Big) + \sum_{r \in R} \sum_{s \in sub_d(r)} \sum_{x_s} \lambda_{r,s}(x_s) \Big\{ \sum_{x \in r \backslash s} b_r(x_r) - b_s(x_s) \Big\},   (5.4)

F_{K,cave} = \sum_{r \in R} (c_r - c_{max}) \Big\{ \sum_{x_r} b_r(x_r)\, E_r(x_r) + \sum_{x_r} b_r(x_r) \log b_r(x_r) \Big\},   (5.5)
where we have used the results stated in equations 3.7 and 3.8 to impose the constraints only on the convex term. It can be readily verified that F_{K,vex} is a convex function of \{b_r(x_r)\} and F_{K,cave} is a concave function. Recall that c_{max} = \max_{r \in R} c_r; this ensures that F_{K,cave} is concave.

Theorem 6 (CCCP for Kikuchi). The Kikuchi free energy F_K can be minimized by a double-loop algorithm. The outer loop has time parameter t and is given by:

b_r(x_r; t+1) = e^{-\frac{c_r}{c_{max}}\{E_r(x_r) + 1\}}\, e^{-\gamma_r}\, \{b_r(x_r; t)\}^{\frac{c_{max} - c_r}{c_{max}}}\, e^{-\sum_{s \in sub_d(r)} \lambda_{r,s}(x_s)}\, e^{\sum_{v \in sup_d(r)} \lambda_{v,r}(x_r)}.   (5.6)
The inner loops, to solve for the \{\gamma_r\}, \{\lambda_{r,s}\}, have time parameter \tau and are given by:

e^{\gamma_r(\tau+1)} = \sum_{x_r} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\, \{b_r(x_r; t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\sum_{s \in sub_d(r)} \lambda_{r,s}(x_s; \tau)}\, e^{\sum_{v \in sup_d(r)} \lambda_{v,r}(x_r; \tau)},

e^{2\lambda_{r,u}(x_u; \tau+1)} = \frac{\sum_{x \in r \backslash u} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\, \{b_r(x_r; t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\gamma_r(\tau)}\, e^{-\sum_{s \in sub_d(r):\, s \ne u} \lambda_{r,s}(x_s; \tau)}\, e^{\sum_{v \in sup_d(r)} \lambda_{v,r}(x_r; \tau)}}{e^{-\frac{c_u}{c_{max}}\{E_u(x_u)+1\}}\, \{b_u(x_u; t)\}^{\frac{c_{max}-c_u}{c_{max}}}\, e^{-\gamma_u(\tau)}\, e^{-\sum_{s \in sub_d(u)} \lambda_{u,s}(x_s; \tau)}\, e^{\sum_{v \in sup_d(u):\, v \ne r} \lambda_{v,u}(x_u; \tau)}}.   (5.7)

Moreover, the inner loop is guaranteed to satisfy the constraints by converging to the unique maximum of the dual energy:

\hat{E}^{t+1}(\lambda, \gamma) = -\sum_{r \in R} \sum_{x_r} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\, e^{-\gamma_r}\, \{b_r(x_r; t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\sum_{s \in sub_d(r)} \lambda_{r,s}(x_s)}\, e^{\sum_{v \in sup_d(r)} \lambda_{v,r}(x_r)} - \sum_{r \in R} \gamma_r.   (5.8)
Proof. We split the Kikuchi free energy F_K into the convex and concave parts, F_{K,vex}, F_{K,cave}, given by equation 5.5. We can use theorems 1, 2, and 3 to obtain an algorithm to minimize the Kikuchi free energy. In more detail, we calculate the derivatives:

\frac{\partial F_{K,vex}}{\partial b_r(x_r)} = c_{max} \{E_r(x_r) + 1 + \log b_r(x_r)\} + \gamma_r + \sum_{s \in sub_d(r)} \lambda_{r,s}(x_s) - \sum_{v \in sup_d(r)} \lambda_{v,r}(x_r),   (5.9)

\frac{\partial F_{K,cave}}{\partial b_r(x_r)} = (c_r - c_{max}) \{E_r(x_r) + 1 + \log b_r(x_r)\},   (5.10)

where we have not included constraint terms in \partial F_{K,cave} / \partial b_r(x_r) because they can be absorbed into the constraints in \partial F_{K,vex} / \partial b_r(x_r). (Recall from theorems 1, 2, and 3 that we are setting \frac{\partial F_{K,vex}}{\partial b_r(x_r)}(t+1) = -\frac{\partial F_{K,cave}}{\partial b_r(x_r)}(t).)
From equations 5.9 and 5.10, we can obtain the outer loop of our update algorithm (using theorems 2 and 3):

b_r(x_r; t+1) = e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\, \{b_r(x_r; t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\gamma_r}\, e^{-\sum_{s \in sub_d(r)} \lambda_{r,s}(x_s)}\, e^{\sum_{v \in sup_d(r)} \lambda_{v,r}(x_r)}.   (5.11)

The inner loop, to solve for the \{\gamma_r\}, \{\lambda_{r,s}\}, is determined by theorem 2. We know that solving the constraint equations one by one is guaranteed to converge to the unique solution for the constraints. The constraint equation for \gamma_r is \sum_{x_r} b_r(x_r) = 1, which can be solved to give an analytic expression for \gamma_r:

e^{\gamma_r} = \sum_{x_r} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\, \{b_r(x_r; t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\sum_{s \in sub_d(r)} \lambda_{r,s}(x_s)}\, e^{\sum_{v \in sup_d(r)} \lambda_{v,r}(x_r)}.   (5.12)

The constraint equation for \lambda_{r,u}(x_u) is \sum_{x \in r \backslash u} b_r(x_r) = b_u(x_u), which yields:

\sum_{x \in r \backslash u} e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\, \{b_r(x_r; t)\}^{\frac{c_{max}-c_r}{c_{max}}}\, e^{-\gamma_r}\, e^{-\sum_{s \in sub_d(r)} \lambda_{r,s}(x_s)}\, e^{\sum_{v \in sup_d(r)} \lambda_{v,r}(x_r)} = e^{-\frac{c_u}{c_{max}}\{E_u(x_u)+1\}}\, \{b_u(x_u; t)\}^{\frac{c_{max}-c_u}{c_{max}}}\, e^{-\gamma_u}\, e^{-\sum_{s \in sub_d(u)} \lambda_{u,s}(x_s)}\, e^{\sum_{v \in sup_d(u)} \lambda_{v,u}(x_u)},   (5.13)
which can be rearranged to give an analytic solution for \lambda_{r,u}(x_u) (see equation 5.7). The dual energy is given by theorem 3. Hence the result.

5.2 CCCP Kikuchi and Generalized Belief Propagation. We now generalize our arguments for the Bethe free energy (see section 4.1) and show there is a connection between CCCP for Kikuchi and GBP. Once again, CCCP is very similar to message passing, but it requires us to update our estimates of the beliefs periodically. First, we simplify the form of the CCCP Kikuchi algorithm. Then we briefly describe GBP. Finally, we show connections between CCCP and GBP by collapsing the outer loop.

We simplify the double-loop Kikuchi algorithm by defining new variables:

h_r(x_r) = e^{-\frac{c_r}{c_{max}}\{E_r(x_r)+1\}}\, \{b_r(x_r)\}^{\frac{c_{max}-c_r}{c_{max}}},

g_r(x_r) = e^{-\gamma_r - \sum_{s \in sub_d(r)} \lambda_{r,s}(x_s) + \sum_{v \in sup_d(r)} \lambda_{v,r}(x_r)}.   (5.14)
The outer loop rule becomes:

b_r(x_r) = h_r(x_r)\, g_r(x_r).   (5.15)

The inner loop becomes:

e^{\gamma_r(\tau+1)} = e^{\gamma_r(\tau)} \sum_{x_r} h_r(x_r)\, g_r(x_r),

e^{2\lambda_{r,u}(x_u; \tau+1)} = e^{2\lambda_{r,u}(x_u; \tau)}\, \frac{\sum_{x \in r \backslash u} h_r(x_r)\, g_r(x_r)}{h_u(x_u)\, g_u(x_u)}.   (5.16)
This form of the update rules is straightforward to program and, as we now show, relates to the generalized belief propagation algorithm of Yedidia et al. (2000).

We now give a formulation of generalized belief propagation. Following Yedidia et al. (2000), we allow messages m_{r,s}(x_s) between a region r and any direct subregion s ∈ sub_d(r). We also define M(r) to be the set of all messages that flow into r, or into any of its subregions, from outside r (i.e., m_{r′,s′}(x_{s′}) ∈ M(r) provided s′ is a subregion of r, or r itself, and r′\s′ is outside r). The messages will correspond to (new) Lagrange multipliers {μ} related to the messages by μ_{r,s}(x_s) = log m_{r,s}(x_s). The main idea is to introduce the new Lagrange multipliers {μ_{r,s}} so that

Σ_{r∈R} Σ_{s∈sub_d(r)} Σ_{x_s} λ_{r,s}(x_s) {Σ_{x_{r\s}} b_r(x_r) − b_s(x_s)} = Σ_r c_r Σ_{x_r} b_r(x_r) Σ_{{r′,s′}∈M(r)} μ_{r′,s′}(x_{s′}).    (5.17)

By extremizing the Kikuchi free energy, we can find the solution to be

b_r(x_r) = ψ_r(x_r) Π_{{r′,s′}∈M(r)} m_{r′,s′}(x_{s′}),    (5.18)
where we relate the messages to the Lagrange multipliers by μ_{r,s}(x_s) = log m_{r,s}(x_s). Generalized belief propagation can be reformulated as

m_{r,s}(x_s; t+1) = m_{r,s}(x_s; t) [Σ_{x_{r\s}} ψ_r(x_r) Π_{{r′,s′}∈M(r)} m_{r′,s′}(x_{s′})] / [ψ_s(x_s) Π_{{r′,s′}∈M(s)} m_{r′,s′}(x_{s′})].    (5.19)

It is clear that fixed points of this equation will correspond to situations where the constraints are satisfied (because the numerator will equal Σ_{x_{r\s}} b_r(x_r) and the denominator is b_s(x_s)). The form of b_r(x_r) means that this is all we need to do.
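Such message passing can be made concrete on a toy tree, where sum-product is exact. The example below (mine, not the paper's) checks the message-passing marginal of the middle node of a three-node chain against brute-force enumeration.

```python
import numpy as np

# Toy sum-product example (illustrative, not the paper's GBP): a chain
# x1 - x2 - x3 of binary variables with pairwise potentials and a unary
# potential on x2. On a tree, the belief at x2 is the exact marginal.
psi12 = np.array([[2.0, 1.0], [1.0, 3.0]])   # psi12[x1, x2]
psi23 = np.array([[1.0, 3.0], [1.0, 1.0]])   # psi23[x2, x3]
phi2 = np.array([0.7, 0.3])                  # unary potential on x2

# messages into x2 from its two neighbors
m1_to_2 = psi12.sum(axis=0)   # sum over x1 of psi12[x1, x2]
m3_to_2 = psi23.sum(axis=1)   # sum over x3 of psi23[x2, x3]
b2 = phi2 * m1_to_2 * m3_to_2
b2 /= b2.sum()                # normalized belief (marginal) at x2

# brute-force marginal over all 8 joint states
joint = phi2[None, :, None] * psi12[:, :, None] * psi23[None, :, :]
b2_exact = joint.sum(axis=(0, 2))
b2_exact /= b2_exact.sum()
# b2 and b2_exact agree
```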
CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies
1713
To relate this to the Kikuchi double loop, we once again collapse the outer loop by solving for the {b_r} in terms of the Lagrange multipliers {λ}. This gives

b*_r(x_r) = e^{−E_r(x_r) − 1} {g_r(λ, γ)}^{c_max/c_r}.    (5.20)

These are precisely of the same form as those given by generalized belief propagation after we solve for the relationship between the Lagrange multipliers to be

Σ_{s∈sub_d(r)} λ_{r,s}(x_s) − Σ_{p∈sup_d(r)} λ_{p,r}(x_r) = c_r Σ_{{r′,s′}∈M(r)} μ_{r′,s′}(x_{s′}).    (5.21)

The updates are then given by

e^{γ_r(t+1)} = e^{γ_r(t)} Σ_{x_r} b*_r(x_r),
e^{2λ_{r,u}(x_u; t+1)} = e^{2λ_{r,u}(x_u; t)} [Σ_{x_{r\u}} b*_r(x_r)] / [b*_u(x_u)].    (5.22)
So these updates are very similar to generalized belief propagation, using equations 5.18 and 5.19. The only difference is that we are updating the {λ} instead of the {μ}, which are related by a linear transformation.

6 Implementation
We implemented the Bethe and Kikuchi CCCP algorithms in numeric Python on spin-glass energy functions. We used binary state variables defined on 2D and 3D lattices (using toroidal boundary conditions). This is similar to the setup used in Yedidia et al. (2000) and Teh and Welling (2001). We also implemented BP and GBP for comparison. (These examples were chosen because it is known that BP has difficulties with graphs with so many closed loops.)

The grid was shown in Figure 1 (imagine a third dimension to get the 3D cube). We label the nodes by i, where each component of i takes values from 1 to N. We let i + Δh and i + Δv be the horizontal and vertical neighbors of site i. Each site has a spin state x_i ∈ {0, 1}. There are (unary) potentials at each lattice site, labeled ψ_i(x_i). There are horizontal and vertical connections between neighboring lattice sites, which we label ψ_{i,i+Δh}(x_i, x_{i+Δh}) and ψ_{i,i+Δv}(x_i, x_{i+Δv}), respectively (i.e., linking node a to b and a to c in Figure 1). In sum, we have N² unary potentials ψ_i, N² horizontal pairwise potentials ψ_{i,i+Δh}, and N² vertical pairwise potentials ψ_{i,i+Δv}. The size of N was varied from 10 to 100.
The ψ_i(x_i) are of the form e^{±h_i}, where the h_i are random samples from a gaussian distribution N(μ, σ). The horizontal potentials are of the form e^{±h_{i,i+Δh}}, where the h_{i,i+Δh} are random samples from a gaussian N(μ, σ). The vertical potentials are generated in a similar manner. The parameters of these three gaussians were varied (typically we set μ = 0 and σ ∈ {1, 2, 5}).

6.1 Bethe Implementations. We derived the Bethe free energy for the problem specified above and implemented the CCCP and the BP algorithms. We used randomized initial conditions on the messages or the Lagrange multipliers (as appropriate). We used unary beliefs b_i(x_i), horizontal joint beliefs b_{i,i+Δh}(x_i, x_{i+Δh}), and vertical joint beliefs b_{i,i+Δv}(x_i, x_{i+Δv}). For CCCP, we have Lagrange parameter terms λ_{i±Δv,i}(x_i) and λ_{i±Δh,i}(x_i). These are analogous to the BP message terms m_{i±Δv,i}(x_i) and m_{i±Δh,i}(x_i). They can be thought of as the messages that flow into node i (see Figure 2). For CCCP, we also have normalization terms γ_{i,i+Δh} and γ_{i,i+Δv} to normalize the horizontal and vertical joint beliefs.

For the CCCP algorithm, we used five iterations of the inner loop as a default. This gave stable convergence, although it did not always ensure that the constraints were satisfied. We experimented with fewer iterations. The algorithm was unstable if only one inner-loop iteration was used, but it already showed stability when we used two iterations. The BP algorithm was implemented in the standard manner. We used a parallel update rule.

For both algorithms, we measured convergence by the Bethe free energy or by the "belief difference": the difference in the belief estimates between adjacent iterations. Note that the Bethe free energy is not very meaningful for the BP algorithm until the algorithm has converged (because only then will the constraints be satisfied).
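The potential setup just described can be written down directly; the sketch below (function and array names are my own) draws the h's from a gaussian and builds the e^{±h} tables for an N × N toroidal lattice.

```python
import numpy as np

# Illustrative construction (names are my own, not the paper's code) of the
# spin-glass potentials described above: unary potentials exp(+-h_i) and
# pairwise potentials exp(+-h_ij), with the h's drawn from N(mu, sigma).
def make_spin_glass(N, mu=0.0, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # unary[i, j, x] for spin state x in {0, 1}
    h = rng.normal(mu, sigma, size=(N, N))
    unary = np.stack([np.exp(-h), np.exp(h)], axis=-1)
    # horizontal[i, j, x, x'] couples site (i, j) to (i, (j+1) mod N);
    # vertical couples (i, j) to ((i+1) mod N, j)
    hh = rng.normal(mu, sigma, size=(N, N))
    hv = rng.normal(mu, sigma, size=(N, N))
    sign = np.array([[1.0, -1.0], [-1.0, 1.0]])   # +- pattern over spin pairs
    horizontal = np.exp(hh[..., None, None] * sign)
    vertical = np.exp(hv[..., None, None] * sign)
    return unary, horizontal, vertical

unary, horizontal, vertical = make_spin_glass(10, sigma=2.0)
# N^2 unary, N^2 horizontal, and N^2 vertical potential tables, all positive
```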
The CCCP algorithm converged rapidly, typically within fewer than six iterations of the outer loop (see Figure 7 for the 2D spin glass and Figure 8 for the 3D spin glass). Most of the decrease in the Bethe free energy happened in the first two iterations. Varying the standard deviation of the gaussians (which generate the potentials) made no difference. The same behavior held when we altered the size of the lattice (our default lattice size was 10 × 10, but we also explored 50 × 50, 10 × 10 × 10, and 20 × 20 × 20). During the first six iterations, the constraints were not always satisfied (recall we are using five iterations of the inner loop); see, for example, the top right panel of Figure 8, where the free energy appears to increase at iteration 5. But this (temporary) noncompliance with the constraints did not seem to matter, and the Bethe free energy (almost) always decreased monotonically.

The BP algorithm behaved differently as we varied the situation. It was far less stable in general. For the 2D spin glass, the algorithm would not always converge. The convergence was better the smaller the variance of
Figure 7: Performance of CCCP and BP on 2D spin glasses. Bethe free energy plots (top panels) and belief difference plots (bottom panels). Left to right: σ = 1.0, 2.0, 5.0. CCCP solution (full lines), BP solution (dashed lines). Observe that CCCP is more reliable for σ ≥ 2 (although BP often converged correctly for σ = 2.0). Each iteration of CCCP counts as five iterations of BP (we use five iterations in the inner loop). For BP, the Bethe free energy is not very meaningful until convergence.
the gaussian distributions generating the potentials. So if we set σ_p = σ_h = σ_v = 1.0, then BP would often converge. But it became far more unstable when we increased the standard deviations to σ_p = σ_h = σ_v = 5.0. BP did not converge at all for the 3D spin-glass case. (Note that we did not try to help convergence by adding inertia terms; see section 6.2.)

Overall, CCCP was far more stable than BP, and the convergence rate was roughly as fast (taking into account the inner loops). The Bethe free energy of the CCCP solution was always as small as, or smaller than, that found by BP. The forms of the solutions found by both algorithms (when BP converged) were very similar (see Figure 9), though we observed that BP tended to favor beliefs biased toward 0 or 1, while CCCP often gave beliefs closer to 0.5. (By contrast, CCCP for Kikuchi did not show any tendency toward 0.5.) We conclude that on these examples, CCCP outperformed BP in terms of stability and quality of results (particularly for 3D spin glasses). These results are preliminary, and more systematic experiments should be performed. But they do provide proof of concept for our approach.

6.2 Kikuchi Implementations. We also implemented CCCP and GBP for the Kikuchi free energy. This was done for the 2D spin-glass case only.
Figure 8: Performance of CCCP and BP on 3D spin glasses. (Top panels) Bethe free energy plots and (bottom panels) belief difference plots. Left to right: σ = 1.0, 2.0, 5.0. CCCP solution (full lines), BP solution (dashed lines).
Once again, we found that CCCP converged very quickly (again we used five iterations of the inner loop as a default). Following Yedidia et al. (2000), we implemented GBP with inertia (we could not get the algorithm to converge without inertia). Yedidia et al. (2000) used an inertia factor of a = 0.5. We were more conservative and set a = 0.9 (which improved stability). (Inertia means that mess_new = a · mess_old + (1 − a) · mess_update, where a = 0 gives standard BP and GBP.)
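The effect of inertia is that of damping a fixed-point iteration. The toy sketch below (my own example, not a GBP message update) shows an update map that oscillates and diverges when applied raw (a = 0), yet converges under a = 0.9.

```python
import numpy as np

# Damped ("inertia") update m_new = a*m_old + (1-a)*m_update, applied to a
# toy scalar update map (illustrative only). The map below has fixed point
# m* = 0.4 but slope -1.5 there, so the raw iteration (a = 0) diverges;
# with a = 0.9 the damped map is a contraction and converges.
def damped_iterate(update, m0, a, n=100):
    m = m0
    for _ in range(n):
        m = a * m + (1 - a) * update(m)
    return m

update = lambda m: 1.0 - 1.5 * m         # fixed point at m* = 0.4
m = damped_iterate(update, 0.0, a=0.9)   # converges to m* = 0.4
```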
Figure 9: Bethe 2D spin glass. (Left) CCCP solution. (Center and right) Two BP solutions with different initial conditions. We plot the beliefs b_i(x_i = 0) for all nodes i on the 2D lattice. Observe the similarity between them.
Figure 10: Performance of CCCP Kikuchi and GBP on 2D spin glasses. (Top panels) Kikuchi free energy plots and (bottom panels) belief difference plots. Left to right: σ = 1.0, 2.0, 5.0. CCCP Kikuchi solution (full lines) and GBP solution (dashed lines).
Both algorithms converged with these settings. But CCCP converged more quickly, while GBP appeared to oscillate a bit about the solution before converging. Both gave similar results for the final Kikuchi free energy (see Figure 10). The solutions found by the two algorithms were very similar (see Figure 11). Moreover, the solutions appeared to be practically independent of the starting conditions (which were randomized). Yedidia et al. (2000) reported that GBP gave results on 2D spin glasses that were almost identical to the true solutions (found by using a maximum flow algorithm). This is consistent with our finding that both GBP and CCCP converged to very similar solutions despite randomized starting conditions.

We now give more details of the implementation. The top-level regions are 2 × 2 squares. These are labeled b_{i,i+Δh,i+Δv,i+Δh+Δv}(x_i, x_{i+Δh}, x_{i+Δv}, x_{i+Δh+Δv}). There are two types of "medium-level" regions (obtained by taking the intersections of adjacent top-level regions). These are horizontal b_{i,i+Δh}(x_i, x_{i+Δh}) and vertical b_{i,i+Δv}(x_i, x_{i+Δv}). There are bottom-level regions corresponding to the pixels, b_i(x_i). To implement CCCP for Kikuchi, we need to specify the direct subregions and the direct superregions. It can be seen that the direct subregions of b_{i,i+Δh,i+Δv,i+Δh+Δv} are b_{i,i+Δh}, b_{i+Δv,i+Δh+Δv}, b_{i,i+Δv}, and b_{i+Δh,i+Δh+Δv}. The direct subregions of b_{i,i+Δh} are b_i and b_{i+Δh}. Those of b_{i,i+Δv} are b_i and b_{i+Δv}. Of course, b_i has no direct (or indirect) subregions.
Figure 11: Kikuchi 2D spin glass. (Left) CCCP solution. (Center and right) Two GBP solutions with different initial conditions. We plot the beliefs b_i(x_i = 0) for all nodes i on the 2D lattice. Observe the similarity of the results on this 50 × 50 grid.
The direct superregions of b_i are b_{i,i+Δh}, b_{i−Δh,i}, b_{i,i+Δv}, and b_{i−Δv,i}. The direct superregions of b_{i,i+Δh} are b_{i,i+Δh,i+Δv,i+Δh+Δv} and b_{i−Δv,i+Δh−Δv,i,i+Δh}. Similarly, the direct superregions of b_{i,i+Δv} are b_{i,i+Δh,i+Δv,i+Δh+Δv} and b_{i−Δh,i,i+Δv−Δh,i+Δv}.

7 Discussion
It is known that BP and GBP algorithms are guaranteed to converge to the correct solution on graphs with no loops. We now argue that CCCP will also converge to the correct answer on such problems. As shown by Yedidia et al. (2000), the fixed points of BP and GBP correspond to extrema of the Bethe and Kikuchi free energies. Hence, for graphs with no loops, there can be only a single extremum of the Bethe and Kikuchi free energies. These free energies are bounded below, and so the extremum must be a minimum. The free energy therefore has a single unique global minimum, and CCCP is then guaranteed to find it.

We also emphasize the generality of the Kikuchi approximation. In this article, we considered distributions P(x_1, ..., x_N | y) that had pairwise interactions only (the highest-order terms were of the form ψ_ij(x_i, x_j)). The Kikuchi approximation, however, can be extended to interactions of arbitrarily high order (e.g., terms such as ψ_{i_1,...,i_N}(x_{i_1}, ..., x_{i_N})). Good results are reported for these cases (Yedidia, private communication, August 2001).

Finally, we observe that both the Bethe and Kikuchi free energies can be reformulated in terms of dual variables (Strang, 1986). This duality is standard for quadratic optimization problems. The BP algorithm can be expressed in terms of dynamics in this dual space. If we take the Bethe free energy and extremize with respect to the variables {b_ij}, {b_i}, it is possible to solve directly for these variables in terms of the Lagrange multipliers {λ_ij}, {γ_ij}. We can then substitute this back into the
Bethe free energy to obtain the dual energy:

G({γ_ij}, {λ_ij}) = −Σ_{i,j: i>j} Σ_{x_i,x_j} ψ_ij(x_i, x_j) e^{−{λ_ij(x_j) + λ_ji(x_i)}} e^{−1} e^{−γ_ij}
  + Σ_i (n_i − 1) Σ_{x_i} ψ_i(x_i) e^{−(1/(n_i−1)) Σ_j λ_ji(x_i)} e^{−1}
  − Σ_{i,j: i>j} γ_ij.    (7.1)
BP is given by the update rule

(∂g/∂e^{−λ_ji(x_i)})(λ_ji(x_i; t+1)) = −(∂f/∂e^{−λ_ji(x_i)})(λ_ji(x_i; t)),
(∂g/∂γ_ij)(γ_ij(t+1)) = −(∂f/∂γ_ij)(γ_ij(t)),    (7.2)
where f(·) and g(·) are the first and second terms in G. By standard properties of duality, extrema of the dual energy G correspond to extrema of the Bethe free energy. Unfortunately, minima do not correspond to maxima (which they would if the Bethe free energy were convex). This means that it is hard to obtain global convergence results by analyzing the dual energy.

Similarly, we can calculate the dual of the Kikuchi free energy. This is of the form

G_K({γ_r}, {λ_{r,s}}) = −Σ_{r∈R} c_r Σ_{x_r} e^{−E_r(x_r)} e^{−1} e^{−γ_r/c_r} e^{−(1/c_r) Σ_{s∈sub_d(r)} λ_{r,s}(x_s)} e^{(1/c_r) Σ_{v∈sup_d(r)} λ_{v,r}(x_r)} − Σ_{r∈R} γ_r.    (7.3)
8 Conclusion
This article introduced CCCP double-loop algorithms that are proved to converge to extrema of the Bethe and Kikuchi free energies. They are convergent alternatives to BP and GBP (whose main failure mode is nonconvergence). We showed that there are similarities between CCCP and BP and GBP. More precisely, the CCCP algorithm updates Lagrange parameter variables {λ, γ} while BP and GBP update messages {m}, but the exponentials of the {λ} correspond to linear combinations of the messages (and the {γ} are implicit in BP and GBP). In BP and GBP, the beliefs {b} are expressed in terms of the messages {m}, but for CCCP, they are expressed in terms of the {λ, γ}. The difference is that for BP and GBP, the update rules for the {m}
do not depend explicitly on the current estimates of the {b}. But the CCCP updates for {λ, γ} do depend on the current {b}, and the estimates of the {b} must be periodically reestimated. We can make BP, GBP, and CCCP very similar by expressing the {b} as a function of the {λ, γ} (collapsing the double loop), although this prevents the convergence proof from holding.

Our computer simulations on spin glasses showed that the CCCP algorithms are stable, converge rapidly, and give solutions as good as, or better than, those found by BP and GBP. (Convergence rates of CCCP were similar to those of BP and GBP when five iterations of the inner loop were used.) In particular, we found that BP often did not converge on 2D spin glasses and never converged on 3D spin glasses. GBP, however, did converge on 2D spin glasses provided inertia was used.

Finally, we stress that the Bethe and Kikuchi free energies are variational techniques that can be applied to a range of inference and learning applications (Yedidia et al., 2000). They are better approximations than standard mean-field theory and hence give better results on certain applications (Weiss, 2001). Mean-field theory algorithms can be very effective on difficult optimization problems (see Rangarajan, Gold, & Mjolsness, 1996). It is highly probable that the Bethe and Kikuchi approximations (minimized with CCCP or with BP and GBP algorithms) will likewise perform well on a large range of applications.

Acknowledgments
I acknowledge helpful conversations with James Coughlan and Huiying Shen. Yair Weiss pointed out a conceptual bug in the original dual formulation. Anand Rangarajan pointed out connections to Legendre transforms. Sabino Ferreira read the manuscript carefully, made many useful comments, and gave excellent feedback. Jonathan Yedidia pointed out a faulty assumption about the {c_r}. Two referees gave helpful feedback. This work was supported by the National Institutes of Health (NEI) with grant number RO1-EY 12691-01. It was also supported by the National Science Foundation with award number IRI-9700446.

References

Arbib, M. (Ed.). (1995). The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo codes (I). In Proc. ICC'93 (pp. 1064–1070). Geneva, Switzerland.
Chiang, M., & Forney, Jr., G. D. (2001). Statistical physics, convex optimization and the sum product algorithm (LIDS Tech. Rep.). Cambridge, MA: Laboratory for Information and Decision Systems, MIT.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Domb, C., & Green, M. S. (Eds.). (1972). Phase transitions and critical phenomena (Vol. 2). London: Academic Press.
Forney, Jr., G. D. (2001). Codes on graphs: News and views. Paper presented at the 2001 Conference on Information Sciences and Systems, Johns Hopkins University, Baltimore, MD.
Freeman, W. T., & Pasztor, E. C. (1999). Learning low level vision. In Proc. International Conference of Computer Vision (pp. 1182–1189). Los Alamitos, CA: IEEE Computer Society.
Frey, B. (1998). Graphical models for pattern classification, data compression and channel coding. Cambridge, MA: MIT Press.
Gallager, R. G. (1963). Low-density parity-check codes. Cambridge, MA: MIT Press.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Kosowsky, J. J. (1995). Flows suspending iterative algorithms. Unpublished doctoral dissertation, Harvard University, Cambridge, MA.
Kosowsky, J. J., & Yuille, A. L. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7, 477–490.
Marcus, C. M., & Westervelt, R. M. (1989). Dynamics of iterated-map neural networks. Physical Review A, 40, 501–504.
McEliece, R. J., Mackay, D. J. C., & Cheng, J. F. (1998). Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.
Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of Uncertainty in AI. San Mateo, CA: Morgan Kaufmann.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo, CA: Morgan Kaufmann.
Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8(5), 1041–1060.
Rangarajan, A., Yuille, A. L., Gold, S., & Mjolsness, E. (1996). A convergence proof for the softassign quadratic assignment problem. In Proceedings of NIPS'96. Snowmass, CO.
Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist., 35, 876–879.
Strang, G. (1986). Introduction to applied mathematics. Wellesley, MA: Wellesley-Cambridge Press.
Teh, Y. W., & Welling, M. (2001). Passing and bouncing messages for generalized inference (Tech. Rep. GCTU 2001-001). London: Gatsby Computational Neuroscience Unit, University College London.
Wainwright, M., Jaakkola, T., & Willsky, A. (2001). Tree-based reparameterization framework for approximate estimation of stochastic processes on graphs with cycles (Tech. Rep. LIDS P-2510). Cambridge, MA: Laboratory for Information and Decision Systems, MIT.
Waugh, F. R., & Westervelt, R. M. (1993). Analog neural networks with local competition: I. Dynamics and stability. Physical Review E, 47(6), 4524–4536.
Weiss, Y. (2001). Comparing the mean field method and belief propagation for approximate inference in MRFs. In M. Opper & D. Saad (Eds.), Advanced mean field methods. Cambridge, MA: MIT Press.
Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000). Bethe free energy, Kikuchi approximations and belief propagation algorithms. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Yuille, A. L., & Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356.
Yuille, A. L., & Rangarajan, A. (2001). The concave-convex procedure (CCCP). In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.

Received March 15, 2001; accepted November 19, 2001.
LETTER
Communicated by Barak Pearlmutter
Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent

Nicol N. Schraudolph
[email protected]
IDSIA, Galleria 2, 6928 Manno, Switzerland, and Institute of Computational Science, ETH Zentrum, 8092 Zürich, Switzerland

We propose a generic method for iteratively approximating various second-order gradient steps (Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient) in linear time per iteration, using special curvature matrix-vector products that can be computed in O(n). Two recent acceleration techniques for on-line learning, matrix momentum and stochastic meta-descent (SMD), implement this approach. Since both were originally derived by very different routes, this offers fresh insight into their operation, resulting in further improvements to SMD.

1 Introduction
Second-order gradient descent methods typically multiply the local gradient by the inverse of a matrix C̄ of local curvature information. Depending on the specific method used, this n × n matrix (for a system with n parameters) may be the Hessian (Newton's method), an approximation or modification thereof (e.g., Gauss-Newton, Levenberg-Marquardt), or the Fisher information (natural gradient; Amari, 1998). These methods may converge rapidly but are computationally quite expensive: the time complexity of common methods to invert C̄ is O(n³), and iterative approximations cost at least O(n²) per iteration if they compute C̄⁻¹ directly, since that is the time required just to access the n² elements of this matrix.

Note, however, that second-order gradient methods do not require C̄⁻¹ explicitly: all they need is its product with the gradient. This is exploited by Yang and Amari (1998) to compute efficiently the natural gradient for multilayer perceptrons with a single output and one hidden layer: assuming independently and identically distributed (i.i.d.) gaussian input, they explicitly derive the form of the Fisher information matrix and its inverse for their system and find that the latter's product with the gradient can be computed in just O(n) steps. However, the resulting algorithm is rather complicated and does not lend itself to being extended to more complex adaptive systems (such as multilayer perceptrons with more than one output or hidden layer), curvature matrices other than the Fisher information, or inputs that are far from i.i.d. gaussian.

Neural Computation 14, 1723–1738 (2002) © 2002 Massachusetts Institute of Technology
1724
Nicol N. Schraudolph
In order to set up a general framework that admits such extensions (and indeed applies to any twice-differentiable adaptive system), we abandon the notion of calculating the exact second-order gradient step in favor of an iterative approximation. The following iteration efficiently approaches v = C̄⁻¹u for an arbitrary vector u (Press, Teukolsky, Vetterling, & Flannery, 1992, page 57):

v_0 = 0;   (∀t ≥ 0)  v_{t+1} = v_t + D(u − C̄ v_t),    (1.1)

where D is a conditioning matrix, chosen close to C̄⁻¹ if possible.
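A minimal sketch of iteration 1.1 (the matrix and vector values below are illustrative, not from the paper): with a diagonal conditioner D, each sweep costs O(n) beyond the matrix-vector product, and v_t approaches C̄⁻¹u.

```python
import numpy as np

# Iteration 1.1: v_{t+1} = v_t + D (u - C v_t), with the diagonal
# conditioner D = diag(C)^{-1} (a simple Jacobi-style choice).
def approx_inverse_product(C, u, n_iter=200):
    D = 1.0 / np.diag(C)            # O(n) diagonal conditioner
    v = np.zeros_like(u)
    for _ in range(n_iter):
        v = v + D * (u - C @ v)     # O(n) per step, given a fast C @ v
    return v

C = np.array([[4.0, 1.0], [1.0, 3.0]])  # toy symmetric, diagonally dominant matrix
u = np.array([1.0, 2.0])
v = approx_inverse_product(C, u)        # approaches C^{-1} u
```

With this conditioner the iteration is a Jacobi-type scheme, so it converges whenever the spectral radius of I − DC is below 1, as in this toy example.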
where D is a conditioning matrix chosen close to CN ¡1 if possible. Note that if we restrict D to be diagonal, all operations in equation 1.1 can be performed in O ( n ) time, except (one would suppose) for the matrix-vector product CN vE t . In fact, there is an O ( n ) method for calculating the product of an n £ n matrix with an arbitrary vector—if the matrix happens to be the Hessian of a system whose gradient can be calculated in O ( n ), as is the case for most adaptive architectures encountered in practice. This fast Hessian-vector product (Pearlmutter, 1994; Werbos, 1988; Møller, 1993) can be used in conjunction with equation 1.1 to create an efcient, iterative O (n ) implementation of Newton’s method. Unfortunately, Newton’s method has severe stability problems when used in nonlinear systems, stemming from the fact that the Hessian may be ill-conditioned and does not guarantee positive deniteness. Practical second-order methods therefore prefer measures of curvature that are better behaved, such as the outer product (Gauss-Newton) approximation of the Hessian, a model-trust region modication of the same (Levenberg, 1944; Marquardt, 1963), or the Fisher information. Below, we dene these matrices in a maximum likelihood framework for regression and classication and describe O ( n ) algorithms for computing the product of any of them with an arbitrary vector for neural network architectures. These curvature matrix-vector products are, in fact, cheaper still than the fast Hessian-vector product and can be used in conjunction with equation 1.1 to implement rapid, iterative, optionally stochastic O (n ) variants of second-order gradient descent methods. The resulting algorithms are very general, practical (i.e., sufciently robust and efcient), far less expensive than the conventional O ( n2 ) and O (n3 ) approaches, and—with the aid of automatic differentiation software tools—comparatively easy to implement (see section 4). 
We then examine two learning algorithms that use this approach: matrix momentum (Orr, 1995; Orr & Leen, 1997) and stochastic meta-descent (Schraudolph, 1999b, 1999c; Schraudolph & Giannakopoulos, 2000). Since both methods were derived by entirely different routes, viewing them as implementations of iteration 1.1 will provide additional insight into their operation and suggest new ways to improve them.
Curvature Matrix-Vector Products
1725
2 Definitions and Notation

Network. A neural network with m inputs, n weights, and o linear outputs is usually regarded as a mapping R^m → R^o from an input pattern x to the corresponding output y, for a given vector w of weights. Here we formalize such a network instead as a mapping N: R^n → R^o from weights to outputs (for given inputs), and write y = N(w). To extend this formalism to networks with nonlinear outputs, we define the output nonlinearity M: R^o → R^o and write z = M(y) = M(N(w)). For networks with linear outputs, M will be the identity map.

Loss function. We consider neural network learning as the minimization of a scalar loss function L: R^o → R defined as the negative log-likelihood L(z) ≡ −log Pr(z) of the output z under a suitable statistical model (Bishop, 1995). For supervised learning, L may also implicitly depend on given targets z* for the outputs. Formally, the loss can now be regarded as a function L(M(N(w))) of the weights, for a given set of inputs and (if supervised) targets.

Jacobian and gradient. The Jacobian J_F of a function F: R^m → R^n is the n × m matrix of partial derivatives of the outputs of F with respect to its inputs. For a neural network defined as above, the gradient, the vector g of derivatives of the loss with respect to the weights, is given by
g ≡ (∂/∂w) L(M(N(w))) = J′_{L∘M∘N} = J_N′ J_M′ J_L′,    (2.1)

where ∘ denotes function composition and ′ the matrix transpose.
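Equation 2.1 can be spot-checked numerically on a toy one-layer network with softmax output and cross-entropy loss (all names and sizes below are my own illustration); for this standard pairing the composite Jacobian product collapses to the familiar residual z − z*.

```python
import numpy as np

# Numeric spot-check of the chain rule g = J_N' J_M' J_L' (equation 2.1) for
# a toy linear net N with softmax output M and cross-entropy loss L.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))        # weights: 3 outputs, 4 inputs
x = rng.normal(size=4)             # fixed input pattern
target = np.array([1.0, 0.0, 0.0]) # one-hot target z*

def loss(w_flat):
    y = w_flat.reshape(3, 4) @ x           # linear net output y = N(w)
    z = np.exp(y) / np.exp(y).sum()        # softmax nonlinearity z = M(y)
    return -np.sum(target * np.log(z))     # cross-entropy loss L(z)

# analytic gradient: for softmax + cross-entropy, J'_{L.M} = z - z*, and
# J_N' maps this residual back through the weights, giving outer(z - z*, x)
y = W @ x
z = np.exp(y) / np.exp(y).sum()
g_analytic = np.outer(z - target, x).ravel()

# numerical gradient by central differences
w0 = W.ravel()
g_numeric = np.array([
    (loss(w0 + 1e-6 * e) - loss(w0 - 1e-6 * e)) / 2e-6
    for e in np.eye(w0.size)
])
# g_analytic and g_numeric agree
```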
Matching loss functions. We say that the loss function L matches the output nonlinearity M iff J′_{L∘M} = Az + b for some A and b not dependent on w.¹ The standard loss functions used in neural network regression and classification (sum-squared error for linear outputs and cross-entropy error for softmax or logistic outputs) are all matching loss functions with A = I (the identity matrix) and b = −z*, so that J′_{L∘M} = z − z* (Bishop, 1995, chapter 6). This will simplify some of the calculations described in section 4.

Hessian. The instantaneous Hessian H_F of a scalar function F: R^n → R is the n × n matrix of second derivatives of F(w) with respect to its inputs w:
H_F ≡ ∂J_F/∂w′,   i.e.,   (H_F)_ij = ∂²F(w) / (∂w_i ∂w_j).    (2.2)
¹ For supervised learning, a similar if somewhat more restrictive definition of matching loss functions is given by Helmbold, Kivinen, and Warmuth (1996) and Auer, Herbster, and Warmuth (1996).
For a neural network as defined above, we abbreviate H ≡ H_{L∘M∘N}. The Hessian proper, which we denote H̄, is obtained by taking the expectation of H over inputs: H̄ ≡ ⟨H⟩_x. For matching loss functions, H_{L∘M} = A J_M = J_M′ A′.

Fisher information. The instantaneous Fisher information matrix F_F of a scalar log-likelihood function F: R^n → R is the n × n matrix formed by the outer product of its first derivatives:
F_F ≡ J_F′ J_F,   i.e.,   (F_F)_ij = (∂F(w)/∂w_i)(∂F(w)/∂w_j).    (2.3)
Note that F_F always has rank one. Again, we abbreviate F ≡ F_{L∘M∘N} = g g′. The Fisher information matrix proper, F̄ ≡ ⟨F⟩_x, describes the geometric structure of weight space (Amari, 1985) and is used in the natural gradient descent approach (Amari, 1998).

3 Extended Gauss-Newton Approximation

Problems with the Hessian. The use of the Hessian in second-order gradient descent for neural networks is problematic. For nonlinear systems, H̄ is not necessarily positive definite, so Newton's method may diverge or even take steps in uphill directions. Practical second-order gradient methods should therefore use approximations or modifications of the Hessian that are known to be reasonably well behaved, with positive semidefiniteness as a minimum requirement.

Fisher information. One alternative that has been proposed is the Fisher information matrix F̄ (Amari, 1998), which, being a quadratic form, is positive semidefinite by definition. On the other hand, F̄ ignores all second-order interactions between system parameters, thus throwing away potentially useful curvature information. By contrast, we shall derive an approximation of the Hessian that is provably positive semidefinite even though it does make use of second derivatives to model Hessian curvature better.

Gauss-Newton. An entire class of popular optimization techniques for nonlinear least-squares problems, as implemented by neural networks with linear outputs and sum-squared loss function, is based on the well-known Gauss-Newton (also referred to as linearized, outer product, or squared Jacobian) approximation of the Hessian. Here we extend the Gauss-Newton approach to other standard loss functions, in particular the cross-entropy loss used in neural network classification, in such a way that even though
some second-order information is retained, positive semidefiniteness can still be proved.

Using the product rule, the instantaneous Hessian of our neural network model can be written as

H = (∂/∂w′) J′_{L∘M∘N} = J_N′ H_{L∘M} J_N + Σ_{i=1}^{o} (J_{L∘M})_i H_{N_i},    (3.1)
where i ranges over the o outputs of N, with N_i denoting the subnetwork that produces the ith output. Ignoring the second term above, we define the extended, instantaneous Gauss-Newton matrix

G ≡ J_N′ H_{L∘M} J_N.    (3.2)
Note that G has rank ≤ o (the number of outputs) and is positive semidefinite, regardless of the choice of architecture for N, provided that H_{L∘M} is. G models the second-order interactions among N's outputs (via H_{L∘M}) while ignoring those arising within N itself (the H_{N_i}). This constitutes a compromise between the Hessian (which models all second-order interactions) and the Fisher information (which ignores them all). For systems with a single linear output and sum-squared error, G reduces to F. For multiple outputs, it provides a richer (rank(G) ≤ o versus rank(F) = 1) model of Hessian curvature.

Standard Loss Functions. For the standard loss functions used in neural network regression and classification, G has additional interesting properties. First, the residual J_{L∘M}' = z − z* vanishes at the optimum for realizable problems, so that the Gauss-Newton approximation, equation 3.2, of the Hessian, equation 3.1, becomes exact in this case. For unrealizable problems, the residuals at the optimum have zero mean; this will tend to make the last term in equation 3.1 vanish in expectation, so that we can still assume Ḡ ≈ H̄ near the optimum. Second, in each case we can show that H_{L∘M} (and hence G, and hence Ḡ) is positive semidefinite. For linear outputs with sum-squared loss (that is, conventional Gauss-Newton), H_{L∘M} is just the identity I; for independent logistic outputs with cross-entropy loss, it is diag[z (1 − z)], positive semidefinite because (∀i) 0 < z_i < 1. For softmax output with cross-entropy loss, we have H_{L∘M} = diag(z) − z z', which is also positive semidefinite since (∀i) z_i > 0 and Σ_i z_i = 1, and thus for all v ∈ ℝ^o

$$ \begin{aligned} v'\,\bigl[\operatorname{diag}(z) - z z'\bigr]\,v \;&=\; \sum_i z_i v_i^2 \;-\; \Bigl(\sum_i z_i v_i\Bigr)^{\!2} \\ &=\; \sum_i z_i v_i^2 \;-\; 2\sum_i z_i v_i \sum_j z_j v_j \;+\; \Bigl(\sum_j z_j v_j\Bigr)^{\!2} \\ &=\; \sum_i z_i \Bigl(v_i - \sum_j z_j v_j\Bigr)^{\!2} \;\ge\; 0. \end{aligned} \qquad (3.3) $$
Model-Trust Region. As long as G is positive semidefinite (as proved above for standard loss functions), the extended Gauss-Newton algorithm will not take steps in uphill directions. However, it may still take very large (even infinite) steps. These may take us outside the model-trust region, the area in which our quadratic model of the error surface is reasonable. Model-trust region methods restrict the gradient step to a suitable neighborhood around the current point. One popular way to enforce a model-trust region is the addition of a small diagonal term to the curvature matrix. Levenberg (1944) suggested adding λI to the Gauss-Newton matrix Ḡ; Marquardt (1963) elaborated the additive term to λ diag(Ḡ). The Levenberg-Marquardt algorithm directly inverts the resulting curvature matrix; where affordable (i.e., for relatively small systems), it has become today's workhorse of nonlinear least-squares optimization.

4 Fast Curvature Matrix-Vector Products
We now describe algorithms that compute the product of F, G, or H with an arbitrary n-dimensional vector v in O(n). They can be used in conjunction with equation 1.1 to implement rapid and (if so desired) stochastic versions of various second-order gradient descent methods, including Newton's method, Gauss-Newton, Levenberg-Marquardt, and natural gradient descent.

4.1 The Passes. The fast matrix-vector products are all constructed from the same set of passes in which certain quantities are propagated through all or part of our neural network model (comprising N, M, and L) in forward or reverse direction. For implementation purposes, it should be noted that automatic differentiation software tools (see http://www-unix.mcs.anl.gov/autodiff/) can automatically produce these passes from a program implementing the basic forward pass f0.
f0. This is the ordinary forward pass of a neural network, evaluating the function F(w) it implements by propagating activity (i.e., intermediate results) forward through F.

r1. The ordinary backward pass of a neural network, calculating J_F' u by propagating the vector u backward through F. This pass uses intermediate results computed in the f0 pass.

f1. Following Pearlmutter (1994), we define the Gateaux derivative

$$ R_v\bigl(F(w)\bigr) \;\equiv\; \left.\frac{\partial F(w + r\,v)}{\partial r}\right|_{r=0} \;=\; J_F\,v, \qquad (4.1) $$

which describes the effect on a function F(w) of a weight perturbation in the direction of v. By pushing R_v, which obeys the usual rules for differential operators, down into the equations of the forward pass f0, one obtains an efficient procedure to calculate J_F v from v. (See Pearlmutter, 1994, for details and examples.) This f1 pass uses intermediate results from the f0 pass.
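The defining limit in equation 4.1 also yields a cheap numerical check for any f1 implementation: J_F v is the derivative of F(w + rv) at r = 0, which a central difference approximates. A sketch on a hypothetical toy network (the architecture, sizes, and names below are illustrative assumptions, not from the paper):

```python
import numpy as np

# Finite-difference check of the Gateaux derivative R_v(F(w)) = J_F v from
# equation 4.1: perturb the weights along v and differentiate with respect to r.

def F(w, x):
    """Toy network: 2 inputs -> 2 tanh hidden units -> 1 linear output."""
    W1 = w[:4].reshape(2, 2)
    w2 = w[4:6]
    return w2 @ np.tanh(W1 @ x)

rng = np.random.default_rng(1)
w = rng.normal(size=6)
v = rng.normal(size=6)
x = rng.normal(size=2)

# Central difference approximates d/dr F(w + r v) at r = 0, i.e., J_F(w) v.
r = 1e-5
Jv_fd = (F(w + r * v, x) - F(w - r * v, x)) / (2 * r)

# Exact J_F v for this toy model, by the chain rule.
W1, w2 = w[:4].reshape(2, 2), w[4:6]
V1, v2 = v[:4].reshape(2, 2), v[4:6]
h = np.tanh(W1 @ x)
Jv_exact = v2 @ h + w2 @ ((1 - h**2) * (V1 @ x))

print(np.allclose(Jv_fd, Jv_exact, atol=1e-7))
```

An actual f1 pass computes the same quantity exactly, in one forward sweep, by propagating R_v of each intermediate result alongside the f0 activity.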
r2. When the R_v operator is applied to the r1 pass for a scalar function F, one obtains an efficient procedure for calculating the Hessian-vector product H_F v = R_v(J_F'). (See Pearlmutter, 1994, for details and examples.) This r2 pass uses intermediate results from the f0, f1, and r1 passes.

4.2 The Algorithms. The first step in all three matrix-vector products is computing the gradient g of our neural network model by standard backpropagation:
Gradient. g ≡ J_{L∘M∘N}' is computed by an f0 pass through the entire model (N, M, and L), followed by an r1 pass propagating u = 1 back through the entire model (L, M, then N). For matching loss functions, there is a shortcut: since J_{L∘M}' = Az + b, we can limit the forward pass to N and M (to compute z), then r1-propagate u = Az + b back through just N.

Fisher Information. To compute Fv = g g'v, multiply the gradient g by the inner product between g and v. If there is no random access to g or v (that is, if their elements can be accessed only through passes like the above), the scalar g'v can instead be calculated by f1-propagating v forward through the model (N, M, and L). This step is also necessary for the other two matrix-vector products.

Hessian. After f1-propagating v forward through N, M, and L, r2-propagate R_v(1) = 0 back through the entire model (L, M, then N) to obtain Hv = R_v(g) (Pearlmutter, 1994). For matching loss functions, the shortcut is to f1-propagate v through just N and M to obtain R_v(z), then r2-propagate R_v(J_{L∘M}') = A R_v(z) back through just N.

Gauss-Newton. Following the f1 pass, r2-propagate R_v(1) = 0 back through L and M to obtain R_v(J_{L∘M}') = H_{L∘M} J_N v, then r1-propagate that back through N, giving Gv. For matching loss functions, we do not require an r2 pass. Since

$$ G \;=\; J_N'\,H_{L\circ M}\,J_N \;=\; J_N'\,J_M'\,A'\,J_N, \qquad (4.2) $$

we can limit the f1 pass to N, multiply the result with A', then r1-propagate it back through M and N. Alternatively, one may compute the equivalent Gv = J_N' A J_M J_N v by continuing the f1 pass through M, multiplying with A, then r1-propagating back through N only.

Batch Average. To calculate the product of a curvature matrix C̄ ≡ ⟨C⟩_x, where C is one of F, G, or H, with vector v, average the instantaneous product Cv over all input patterns x (and associated targets z*, if applicable) while holding v constant. For large training sets or nonstationary streams of data, it is often preferable to estimate C̄v by averaging over "mini-batches" of (typically) just 5 to 50 patterns.

4.3 Computational Cost. Table 1 summarizes, for a number of gradient descent methods, their choice of curvature matrix C, the passes needed (for a matching loss function) to calculate both the gradient g and the fast matrix-vector product C̄v, and the associated computational cost in terms of floating-point operations (flops) per weight and pattern in a multilayer perceptron. These figures ignore certain optimizations (e.g., not propagating gradients back to the inputs) and assume that any computation at the network's nodes is dwarfed by that required for the weights. Computing both gradient and curvature matrix-vector product is typically about two to three times as expensive as calculating the gradient alone.

Table 1: Choice of Curvature Matrix C for Various Gradient Descent Methods, Passes Needed to Compute Gradient g and Fast Matrix-Vector Product C̄v, and Associated Cost (for a Multilayer Perceptron) in Flops per Weight and Pattern.

    Pass:                     f0      r1      f1      r2      Cost (for g and C̄v)
    result:                   F(w)    J_F'u   J_F v   H_F v
    cost:                     2       3       4       7
    C = I  simple gradient    ✓       ✓                       6
    C = F  natural gradient   ✓       ✓       ✓               10
    C = G  Gauss-Newton       ✓       ✓✓      ✓       (✓)     14
    C = H  Newton's method    ✓       ✓       ✓       ✓       18

(A ✓ marks a pass used by the method; ✓✓ denotes two such passes, and (✓) a pass required only for nonmatching loss functions.)
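The O(n) versus O(n²) distinction behind the Fisher-vector shortcut above can be made concrete (a sketch with random stand-in vectors; g here is an arbitrary vector, not an actual network gradient):

```python
import numpy as np

# The instantaneous Fisher-vector product from section 4.2: F v = g (g'v)
# costs O(n), versus O(n^2) for forming the rank-one matrix F = g g' explicitly.

rng = np.random.default_rng(2)
n = 1000
g = rng.normal(size=n)        # stand-in for the gradient
v = rng.normal(size=n)        # arbitrary direction

Fv_fast = g * (g @ v)         # O(n): inner product g'v, then scale g
Fv_full = np.outer(g, g) @ v  # O(n^2): explicit outer product, for comparison

print(np.allclose(Fv_fast, Fv_full))
```

For n in the millions, only the first form remains affordable, which is the point of the fast matrix-vector products described in this section.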
In combination with iteration 1.1, however, one can use the O(n) matrix-vector product to implement second-order gradient methods whose rapid convergence more than compensates for the additional cost. We describe two such algorithms in the following section.

5 Rapid Second-Order Gradient Descent
We know of two neural network learning algorithms that combine the O(n) curvature matrix-vector product with iteration 1.1 in some form: matrix momentum (Orr, 1995; Orr & Leen, 1997) and our own stochastic meta-descent (Schraudolph, 1999b, 1999c; Schraudolph & Giannakopoulos, 2000). Since both of these were derived by entirely different routes, we gain fresh insight into their operation by examining how they implement equation 1.1.

5.1 Stochastic Meta-Descent. Stochastic meta-descent (SMD; Schraudolph, 1999b, 1999c) is a new on-line algorithm for local learning rate adaptation. It updates the weights w by simple gradient descent:

$$ w_{t+1} \;=\; w_t \;-\; \operatorname{diag}(p_t)\,g. \qquad (5.1) $$
The vector p of local learning rates is adapted multiplicatively,

$$ p_t \;=\; \operatorname{diag}(p_{t-1})\,\max\!\Bigl(\tfrac{1}{2},\; 1 + \mu\,\operatorname{diag}(v_t)\,g\Bigr), \qquad (5.2) $$
using a scalar meta-learning rate μ. Finally, the auxiliary vector v used in equation 5.2 is itself updated iteratively via

$$ v_{t+1} \;=\; \lambda\,v_t \;+\; \operatorname{diag}(p_t)\,(g - \lambda\,C\,v_t), \qquad (5.3) $$
where 0 ≤ λ ≤ 1 is a forgetting factor for nonstationary tasks. Although derived as part of a dual gradient descent procedure (minimizing loss with respect to both w and p), equation 5.3 implements an interesting variation of equation 1.1. SMD thus employs rapid second-order techniques indirectly to help adapt local learning rates for the gradient descent in weight space.

Linearization. The learning rate update, equation 5.2, minimizes the system's loss with respect to p by exponentiated gradient descent (Kivinen & Warmuth, 1995), but has been relinearized in order to avoid the computationally expensive exponentiation operation (Schraudolph, 1999a). The particular linearization used, e^u ≈ max(ρ, 1 + u), is based on a first-order Taylor expansion about u = 0, bounded below by 0 < ρ < 1 so as to safeguard against unreasonably small (and, worse, negative) multipliers for p. The value of ρ determines the maximum permissible learning rate reduction; we follow many other step-size control methods in setting this to ρ = 1/2, the
ratio between optimal and maximum stable step size in a symmetric bowl. Compared to direct exponentiated gradient descent, our linearized version, equation 5.2, thus dampens radical changes (in both directions) to p that may occasionally arise due to the stochastic nature of the data.

Diagonal, Adaptive Conditioner. For λ = 1, SMD's update of v, equation 5.3, implements equation 1.1 with the diagonal conditioner D = diag(p). Note that the learning rates p are being adapted so as to make the gradient step diag(p) g as effective as possible. A well-adapted p will typically make this step similar to the second-order gradient step C̄⁻¹ g. In this restricted sense, we can regard diag(p) as an empirical diagonal approximation of C̄⁻¹, making it a good choice for the conditioner D in iteration 1.1.

Initial Learning Rates. Although SMD is very effective at adapting local learning rates to changing requirements, it is nonetheless sensitive to their initial values. All three of its update rules rely on p for their conditioning, so initial values that are very far from optimal are bound to cause problems: divergence if they are too high, lack of progress if they are too low. A simple architecture-dependent technique such as tempering (Schraudolph & Sejnowski, 1996) should usually suffice to initialize p adequately; the fine tuning can be left to the SMD algorithm.

Model-Trust Region. For λ < 1, the stochastic fixed point of equation 5.3 is no longer v → C⁻¹ g, but rather
(5.4)
This clearly implements a model-trust region approach, in that a diagonal matrix is being added (in small proportion) to C before inverting it. Moreover, the elements along the diagonal are not all identical as in Levenberg's (1944) method, but scale individually as suggested by Marquardt (1963). The scaling factors are determined by 1/p rather than diag(C̄), as the Levenberg-Marquardt method would have it, but these two vectors are related by our above argument that p is a diagonal approximation of C̄⁻¹. For λ < 1, SMD's iteration 5.3 can thus be regarded as implementing an efficient stochastic variant of the Levenberg-Marquardt model-trust region approach.

Benchmark Setup. We illustrate the behavior of SMD with empirical data obtained on the "four regions" benchmark (Singhal & Wu, 1989): a fully connected feedforward network N with two hidden layers of 10 units each (see Figure 1, right) is to classify two continuous inputs in the range [−1, 1] into four disjoint, nonconvex regions (see Figure 1, left). We use the standard softmax output nonlinearity M with matching cross-entropy loss L, meta-learning rate μ = 0.05, initial learning rates p_0 = 0.1, and a hyperbolic
Figure 1: The four regions task (left), and the network we trained on it (right).
tangent nonlinearity on the hidden units. For each run, the 184 weights (including bias weights for all units) are initialized to uniformly random values in the range [−0.3, 0.3]. Training patterns are generated on-line by drawing independent, uniformly random input samples; since each pattern is seen only once, the empirical loss provides an unbiased estimate of generalization ability. Patterns are presented in mini-batches of 10 each so as to reduce the computational overhead associated with SMD's parameter updates 5.1, 5.2, and 5.3. (In exploratory experiments, comparative results when training fully on-line, i.e., pattern by pattern, were noisier but not substantially different.)

Curvature Matrix. Figure 2 shows loss curves for SMD with λ = 1 on the four regions problem, starting from 25 different random initial states, using the Hessian, Fisher information, and extended Gauss-Newton matrix, respectively, for C in equation 5.3. With the Hessian (left), 80% of the runs diverge, most of them early on, when the risk that H is not positive definite is greatest. When we guarantee positive semidefiniteness by switching to the Fisher information matrix (center), the proportion of diverged runs drops to 20%; those runs that still diverge do so only relatively late. Finally, for our extended Gauss-Newton approximation (right), only a single run diverges, illustrating the benefit of retaining certain second-order terms while preserving positive semidefiniteness. (For comparison, we cannot get matrix momentum to converge at all on anything as difficult as this benchmark.)

Stability. In contrast to matrix momentum, the high stochasticity of v affects the weights in SMD only indirectly, being buffered, and largely
Figure 2: Loss curves for 25 runs of SMD with λ = 1, when using the Hessian (left), the Fisher information (center), or the extended Gauss-Newton matrix (right) for C in equation 5.3. Vertical spikes indicate divergence.
averaged out, by the incremental update 5.2 of learning rates p. This makes SMD far more stable, especially when G is used as the curvature matrix. Its residual tendency to misbehave occasionally can be suppressed further by slightly lowering λ so as to create a model-trust region. By curtailing the memory of iteration 5.3, however, this approach can compromise the rapid convergence of SMD. Figure 3 illustrates the resulting stability-performance trade-off on the four regions benchmark: When using the extended Gauss-Newton approximation, a small reduction of λ to 0.998 (solid line) is sufficient to prevent divergence, at a moderate cost in performance relative to λ = 1 (dashed, plotted up to the earliest point of divergence). When the Hessian is used, by contrast, λ must be set as low as 0.95 to maintain stability, and convergence is slowed much further (dash-dotted). Even so, this is still significantly faster than the degenerate case of λ = 0 (dotted), which in effect implements IDD (Harmon & Baird, 1996), to our knowledge the best on-line method for local learning rate adaptation preceding SMD. From these experiments, it appears that memory (i.e., λ close to 1) is key to achieving the rapid convergence characteristic of SMD. We are now investigating more direct ways to keep iteration 5.3 under control, aiming to ensure the stability of SMD while maintaining its excellent performance near λ = 1.

5.2 Matrix Momentum. The investigation of asymptotically optimal adaptive momentum for first-order stochastic gradient descent (Leen & Orr, 1994) led Orr (1995) to propose the following weight update:

$$ w_{t+1} \;=\; w_t + v_{t+1}, \qquad v_{t+1} \;=\; v_t \;-\; \mu\,(\rho_t\,g + C\,v_t), \qquad (5.5) $$
where μ is a scalar constant less than the inverse of C̄'s largest eigenvalue, and ρ_t a rate parameter that is annealed from one to zero. We recognize
Figure 3: Average loss over 25 runs of SMD for various combinations of curvature matrix C and forgetting factor λ. Memory (λ → 1) accelerates convergence over the conventional memory-less case λ = 0 (dotted) but can lead to instability. With the Hessian H, all 25 runs remain stable up to λ = 0.95 (dot-dashed line); using the extended Gauss-Newton matrix G pushes this limit up to λ = 0.998 (solid line). The curve for λ = 1 (dashed line) is plotted up to the earliest point of divergence.
equation 1.1 with scalar conditioner D = μ and stochastic fixed point v → −ρ_t C⁻¹ g; thus, matrix momentum attempts to approximate partial second-order gradient steps directly via this fast, stochastic iteration.

Rapid Convergence. Orr (1995) found that in the late, annealing phase of learning, matrix momentum converges at optimal (second-order) asymptotic rates; this has been confirmed by subsequent analysis in a statistical mechanics framework (Rattray & Saad, 1999; Scarpetta, Rattray, & Saad, 1999). Moreover, compared to SMD's slow, incremental adaptation of learning rates, matrix momentum's direct second-order update of the weights promises a far shorter initial transient before rapid convergence sets in. Matrix momentum thus looks like the ideal candidate for a fast O(n) stochastic gradient descent method.

Instability. Unfortunately, matrix momentum has a strong tendency to diverge for nonlinear systems when far from an optimum, as is the case during the search phase of learning. Current implementations therefore rely on simple (first-order) stochastic gradient descent initially, turning on matrix momentum only once the vicinity of an optimum has been reached (Orr,
1995; Orr & Leen, 1997). The instability of matrix momentum is not caused by a lack of semidefiniteness on the part of the curvature matrix: Orr (1995) used the Gauss-Newton approximation, and Scarpetta et al. (1999) reached similar conclusions for the Fisher information matrix. Instead, it is thought to be a consequence of the noise inherent in the stochastic approximation of the curvature matrix (Rattray & Saad, 1999; Scarpetta et al., 1999). Recognizing matrix momentum as implementing the same iteration 1.1 as SMD suggests that its stability might be improved in similar ways, specifically, by incorporating a model-trust region parameter λ and an adaptive diagonal conditioner. However, whereas in SMD such a conditioner was trivially available in the vector p of local learning rates, here it is by no means easy to construct, given our restriction to O(n) algorithms, which are affordable for very large systems. We are investigating several routes toward a stable, adaptively conditioned form of matrix momentum.
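The matrix momentum iteration 5.5 can be sketched on a simple deterministic quadratic loss (all constants below are illustrative, not from the paper, and the stochastic noise that causes the instability discussed above is deliberately absent): with fixed ρ and μ below the inverse of C's largest eigenvalue, the auxiliary vector v tracks −ρ C⁻¹g, so w takes approximate second-order steps.

```python
import numpy as np

# Deterministic sketch of matrix momentum (equation 5.5) on the quadratic
# loss L(w) = 0.5 w'C w, whose gradient is g = C w and whose minimum is w = 0.

C = np.diag([1.0, 5.0, 10.0])      # curvature of the toy quadratic
w = np.array([1.0, 1.0, 1.0])
v = np.zeros(3)
mu, rho = 0.09, 1.0                # mu < 1 / (largest eigenvalue of C) = 0.1

for _ in range(2000):
    g = C @ w                      # exact gradient (no sampling noise here)
    v = v - mu * (rho * g + C @ v) # matrix momentum update of v
    w = w + v                      # weight update

print(np.allclose(w, 0.0, atol=1e-6))   # converged to the minimum
```

In the paper's setting, g and the product Cv would instead be the instantaneous, pattern-by-pattern quantities from section 4, which is where the noise-driven divergence enters.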
6 Summary
We have extended the notion of Gauss-Newton approximation of the Hessian from nonlinear least-squares problems to arbitrary loss functions, and shown that it is positive semidefinite for the standard loss functions used in neural network regression and classification. We have given algorithms that compute the product of either the Fisher information or our extended Gauss-Newton matrix with an arbitrary vector in O(n), similar to but even cheaper than the fast Hessian-vector product described by Pearlmutter (1994). We have shown how these fast matrix-vector products may be used to construct O(n) iterative approximations to a variety of common second-order gradient algorithms, including the Newton, natural gradient, Gauss-Newton, and Levenberg-Marquardt steps. Applying these insights to our recent SMD algorithm (Schraudolph, 1999b), specifically, replacing the Hessian with our extended Gauss-Newton approximation, resulted in improved stability and performance. We are now investigating whether matrix momentum (Orr, 1995) can similarly be stabilized through the incorporation of an adaptive diagonal conditioner and a model-trust region parameter.
Acknowledgments
I thank Jenny Orr, Barak Pearlmutter, and the anonymous reviewers for their helpful suggestions, and the Swiss National Science Foundation for the financial support provided under grant number 2000–052678.97/1.
References

Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Auer, P., Herbster, M., & Warmuth, M. K. (1996). Exponentially many local minima for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 316–322). Cambridge, MA: MIT Press.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Harmon, M. E., & Baird III, L. C. (1996). Multi-player residual advantage learning with general function approximation (Tech. Rep. No. WL-TR-1065). Wright-Patterson Air Force Base, OH: Wright Laboratory, WL/AACF. Available on-line: www.leemon.com/papers/sim tech/sim tech.pdf.
Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1996). Worst-case loss bounds for single neurons. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 309–315). Cambridge, MA: MIT Press.
Kivinen, J., & Warmuth, M. K. (1995). Additive versus exponentiated gradient updates for linear prediction. In Proc. 27th Annual ACM Symposium on Theory of Computing (pp. 209–218). New York: Association for Computing Machinery.
Leen, T. K., & Orr, G. B. (1994). Optimal stochastic search and adaptive momentum. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 477–484). San Mateo, CA: Morgan Kaufmann.
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, 2(2), 164–168.
Marquardt, D. (1963). An algorithm for least-squares estimation of non-linear parameters. Journal of the Society of Industrial and Applied Mathematics, 11(2), 431–441.
Møller, M. F. (1993). Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(n) time (Tech. Rep. No. DAIMI PB-432). Århus, Denmark: Computer Science Department, Århus University. Available on-line: www.daimi.au.dk/PB/432/PB432.ps.gz.
Orr, G. B. (1995). Dynamics and algorithms for stochastic learning. Doctoral dissertation, Oregon Graduate Institute, Beaverton. Available on-line: ftp://neural.cse.ogi.edu/pub/neural/papers/orrPhDch1-5.ps.Z, orrPhDch6-9.ps.Z.
Orr, G. B., & Leen, T. K. (1997). Using curvature information for fast stochastic search. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Rattray, M., & Saad, D. (1999). Incorporating curvature information into on-line learning. In D. Saad (Ed.), On-line learning in neural networks (pp. 183–207). Cambridge: Cambridge University Press.
Scarpetta, S., Rattray, M., & Saad, D. (1999). Matrix momentum for practical natural gradient learning. Journal of Physics A, 32, 4047–4059.
Schraudolph, N. N. (1999a). A fast, compact approximation of the exponential function. Neural Computation, 11(4), 853–862. Available on-line: www.inf.ethz.ch/~schraudo/pubs/exp.ps.gz.
Schraudolph, N. N. (1999b). Local gain adaptation in stochastic gradient descent. In Proceedings of the 9th International Conference on Artificial Neural Networks (pp. 569–574). Edinburgh, Scotland: IEE. Available on-line: www.inf.ethz.ch/~schraudo/pubs/smd.ps.gz.
Schraudolph, N. N. (1999c). Online learning with adaptive local step sizes. In M. Marinaro & R. Tagliaferri (Eds.), Neural Nets—WIRN Vietri-99: Proceedings of the 11th Italian Workshop on Neural Nets (pp. 151–156). Berlin: Springer-Verlag.
Schraudolph, N. N., & Giannakopoulos, X. (2000). Online independent component analysis with local learning rate adaptation. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 789–795). Cambridge, MA: MIT Press. Available on-line: www.inf.ethz.ch/~schraudo/pubs/smdica.ps.gz.
Schraudolph, N. N., & Sejnowski, T. J. (1996). Tempering backpropagation networks: Not all weights are created equal. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (pp. 563–569). Cambridge, MA: MIT Press. Available on-line: www.inf.ethz.ch/~schraudo/pubs/nips95.ps.gz.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman filter. In D. S. Touretzky (Ed.), Advances in neural information processing systems (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Werbos, P. J. (1988). Backpropagation: Past and future. In Proceedings of the IEEE International Conference on Neural Networks, San Diego, 1988 (Vol. I, pp. 343–353). Long Beach, CA: IEEE Press.
Yang, H. H., & Amari, S. (1998). Complexity issues in natural gradient descent method for training multilayer perceptrons. Neural Computation, 10(8), 2137–2157.

Received December 21, 2000; accepted November 12, 2001.
LETTER
Communicated by Tony Plate
Representation and Extrapolation in Multilayer Perceptrons Antony Browne
[email protected] School of Computing, Information Systems and Mathematics, London Guildhall University, London U.K. To give an adequate explanation of cognition and perform certain practical tasks, connectionist systems must be able to extrapolate. This work explores the relationship between input representation and extrapolation, using simulations of multilayer perceptrons trained to model the identity function. It has been discovered that representation has a marked effect on extrapolation. 1 Introduction
There are many different definitions of extrapolation used in different fields. The dictionary definitions of extrapolation are: "1: (mathematical) 'To estimate (a value of a function or measurement) beyond the values already known, by the extension of a curve', and 2: 'To infer (something not known) by using but not strictly deducing from the known facts'" (Hanks, McLeod, & Urdang, 1986). In connectionist modeling, there are two main uses of the term extrapolation. One is extrapolation in time, where the output of a network is taken as the prediction of a value (or values) a number of time steps ahead of the current time step (for an example of this usage, see Ensley & Nelson, 1992). An alternative definition of extrapolation with relevance to connectionism is "generalisation performance to novel inputs lying outside the dynamic range of the data that the network was trained on" (compare with interpolation, which is "generalisation performance to novel inputs lying within the dynamic range of the data the network was trained on"). Here, a vector X is in the dynamic range of a set of vectors Y1, Y2, . . . , YM if and only if, for every dimension i of the vector space, x_i is within the range of values occurring in dimension i of the Y vectors (i.e., if and only if there exist j and k such that y_{ji} ≤ x_i ≤ y_{ki}). This is the definition of extrapolation used in this article (but see section 4 for extrapolation in reference to convex hulls). Extrapolation is a difficult topic to study; given a set of training data points representing a function to be modeled, data outside the range of the training data may be well behaved, in that their distribution still follows that expected from the underlying function represented by the training data points, or badly behaved, in that something wholly unexpected happens to the distribution outside the range of the training data.

Neural Computation 14, 1739–1754 (2002) © 2002 Massachusetts Institute of Technology
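The dynamic-range definition above can be rendered as a short predicate (a sketch; the function name and the vectors below are made up for illustration):

```python
import numpy as np

# x is in the dynamic range of the rows of Y if, for every dimension i,
# x[i] lies within [min_j Y[j, i], max_j Y[j, i]] -- i.e., there exist
# rows j and k with Y[j, i] <= x[i] <= Y[k, i].

def in_dynamic_range(x, Y):
    return bool(np.all((x >= Y.min(axis=0)) & (x <= Y.max(axis=0))))

Y = np.array([[0.0, 2.0],
              [1.0, 5.0],
              [0.5, 3.0]])

print(in_dynamic_range(np.array([0.7, 4.0]), Y))   # within both dimensions
print(in_dynamic_range(np.array([1.5, 4.0]), Y))   # outside dimension 0
```

By this definition, testing the first vector probes interpolation and testing the second probes extrapolation.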
Although extrapolation is a difficult topic, it is an important one to study for the following reasons: Biological organisms have to extrapolate, as they often encounter data that lie beyond the range of their previous experience. To do nothing in these circumstances could conceivably threaten the survival of the organism. In many practical tasks (including the forecasting of economic indicators such as currency exchange rates), input values may fluctuate beyond the ranges available when training the neural network model. In such situations, it may be sufficient to ignore the model (and then retrain it, including the new data). However, in other situations, it may be necessary for the model to make a best guess. Obviously, models with better extrapolation capabilities will be more desirable in these circumstances. Three major questions can be asked about extrapolation and neural networks:

1. How can the extrapolation properties of neural networks, such as multilayer perceptrons (MLPs), be improved?

2. What is the relationship between input representation and extrapolation?

3. How can neural networks be manipulated to give good models of human extrapolation performance?

Extrapolation has been investigated in an experiment designed as a critique of eliminative connectionism, where Marcus (1998) demonstrated that MLPs had poor extrapolation abilities. In this experiment, strings of 10 binary digits were presented to an autoassociative MLP trained to perform the identity function. This network was trained on the 512 binary strings representing even numbers in the range 0, . . . , 1022 (i.e., 0 0 0 0 0 0 0 0 0 0 . . . 1 1 1 1 1 1 1 1 1 0). The test (extrapolation) set consisted of the 512 odd numbers in the range 1, . . . , 1023 (i.e., 0 0 0 0 0 0 0 0 0 1 . . . 1 1 1 1 1 1 1 1 1 1).
Marcus found that the network did not extrapolate the identity function to odd numbers (even after trying models using different learning rates, numbers of hidden units, numbers of layers, and different presentation sequences of training examples). Instead, the network would respond incorrectly; for example, it would typically respond to the input [1 1 1 1 1 1 1 1 1 1] with the output [1 1 1 1 1 1 1 1 1 0]. Marcus points out that it is unsurprising that the network described above is unable to generalize outside the training space, as "informally, a unit that never sees a given feature value is akin to a node that is freshly added to a network after training" (p. 263). In the above training patterns, the right-most input digit in the 10-bit string was always set to zero in the training set. Given the way that backpropagation (and many other error-driven training methods) operates, the weights from this input unit were not changed by the training algorithm, as this input takes no part in the performance of the network. When this input is set to 1 during presentation of the extrapolation test set, it is unsurprising that the autoassociator does not produce the correct (odd-numbered) outputs. Such a problem can be described as statistically neutral (Thornton, 1996), as the network is attempting to learn the value of the right-most output when the nine other input units have a probability of 0.5 toward that output unit. Such problems are hard, as searching for dependencies between specific input variables (i.e., the other nine input units) and the right-most output unit is fruitless. Marcus (1998) goes on to state:

Though the training space itself is defined by the nature of the input and output representations, the limitation applies to a wide variety of possible representations. Regardless of the input representation chosen (so long as the units represent a finite number of distinct values), there will always be features that lie outside the training space. Any feature that lies outside the training space will not be generalised properly—regardless of the learning rate, the number of hidden units or the number of hidden layers. (p. 265)

It is the above statement that this article takes issue with, attempting to prove that the representation of a problem has an effect on the extrapolation properties of a connectionist system. It seems that a major problem with the model described above is the representational scheme used.
The input representation that Marcus used can be thought of as being both localist and distributed. It can be thought of as localist in that there is one input node for one concept (in this case, each input node represents a concept such as a power of two) but also as distributed because the whole number is being represented across all of the inputs. Such a representation, although designed to tackle an extrapolation task, makes it impossible for the network to extrapolate. If a particular input concept is absent during training (such as the right-most digit being always set to 0 in the bit strings described above), then its presence during testing, when the network has been subjected to the deficiencies of backpropagation learning described above, will inevitably cause errors. A pertinent question is, If the representation used in this problem is changed, is it possible for the network to extrapolate?
1742
Antony Browne
2 Alternative Representations
Different kinds of representational schemes exist, such as those using distributed representations (Smolensky, 1990). Such schemes look attractive with respect to the task described above, in that if a concept is distributed across many or all of the inputs, then its absence during training will be reflected in many of the nodes making up the input representation, and its presence during the testing of extrapolation will also be reflected in many of these nodes. There have been many attempts to define distribution in connectionist representations, such as microfeatures (Hinton, 1990) and coarse coding (Rosenfeld & Touretzky, 1988). Perhaps the most formal notions of distribution have been given by Hinton, McClelland, and Rumelhart (1986) and van Gelder (1991), who described distributed representations with respect to their extendedness and superposition. For a representation to be extended, the things being represented must be represented over many units, or more generally over some relatively extended proportion of the available resources in the system. In the model described above, the representation of the concept even-odd (signified by the right-most bit of the bit string) would be extended if it were distributed over more than one of the input units. A representation can be said to be superpositional if it represents many items using the same resources. For the representation to be superpositional in the model described above, each input unit would somehow have to represent more than one of the concepts represented by the bit-string positions. Note that these qualities of extendedness and superposition seem attractive when attempting to perform the extrapolation task described. They imply that the network cannot omit the representation of a single concept (such as even-oddness) by ignoring a single unit when training, as the representation of that concept may be spread over many input units.
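The practical effect of extendedness can be illustrated with a toy transformation (anticipating the schemes of section 2.1): multiplying a localist bit vector by a dense random matrix spreads a change in any single concept across every component of the resulting code. A sketch, in which the matrix R and its scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
R = 0.1 * rng.standard_normal((10, 10))  # dense random transform (illustrative scale)

even = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype=float)
odd = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)

# Localist code: the two patterns differ in exactly one element.
print(int(np.sum(even != odd)))  # 1

# Distributed code: the single-bit difference is spread over every component,
# since (R @ odd) - (R @ even) = R[:, 9], which is dense (almost surely).
print(int(np.sum(R @ even != R @ odd)))
```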
Representations in a standard feedforward network can be both extended and superposed (Sharkey, 1992), as the representation of each input may be distributed over many hidden units, and each hidden unit may be representing several inputs. Such representations, and those developed by other means, have been of great interest to connectionists when trying to answer criticisms posited by symbolic artificial intelligence researchers (see Niklasson & van Gelder, 1994; Browne, 1998a, 1998b; Browne & Sun, 2001). However, the intention in this article is not to model the symbol processing capabilities of symbolic artificial intelligence systems; instead, it is to examine alternative representations with respect to their effects on extrapolation.

2.1 Generating Alternative Representations. One approach would be to train an autoassociative MLP to generate distributed input representations by using the hidden layer of this as input to an extrapolation model. However, in actuality, this approach would not work for the extrapolation task described above, as it would be necessary to train this autoassociator with both the training and extrapolation sets so as not to run into identical problems (caused by one input always being set to zero) to those described for the model that Marcus developed. What are needed are simple techniques for generating distributed representations, which can generate representations for a training set (without the model being exposed to an extrapolation set) and can then (independently) be used at a later date to generate an extrapolation set. Three schemes were examined:

Random matrix transformation (Schönemann, 1987; Plate, 1994). An input vector $X$ with $n$ elements (indexed $x_0, x_1, \ldots, x_{n-1}$) is transformed by multiplying it by a matrix $R$ of random numbers to give another vector $Y$. The MLP can be trained on the 512 $Y$ input vectors produced from the training set and then tested on a set of extrapolation vectors generated using the same matrix $R$.

Circular Convolution. This technique has been used by connectionist researchers for representing embedded structure in connectionist representations (Plate, 1995) and analogical processing (Plate, 2000). However, here there are two different uses of convolution. Instead of convolving one vector with another to generate embedding, here a vector is convolved with either a randomly generated vector (the same random vector being convolved with all of the vectors in the training and extrapolation set) or with itself to generate a distributed representation. Convolving a vector with itself appears to be an attractive scheme, as it can produce a distributed representation without the need for the existence of or manipulation of an external parameter (such as the distribution of values in the random matrix or vector in the two other schemes). To illustrate the case where a vector is convolved with itself, consider a vector $X$ with $n$ elements (indexed $x_0, x_1, \ldots, x_{n-1}$). The circular convolution $Y$ (an $n$-dimensional vector) of $X$ with itself ($Y = X \circledast X$) is given by equation 2.1,

$Y = [y_0, y_1, \ldots, y_{n-1}],$  (2.1)

where

$y_j = \sum_{k=0}^{n-1} x_k x_{(j-k)},$  (2.2)

where the subscripts are modulo $n$. For example, given the three-element vector $X$ $(x_0, x_1, x_2)$ shown in Figure 1, the circular convolution $Y$ of $X$ with itself is given by equation 2.3:

$y_0 = x_0 x_0 + x_2 x_1 + x_1 x_2$
$y_1 = x_0 x_1 + x_1 x_0 + x_2 x_2$  (2.3)
$y_2 = x_0 x_2 + x_1 x_1 + x_2 x_0.$

Now suppose that the value of the element $x_1$ is changed by an amount $\Delta x_1$ to give a new value for this element ($x_1'$). It can be seen that the new
Figure 1: Generating a distributed representation $Y$ with components $y_0, y_1, y_2$, produced by the self-convolution of the vector $X$. (Adapted from Plate, 2000.)
value of $x_1$ will affect all three elements of the convolution $Y$, so a change in a single element of an input vector changes many elements of its self-convolution. Such a representation is superposed (from the above figures, it can be seen that many of the $x$ components contribute to each $y$ component) and extended (as each $x$ component contributes to many $y$ components). Using an example taken from the data used by Marcus (1998), consider the even input pattern (from the training set), [0 0 0 0 0 0 0 0 1 0], that when convolved with itself becomes the pattern [0 0 0 0 0 0 1 0 0 0], and also consider the odd input pattern from the extrapolation set, [0 0 0 0 0 0 0 0 1 1], that when convolved with itself becomes [0 0 0 0 0 0 1 2 1 0]. For the original patterns, only one element (the right-most element) has changed value, whereas two elements can be seen to have changed value when comparing the self-convolutions. At the other end of the data set, this effect is more extreme. Consider the even input pattern (from the training
set), [1 1 1 1 1 1 1 1 1 0], and its self-convolution, [8 8 8 8 8 8 8 8 9 8], and also consider the odd input pattern from the extrapolation set, [1 1 1 1 1 1 1 1 1 1], that when convolved with itself becomes [10 10 10 10 10 10 10 10 10 10]. Here, although only one of the elements differs between the two original vectors, all 10 elements are different in their self-convolutions.

3 Simulations
To test the hypothesis that representation has a direct effect on extrapolation, five autoassociative MLPs were constructed, each having 10 inputs and 10 outputs. Each was trained with the scaled conjugate gradients algorithm (Møller, 1993). Five sets of simulations were carried out. One of these used the localist representations favored by Marcus (described above) to train a series of MLPs to perform the identity task. Another used a similar input representation, except that instead of the inputs being either 1/0, they were transformed to −1/1. Another set used random matrix transformation (with a new random matrix being used for each run). The remaining two sets of simulations used either convolution of input vectors with a random vector (with a new random vector being used for each run) or vector self-convolutions as input. In different trials, the number of hidden units and the initial values and ranges of the input-to-hidden and hidden-to-output weights of individual networks were varied. Training was continued until a network attained 100% correct (rounded) outputs on the training set of even numbers or the maximum number (50,000) of epochs was reached. After training finished, the extrapolation set of odd numbers described above was fed through the network (after being suitably recoded). For the self-convolution network (which attempts to match integer outputs), a particular extrapolation example was judged correct if, when all the outputs generated by this case were rounded to the nearest integer, they were identical to the expected extrapolated number. For the random matrix transformation and random vector convolution networks (which attempt to match real outputs), a particular extrapolation example was judged correct if it was the closest (by Euclidean distance) output to that of the expected extrapolated number.

3.1 Results.
The networks using untransformed inputs of 1/0 and −1/1 (unsurprisingly) always gave 0% extrapolation to the set of odd numbers (replicating the results of Marcus discussed above). Initially, the networks using random matrix transformation and convolution with a random vector gave erratic results. For example, on successive runs, the networks would veer from 0% correct extrapolation to 83% correct. It was discovered that both of these transformations are sensitive to the random matrix used and the random vector selected for convolution. If values in the matrix or vector were too large, because the dynamic range of the extrapolation set exceeded that of the training set by a large amount, clipping was occurring at
Table 1: Extrapolation Set Percentage Correct, for the Three Transformations and Networks with Different Numbers of Hidden Units.

Number of Hidden Units   Random Matrix   Random Convolution   Self-Convolution
10                       100             100                  100
 9                       100             100                   37
 8                        44              81                   14
 7                         2              72                    3
 6                         2              44                    0
 5                         0              38                    0
the hidden layer. The asymptotes of the sigmoidal transfer function in the hidden layer were clipping hidden-layer activation values, and this truncation was adversely affecting extrapolation performance. When the random matrix range was reduced to $0.1 \times M$, where $M$ is a matrix of random numbers drawn from a standard normal distribution (centered on zero), it was found that the performance of networks using these transformations improved drastically. The transformation based on convolution with a random vector proved more problematic. Even when the range and nature of the distribution the random vector was chosen from were changed, performance was still erratic. No such problems were encountered using self-convolution, which gave consistently good performance. With this transformation, the dynamic range of the test set varied slightly outside that of the training set, but not enough to encounter clipping. For networks using the three transformations described above, percentage correct values for different network configurations (averaged over 10 runs for each number of hidden units, using different initialization weights and ranges) are shown in Table 1. It can be seen from Table 1 that networks using alternative representations can show 100% correct generalization performance to the extrapolation set consisting of self-convolutions of the odd numbers.

3.2 Analysis. A criticism of these results could be that the (self-convolution) network with 10 hidden units is merely carrying out some form of symbolic copy operation (i.e., copying the value of a particular input unit to a single hidden unit) and then copying the value represented on this hidden unit directly to a single output. For the Marcus representation, Figure 2 shows a Hinton diagram (Hinton et al., 1986) of the input-to-hidden weights of a typical network trained using the original inputs, with 10 hidden units.
It can be seen from this diagram that even using this representation, more than one large weight exists from a particular input to a particular hidden unit. Figure 3 shows a Hinton diagram of the hidden-to-output weights of the same network. From inspection of these two diagrams, it appears that even the network with the original representation is not implementing a symbolic
Figure 2: Hinton diagram of weights from the input-to-hidden layer of the network with the original input representation. White signifies positive weights, black negative. Hidden-layer bias inputs are in the 0th column.
copy operation where each input is mapped uniquely to a particular hidden unit and then to the respective output. Compare this to the Hinton diagrams of input-to-hidden weights (see Figure 4) and hidden-to-output weights (see Figure 5) for a network using the self-convolution transformation. It can be seen from Figures 4 and 5 that this network's input-to-hidden and hidden-to-output weights are definitely not as would be expected if some form of symbolic copy operation were being carried out where each input value was being copied to a single hidden unit, and this value was being copied to a single output. There is not one large weight from each input unit to a corresponding single hidden unit, with all other weights set near zero. It appears from these diagrams that each input has weights of reasonable magnitude to many hidden units, and each hidden unit is being influenced by many inputs.

4 Discussion
The work described in this article has attempted to answer questions 1 and 2 from section 1: How can the extrapolation properties of neural networks
Figure 3: Hinton diagram of weights from the hidden-to-output layer of the network with the original input representation. White signifies positive weights, black negative. Output-layer bias inputs are in the 0th column.
(such as MLPs) be improved? and What is the relationship between input representation and extrapolation? It has demonstrated that the extrapolation properties of neural networks such as MLPs can be improved (at least for the identity task) when the networks are given an appropriate input representation. It seems that previous work has prematurely drawn erroneous conclusions about the inability of such networks to extrapolate, based on the presentation of data with an inappropriate input representation. Using an input representation such as that used by Marcus, with the $n$th unit always set to the same value, is tantamount to telling the network that it exists in an $(n-1)$-dimensional space. It is unsurprising that the network performs badly with an extrapolation data set when the $n$th dimension appears during testing. In alternative representations, such as those described above, the $n$th dimension is active in forming the representation of the $n-1$ other dimensions. This suggests that for good extrapolation, the transform used must be such that any input value that varies in the transformed test set must also vary in the transformed training set. However, it is not true that the transform used must be such that the input space of the transformed training set spans the input
Figure 4: Hinton diagram of weights from the input-to-hidden layer of the network with the self-convolved input representation. White signifies positive weights, black negative. Hidden-layer bias inputs are in the 0th column.
space of the transformed test set. Given the results for the three representations discussed above, where the test set dynamic ranges exceeded those of the training sets, it can be concluded that good extrapolation performance can be obtained if the dynamic range of the test set is not too far outside that of the training set. Defining what "too far" is when discussing test set range and the dangers of clipping by the hidden-layer transfer function will be the subject of further research. One concept discussed by Marcus (1998) (following work by Touretzky, 1991) is that of training independence, which has two components. Input independence means that when using a learning algorithm such as backpropagation, the amount that a given connection weight from unit $i$ changes is determined by the algorithm to be dependent on the activation level $x_i$ (so that whenever $x_i$ is zero, there is no change in the weight). Output independence means that output units are also trained independently of one another, because in training algorithms such as backpropagation, the weights feeding a particular output unit are changed independently of the activation levels of other outputs.
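Input independence can be verified directly: under backpropagation with squared error, the gradient for any weight leaving input unit $i$ carries a factor of $x_i$, so an input clamped to zero receives exactly zero weight update. A minimal sketch (the 10-4-10 architecture and random initialization are illustrative assumptions, not the configuration used in the simulations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 10))   # input -> hidden weights
W2 = rng.standard_normal((10, 4))   # hidden -> output weights

x = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0], dtype=float)  # parity bit x[9] is 0

# Forward pass and backpropagated squared-error gradient for an autoassociator.
h = sigmoid(W1 @ x)
y = sigmoid(W2 @ h)
delta_out = (y - x) * y * (1 - y)              # error signal at the output layer
delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error signal at the hidden layer
grad_W1 = np.outer(delta_hid, x)               # dE/dW1 = delta_hid x^T

# Every weight leaving the always-zero input unit receives a zero gradient.
print(float(np.abs(grad_W1[:, 9]).max()))  # 0.0
```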
Figure 5: Hinton diagram of weights from the hidden-to-output layer of the network with the self-convolved input representation. White signifies positive weights, black negative. Output-layer bias inputs are in the 0th column.
Therefore, one way of viewing MLPs trained with such algorithms is that they consist of a set of independent classifiers (Touretzky, 1991; Marcus, 1998); each output unit in such a network computes its own classification function. This is obviously the case in the network trained to model the identity function using the Marcus representation, as each input is independent of the other inputs, and each output is independent of the other outputs. An input that is always set to zero in the training phase will be ignored and its output classifier always trained to produce a zero, regardless of the values of the other outputs. However, with the autoassociator using alternative representations, the situation is different. Inputs are dependent on one another, as a change in one input also changes some (or all) of the others. Outputs are also interdependent in a similar fashion. When a novel item from the extrapolation set is presented, it affects many inputs and therefore many outputs. One question that can be asked is, Does the new representation merely code the extrapolation problem as an interpolation problem? This is partially true, as (for example) the circular convolution procedure converts the original inputs and outputs to integer-valued vectors, where
there is no explicit representation of the concepts odd or even, and extrapolation between these concepts actually becomes interpolation on a series of these vectors. Given a problem, one can always find a bad way of representing it to an MLP (as Marcus did) or code it in a form where it can be solved by such a network (even if this loses the explicit representation of the original function). In fact, the problem was not just one of interpolation of transformed vectors. Of the 512 convolved extrapolation test patterns, 503 lay within the dynamic range of the convolved training set. However, nine lay outside the dynamic range of the convolved training set and so can be considered true extrapolations. All nine of these were extrapolated correctly. Future experiments should investigate extrapolation by using representational transformations where a higher proportion (or all) of the transformed extrapolation set lies outside the dynamic range of the training set. Although the transformations discussed above are convenient ways of producing distributed representations with the aim of enhancing the extrapolation ability of a connectionist system, there are many others, and the ones given above may not always be optimal. For example, consider the XOR training set and its self-convolutions: [0, 0] → [0, 0], [1, 0] → [1, 0], [0, 1] → [1, 0], [1, 1] → [2, 2]. As both [0, 1] and [1, 0] are being mapped to the same convolved input, this would not be a suitable representation for a training set if the task involved discrimination between these two vectors. However, in the training and extrapolation sets for the 10-bit identity task described above, there were no identical vectors, and self-convolution proved an appropriate way of generating distributed representations. Alternative transforms suitable for some extrapolation tasks may have to be found.
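The worked self-convolutions above, the XOR collision, and the count of nine true extrapolations can all be checked with a short script (a sketch, assuming the bit-vector encoding quoted in the examples and interpreting "dynamic range" per component):

```python
import numpy as np

def self_convolve(x):
    """Circular convolution of a vector with itself: y_j = sum_k x_k * x_{(j-k) mod n}."""
    n = len(x)
    return [sum(x[k] * x[(j - k) % n] for k in range(n)) for j in range(n)]

def to_bits(n, width=10):
    """Encode n as a 10-element bit vector, most significant bit first."""
    return [(n >> (width - 1 - i)) & 1 for i in range(width)]

# Worked examples from the text:
print(self_convolve([0, 0, 0, 0, 0, 0, 0, 0, 1, 1]))  # [0, 0, 0, 0, 0, 0, 1, 2, 1, 0]
print(self_convolve([1, 1, 1, 1, 1, 1, 1, 1, 1, 0]))  # [8, 8, 8, 8, 8, 8, 8, 8, 9, 8]

# The XOR collision: [0, 1] and [1, 0] share the same self-convolution.
print(self_convolve([0, 1]), self_convolve([1, 0]))   # [1, 0] [1, 0]

# True extrapolations: odd self-convolutions falling outside the per-component
# dynamic range of the convolved even (training) set.
train = np.array([self_convolve(to_bits(n)) for n in range(0, 1024, 2)])
test = np.array([self_convolve(to_bits(n)) for n in range(1, 1024, 2)])
lo, hi = train.min(axis=0), train.max(axis=0)
print(int(np.sum(np.any((test < lo) | (test > hi), axis=1))))  # 9
```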
Regarding question 3 in section 1—How can neural networks be manipulated to give good models of human extrapolation performance?—no claims are made in this article that concepts are represented as self-convolutions in the human brain. However, it is also unlikely that concepts such as "odd number" and "even number" are represented by their corresponding binary representations in the brain. The main demonstration of this work has been that the type of representation used affects the extrapolation ability of MLPs. Further work is needed to determine if cognitive systems (including the human brain and artificial neural systems) can use representational transformation to change hard tasks into easier ones. Of course, the question could be asked, "You have demonstrated extrapolation is possible when data with one missing dimension are recoded using a distributed representation. What happens when the data have higher dimensionality or more than one dimension is missing?" This is a pertinent question, and further work will examine extrapolation for tasks with more than one dimension "hidden" from the network during training. Another interesting approach would be to look at the extrapolation properties of sparse distributed representations, such as those produced by context-dependent thinning (Rachkovskij & Kussul, 2001).
Other researchers have compared the performance of human subjects with connectionist systems when required to extrapolate mathematical functions, such as linear, quadratic, and exponential functions (Busemeyer, McDaniel, & Byun, 1997; DeLosh, Busemeyer, & McDaniel, 1997; Erickson & Kruschke, 2001, in press), and found purely connectionist explanations of these tasks untenable. In future work, it will prove interesting to observe the effect of alternative input representations on modeling these tasks. In this article, the definition of extrapolation used is "generalization performance to novel inputs lying outside the dynamic range of the data that the network was trained on," as this seemed a natural definition to use when tackling the Marcus data set (i.e., a binary data set with rectangular convex hulls). However, a more restrictive definition of extrapolation exists: "generalization to novel inputs lying outside the convex hull of the training data." This more rigorous definition should be applied when using other data types, as a value can be within the dynamic range of the inputs but still be an extrapolation. This is often described as hidden extrapolation and can be illustrated by a simple example in two dimensions: a point may have coordinates that are each less than the maximal coordinate observed in that dimension (hence falling within the dynamic range of the data), yet because both coordinates are simultaneously high, the point lies outside the convex hull of the training data and so is in fact an extrapolation. No points are made in this article regarding the adequacy of eliminative connectionism (as discussed by Marcus). Indeed, we have shown elsewhere how connectionist variable binding (Browne & Sun, 1999) and inference (Browne & Sun, 2001) can be accomplished using the distributed representations formed on the hidden layers of MLPs.
However, suggestions by other researchers, such as that localist representations are adequate for psychological modeling (Page, 2000), may have to be reviewed in response to the extrapolation results described in this article.

Acknowledgments
I thank the two anonymous reviewers of this article for their insightful comments and Dmitri Rachkovskij for the code for performing CDT and related discussion.

References

Browne, A. (1998a). Detecting systematic structure in distributed representations. Neural Networks, 11(5), 815–824.

Browne, A. (1998b). Performing a symbolic inference step on distributed representations. Neurocomputing, 19, 23–34.

Browne, A., & Sun, R. (1999). Connectionist variable binding. Expert Systems: The International Journal of Knowledge Engineering and Neural Networks, 16(3), 189–207.
Browne, A., & Sun, R. (2001). Connectionist inference models. Neural Networks, 14(10), 1331–1355.

Busemeyer, J., McDaniel, M. A., & Byun, E. (1997). The abstraction of intervening concepts from experience with multiple input-multiple output causal environments. Cognitive Psychology, 32, 1–48.

DeLosh, E. L., Busemeyer, J. B., & McDaniel, M. A. (1997). Extrapolation: The sine qua non for abstraction in function learning. Journal of Experimental Psychology, 23(4), 968–986.

Ensley, D., & Nelson, D. E. (1992). Extrapolation of Mackey-Glass data using cascade correlation. Simulation, 58(5), 333–339.

Erickson, M. A., & Kruschke, J. K. (2001). Multiple representations in inductive category learning: Evidence of stimulus- and time-dependent representation. Manuscript submitted for publication.

Erickson, M. A., & Kruschke, J. K. (in press). Rule-based extrapolation in perceptual categorization. Psychonomic Bulletin and Review.

Hanks, P., McLeod, W., & Urdang, L. (1986). Collins dictionary of the English language. London: Collins.

Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46, 47–75.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.

Marcus, G. F. (1998). Rethinking eliminative connectionism. Cognitive Psychology, 37, 243–282.

Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525–533.

Niklasson, L., & van Gelder, T. (1994). Can connectionist models exhibit nonclassical structure sensitivity? In Proceedings of the Cognitive Science Society (pp. 664–669). Hillsdale, NJ: Erlbaum.

Page, M. (2000). Connectionist modeling in psychology: A localist manifesto. Behavioral and Brain Sciences, 23(4), 443–512.

Plate, T. (1994). Distributed representations and nested compositional structure.
Unpublished doctoral dissertation, University of Toronto.

Plate, T. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 623–641.

Plate, T. (2000). Analogy retrieval and processing with distributed vector representations. Expert Systems: The International Journal of Knowledge Engineering and Neural Networks [Special issue], 17(1), 29–40.

Rachkovskij, D. A., & Kussul, E. M. (2001). Binding and normalization of binary sparse distributed representations by context-dependent thinning. Neural Computation, 13(2), 411–452.

Rosenfeld, R., & Touretzky, D. (1988). Coarse coded symbol memories and their properties. Complex Systems, 2, 463–484.

Schönemann, P. H. (1987). Some algebraic relations between involutions, convolutions, and correlations, with application to holographic memories. Biological Cybernetics, 56, 367–374.
Sharkey, N. (1992). The ghost in the hybrid—a study of uniquely connectionist representations. AISB Quarterly, 79, 10–16.

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46, 159–216.

Thornton, C. (1996). Parity, the problem that won't go away. In G. M. Calla (Ed.), Proceedings of AI 96 (pp. 362–374). Toronto: Springer-Verlag.

Touretzky, D. S. (1991). Connectionism and compositional semantics. In J. A. Barnden & J. B. Pollack (Eds.), High-level connectionist models (pp. 17–31). Hillsdale, NJ: Erlbaum.

van Gelder, T. (1991). What is the "D" in "PDP"? A survey of the concept of distribution. In W. Ramsey, S. Stich, & D. E. Rumelhart (Eds.), Philosophy and connectionist theory (pp. 33–60). Hillsdale, NJ: Erlbaum.

Received December 11, 2000; accepted November 20, 2001.
LETTER
Communicated by Gary Cottrell
Using Noise to Compute Error Surfaces in Connectionist Networks: A Novel Means of Reducing Catastrophic Forgetting

Robert M. French
[email protected]
Quantitative Psychology and Cognitive Science, Psychology Department, University of Liège, 4000 Liège, Belgium

Nick Chater
[email protected]
Institute for Applied Cognitive Science, Department of Psychology, University of Warwick, Coventry CV4 7AL, U.K.

In error-driven distributed feedforward networks, new information typically interferes, sometimes severely, with previously learned information. We show how noise can be used to approximate the error surface of previously learned information. By combining this approximated error surface with the error surface associated with the new information to be learned, the network's retention of previously learned items can be improved and catastrophic interference significantly reduced. Further, we show that the noise-generated error surface is produced using only first-derivative information and without recourse to any explicit error information.

1 Introduction
Everyone forgets but, thankfully, it is typically a gradual process. Neural networks, on the other hand, and especially those that develop highly distributed representations over a single set of weights, can suffer from severe and sudden forgetting. Almost all of the early solutions to this problem, called the problem of "catastrophic forgetting," relied on learning algorithms that reduced the overlap of the network's internal representations (see French, 1999, for a review) by making these representations sparser. This had the desired effect of reducing interference, with the obvious trade-off being a decrease in the network's ability to generalize. A significantly different approach, one that made use of noise, was developed by Robins (1995). The idea was as follows. When a network that had previously learned a set of patterns had to learn a new set of patterns, a series of random patterns (i.e., noise) was input into the network and the associated output was collected, producing a series of pseudopatterns. These pseudopatterns, which reflected the previously learned patterns, were then interleaved with the new patterns to be learned. This effectively decreased catastrophic forgetting of the originally learned patterns. The use of pseudopatterns will serve as the starting point for the algorithm developed in this article. Unlike Robins's algorithm, however, we will use pseudopatterns to directly approximate the error surface associated with the original patterns. This approximated error surface will then be combined with the error surface associated with the new patterns, and gradient descent will be performed on the combined error surface. This will be shown to improve the network's performance on catastrophic forgetting significantly.

Neural Computation 14, 1755–1769 (2002). © 2002 Massachusetts Institute of Technology.

2 Measures of Forgetting
There are two standard measures of forgetting in connectionist models, both related to standard psychological measures. The first is a simple error measurement. Suppose a first set of patterns $\{P_i : I_i \to O_i\}_{i=1}^{N}$ has been learned to criterion by a network. A new set $\{Q_i : I_i \to O_i\}_{i=1}^{M}$ is then learned to criterion. We measure the amount of network error produced by each of the patterns in the first set. The second widely used measure of forgetting is an Ebbinghaus "savings" measure, first applied to neural networks by Hetherington and Seidenberg (1989). After learning $\{P_i\}_{i=1}^{N}$ and then $\{Q_i\}_{i=1}^{M}$, we measure the number of epochs required to retrain the network to criterion on the initial training set $\{P_i\}_{i=1}^{N}$. The faster the relearning, the less forgetting that is judged to have occurred. We will use both of these measures in the discussion that follows.

3 Overview of Hessian Pseudopattern Backpropagation (HPBP)
Any given set of patterns $\{P_i : I_i \to O_i\}_{i=1}^{N}$ has an associated error surface, $E(w)$, defined over the network's weights. This means that for each possible combination of values of the network's weights, there will be an overall error associated with the set of patterns (usually the sum of the squared errors produced by each individual pattern, $P_i$). Learning the set of patterns $\{P_i\}_{i=1}^{N}$ is equivalent to the network's finding a minimum of this error surface; call this minimum $w_0$. When a new pattern, $P_{new}$, is presented to the system, the original error surface $E(w)$ changes to $E_{P_{new}}(w)$. (For simplicity, we will discuss only the case where a single new pattern is presented to the network, but the argument is identical for any number of new patterns.) In general, $w_0$, a minimum of the original error surface $E(w)$, will no longer be a minimum of $E_{P_{new}}(w)$. In other words, the network "forgets" the original error surface $E(w)$. What we need is some way for the network to approximate $E(w)$ in the absence of the original patterns. We could then create a new overall error surface that would reflect both $E(w)$ and $E_{P_{new}}(w)$. We do this by taking a weighted sum of the approximation of $E(w)$, which we will call $\hat{E}(w)$, and
Using Noise to Compute Error Surfaces
$E_{P_{new}}(w)$. Our weight change algorithm will then be gradient descent on this combined error surface. In what follows, we will develop the mathematics of HPBP and demonstrate the algorithm by means of two simple simulations on empirical data: one in which we sequentially learn two sets of patterns to criterion, the other in which the network is presented with a series of patterns, each of which is learned to criterion before the presentation of the next pattern.

4 Noise and the Calculation of an Error Surface
Assume, as above, that $E(w)$ is the unique error surface defined by a set of real input-output patterns $\{P_i : I_i \to O_i\}_{i=1}^{N}$ learned by the network. The network's having learned these patterns means it has discovered a local minimum $w_0$ in weight space for which $E'(w_0) = 0$, where $E'(w)$ represents the first derivative of the error function. If the function $f$ underlying the original set of patterns is relatively "nice" (continuous, reasonably smooth, and so forth), then a set of pseudopatterns $\{\psi_i : I_i \to O_i\}_{i=1}^{M}$ whose input values are drawn from a flat random distribution will produce a reasonable approximation of $f$. (See French, Ans, & Rousset, 2001, for a discussion of how this approximation could be improved by additionally making use of the values of the output associated with each random input, or Ans & Rousset, 2000, for a technique that produces an "attractor" input pattern from uniform random input. In the present case, however, we simply use flat random input to produce the pseudopatterns.) Just as the original set of patterns $\{P_i\}_{i=1}^{N}$ had a unique error surface associated with it, so does the set of pseudopatterns $\{\psi_i\}_{i=1}^{M}$. For this latter error surface, $E_\psi(w)$, it follows from the definition of pseudopatterns that $E_\psi(w_0) = 0$ and $E_\psi'(w_0) = 0$. The question is how we can produce this approximation of the original error surface in the vicinity of $w_0$ (assuming that the original patterns $\{P_i\}_{i=1}^{N}$ are no longer available). We know that for the original error surface $E(w)$, $E'(w_0) = 0$. While this tells us that $w_0$ is a local minimum of $E$, it does not provide any information about the shape (in particular, the steepness) of $E(w)$ around $w_0$, which is what we want. For this, we need the higher derivatives of $E(w)$, which, unlike the first derivative, do not disappear when evaluated at $w_0$. Using this steepness and the local minimum information, we reconstruct the desired approximation of the original error surface by means of a truncated Taylor series. (For other techniques using the higher-order derivatives to improve backpropagation, see, for example, Bishop, 1991, and Becker & Le Cun, 1988, among others.) Somewhat counterintuitively, approximating the original error surface using noise does not require any explicit error information; noise moving through the system is sufficient for the calculation. Thus, unlike other techniques that make use of pseudopatterns that require the system to learn a
mixture of pseudopatterns and new patterns (Robins, 1995; French, 1997; Ans & Rousset, 1997, 2000), here noise is simply sent through the system, and this alone allows us to approximate the error surface around $w_0$. This approximated error surface, combined with the error surface of the new patterns to be learned, produces an overall error surface on which gradient descent will be performed. The details of this calculation follow.

5 Hessian Pseudopattern Backpropagation (HPBP)
Assume that the network has already stored a number of patterns and has found a point $w_0$ in weight space for which all the previously learned patterns have been learned to criterion. Further assume that we are using the standard quadratic error function,

$$E = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{N_{outputs}} (y_i^p - t_i^p)^2, \tag{5.1}$$

where $P$ is the number of patterns, $N_{outputs}$ is the number of output units in the network, $y_i^p$ is the output of the $i$th output node for the $p$th pattern, and $t_i^p$ is the teacher for the $i$th output node for the $p$th pattern. $E$ being a continuous, everywhere differentiable function, it has a Taylor series expansion about $w_0$, which we can write as follows:

$$E(w) = E(w_0) + E'(w_0)(w - w_0) + \frac{1}{2!}(w - w_0)^T H|_{w_0} (w - w_0) + \cdots, \tag{5.2}$$

where $w$ is a point in weight space, $w_0$ is the point in weight space at which the network has arrived after learning the original patterns, $E'(w_0)$ is the gradient of $E$ evaluated at $w_0$, and $H|_{w_0}$ is the Hessian matrix of second partial derivatives of $E$ evaluated at $w_0$. For values of $w$ sufficiently close to $w_0$, we will assume that we can truncate the Taylor series after the second-order term. Since the network is at $w_0$ after having learned the original patterns, $w_0$ is a local minimum of the error surface and, consequently, $E'(w_0)$ is 0. We can therefore write the truncated Taylor series approximation of the error surface corresponding to the originally learned set of patterns as

$$\hat{E}(w) \approx E(w_0) + \frac{1}{2!}(w - w_0)^T H|_{w_0} (w - w_0). \tag{5.3}$$
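As a concrete illustration (not part of the original article), equation 5.3 can be turned into a small numerical sketch; the function and variable names below are invented for the example:

```python
import numpy as np

def quadratic_approx(E0, w0, H):
    """Return E_hat(w) = E(w0) + 0.5 (w - w0)^T H (w - w0), the truncated
    Taylor expansion of equation 5.3 (the gradient term vanishes because
    w0 is a local minimum)."""
    def E_hat(w):
        d = np.asarray(w) - w0
        return E0 + 0.5 * d @ H @ d
    return E_hat

# Toy check: at w0 the approximation returns E(w0), and it grows
# quadratically along the Hessian's eigendirections.
w0 = np.array([1.0, -2.0])
H = np.array([[2.0, 0.0], [0.0, 8.0]])
E_hat = quadratic_approx(0.0, w0, H)
print(E_hat(w0))                   # 0.0
print(E_hat(w0 + [1.0, 0.0]))      # 0.5 * 2 = 1.0
```

Here the Hessian is given directly; in HPBP it is the matrix computed from pseudopatterns in section 6.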
Now assume that a new pattern, $P$, is presented to the network. This pattern induces an error surface, $E_P(w)$ (as mentioned above, the argument is the same for a set of new patterns).
Let $E(w) = a\hat{E}(w) + E_P(w)$, where the constant $a$ is a weighting factor. The standard delta rule gives

$$\Delta w = -\eta \frac{\partial E}{\partial w}, \quad \text{where} \quad \frac{\partial E}{\partial w} = a\frac{\partial \hat{E}}{\partial w} + \frac{\partial E_P}{\partial w}.$$

But from equation 5.3 we have

$$\frac{\partial \hat{E}(w)}{\partial w} = H|_{w_0}(w - w_0). \tag{5.4}$$

The weight change rule will therefore be

$$\Delta w = -\eta \left[ a H|_{w_0}(w - w_0) + \frac{\partial E_P}{\partial w} \right], \tag{5.5}$$
where $\eta$ is the learning rate and $a$ is the weighting factor of the prior approximated error surface. We will now show how noise allows us to calculate $H|_{w_0}$.

6 Noise and the Calculation of $H|_{w_0}$
For each pseudopattern, the teacher and the output will, by definition, be the same. In other words,

$$\forall \psi \in \Psi \;\; \forall n: \;\; (y_n^\psi - t_n^\psi) = 0, \tag{6.1}$$

where $\Psi$ is the set of all pseudopatterns, $y_n^\psi$ is the output from the $n$th output node of the network, and $t_n^\psi$ is the teacher for the $n$th output node of the network. The Hessian matrix evaluated at $w_0$ is defined as follows:

$$H|_{w_0} = \begin{bmatrix} \dfrac{\partial^2 E}{\partial w_1 \partial w_1} & \cdots & \dfrac{\partial^2 E}{\partial w_1 \partial w_N} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 E}{\partial w_N \partial w_1} & \cdots & \dfrac{\partial^2 E}{\partial w_N \partial w_N} \end{bmatrix}_{w_0},$$

where $w_0$ is a solution for the originally learned set of patterns. Consider the $\langle i,j \rangle$th term of $H$:

$$H_{ij} = \frac{\partial^2 E}{\partial w_i \partial w_j}.$$

We begin with the error function for the pseudopatterns $\psi_1, \psi_2, \ldots, \psi_{N_\Psi}$, where $N_\Psi$ is the number of pseudopatterns that will be used to calculate the error surface:

$$E = \frac{1}{2} \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} (y_n^p - t_n^p)^2,$$
where $N_\Psi$ is the number of pseudopatterns, $N_{outputs}$ is the number of output units of the network, $y_n^p$ is the output of the $n$th output unit for the $p$th pseudopattern, and $t_n^p$ is the teacher for the $n$th output unit for the $p$th pseudopattern. The second partial derivatives of $E$ are calculated as follows:

$$\frac{\partial^2 E}{\partial w_i \partial w_j} = \frac{\partial}{\partial w_i} \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} (y_n^p - t_n^p)\frac{\partial y_n^p}{\partial w_j} = \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} \left( \frac{\partial y_n^p}{\partial w_i}\frac{\partial y_n^p}{\partial w_j} + (y_n^p - t_n^p)\frac{\partial^2 y_n^p}{\partial w_i \partial w_j} \right).$$

But we know from equation 6.1 that for pseudopatterns,

$$\forall p \;\; \forall n: \;\; (y_n^p - t_n^p) = 0,$$

and, therefore, the second term above is zero, giving

$$\frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} \frac{\partial y_n^p}{\partial w_i}\frac{\partial y_n^p}{\partial w_j}. \tag{6.2}$$
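The construction in equation 6.2 can be sketched numerically. In the illustrative fragment below (not from the original article), finite differences stand in for the backpropagated first derivatives, and the Hessian is accumulated as a sum of outer products of output gradients over flat random pseudopattern inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def net_output(w, x, n_in, n_hid):
    """Tiny n_in - n_hid - 1 feedforward net; w is a flat weight vector."""
    W1 = w[: n_in * n_hid].reshape(n_hid, n_in)
    W2 = w[n_in * n_hid:].reshape(1, n_hid)
    return sigmoid(W2 @ sigmoid(W1 @ x))[0]

def pseudo_hessian(w, inputs, n_in, n_hid, eps=1e-5):
    """Equation 6.2: H_ij = sum over pseudopatterns and outputs of
    (dy/dw_i)(dy/dw_j).  First derivatives are taken numerically here;
    in HPBP they come from a backward pass.  The residual term vanishes
    by equation 6.1, so no error information is needed."""
    H = np.zeros((w.size, w.size))
    for x in inputs:
        g = np.zeros(w.size)
        for i in range(w.size):
            wp, wm = w.copy(), w.copy()
            wp[i] += eps
            wm[i] -= eps
            g[i] = (net_output(wp, x, n_in, n_hid)
                    - net_output(wm, x, n_in, n_hid)) / (2 * eps)
        H += np.outer(g, g)   # accumulate first-derivative products
    return H

rng = np.random.default_rng(0)
n_in, n_hid = 3, 2
w0 = rng.normal(size=n_in * n_hid + n_hid)
pseudo_inputs = rng.random((25, n_in))     # flat random inputs
H = pseudo_hessian(w0, pseudo_inputs, n_in, n_hid)
print(np.allclose(H, H.T))   # True: symmetric by construction
```

Because each contribution is an outer product of a gradient with itself, the resulting matrix is symmetric and positive semidefinite.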
(The precise terms of the pseudopattern-induced Hessian matrix are given in the appendix.) Interestingly, only first-derivative information is required in this pseudopattern-induced Hessian, which means that the complexity of this calculation is $O(N^2)$, where $N$ is the number of weights in the network. In short, from equations 5.3 and 6.2, we conclude that noise passing through the network is sufficient to approximate the error surface for the original patterns close to $w_0$. The pseudocode for the HPBP algorithm is shown below. We assume that the network has already learned a set of patterns, $P = \{P_i\}_{i=1}^{N}$, and is at a local minimum $w_0$ in weight space. The network must then learn a new data set, $Q = \{Q_i\}_{i=1}^{M}$. To create the Hessian, we use $R$ pseudopatterns:

Initialize the Hessian to 0.
Set network activation values to 0.
Hessian Loop:
    Put a random input vector through the network to produce a pseudopattern;
    Use these activation levels and network weight values to create a matrix of second-derivative values to be added to the Hessian $H|_{w_0}$;
    Exit Hessian Loop after R pseudopatterns;
Training Loop:
    For each pattern in Q, do:
        Feedforward pass;
        Error backpropagation, changing the weights according to equation 5.5, including momentum;
    When all patterns in Q are learned to criterion, exit Training Loop;
Test errors for all patterns in P.

7 Simulations
In order to show that the HPBP algorithm works, we performed two simulations, the first involving catastrophic forgetting and the second involving sequential learning.

7.1 Simulation 1: Catastrophic Forgetting. We created two sets of four patterns, $P$ and $Q$. The two sets were intentionally designed to interfere maximally with one another (even though a network would have been able to learn all patterns in the combined set $P \cup Q$). The network was trained first on $P$ and then on $Q$. Once it had learned $Q$ to criterion, we tested it to see how well it had remembered $P$. An 8-32-1 network was used for both BP and HPBP networks, with learning rate 0.01, momentum 0.9, Fahlman offset 0.1, and a maximum weight-change step size of 0.075. For the HPBP network, we used 100 pseudopatterns, and because we wanted to give more weight to the approximated error surface associated with past learning, we set its weighting factor to 8. All results were averaged over 100 runs of the program.
7.1.1 Results. After learning the first set of patterns $P$, then $Q$, the standard backpropagation network produced an average error over all items in $P$ of 0.80. (Thus, as intended, interference with the items in $P$ by the items in $Q$ was extremely severe.) By contrast, the HPBP network produced an average error for these previously learned items of only 0.38. Further, the HPBP network correctly generalized on 67.5% of the previously learned items, whereas the backpropagation network was able to generalize correctly on only 10.25% of the items in $P$. (See Figures 1a and 1b.) In addition, we measured the number of epochs required for both networks to relearn $P$. Not surprisingly, HPBP also relearned $P$ to criterion in 42% fewer epochs than the BP network.
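The interference setup and the two measures of forgetting from section 2 can be illustrated at toy scale. The sketch below is not the original 8-32-1 simulation: a single layer of sigmoid units and invented two-pattern sets stand in for it, and all names are illustrative. It learns a set P, learns a maximally interfering set Q, and then reports the error measure and the savings measure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(w, X, T, lr=0.5, criterion=0.2, max_epochs=5000):
    """Gradient descent on the quadratic error until every pattern's
    output error is below criterion; returns weights and epochs used."""
    for epoch in range(1, max_epochs + 1):
        y = sigmoid(X @ w)
        w = w - lr * X.T @ ((y - T) * y * (1 - y))
        if np.max(np.abs(sigmoid(X @ w) - T)) < criterion:
            return w, epoch
    return w, max_epochs

# Two deliberately interfering sets: identical inputs, opposite targets.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
P_targets = np.array([1.0, 0.0])
Q_targets = np.array([0.0, 1.0])

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=2)

w, _ = train(w, X, P_targets)                      # learn P to criterion
err_P_before = np.mean(np.abs(sigmoid(X @ w) - P_targets))
w, _ = train(w, X, Q_targets)                      # learning Q erases P
err_P_after = np.mean(np.abs(sigmoid(X @ w) - P_targets))  # error measure
w, relearn_epochs = train(w, X, P_targets)         # savings measure
print(err_P_before < err_P_after)   # True: severe interference
```

Plain BP on this toy problem forgets P completely; HPBP adds the term $aH|_{w_0}(w - w_0)$ of equation 5.5 to each update to resist exactly this kind of erasure.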
1762
Robert M. French and Nick Chater
Figure 1: Catastrophic interference is significantly reduced with the HPBP algorithm. (a) Errors on the originally learned patterns in set P for BP and HPBP after learning set Q. (b) Correct generalization for the originally learned items for BP and HPBP after learning Q.
Although much work clearly remains to be done on this type of algorithm, we believe that these early results demonstrate that the HPBP algorithm can be very effective in reducing catastrophic interference.

7.2 Simulation 2: Sequential Learning. In order to test the HPBP algorithm further on a sequential learning task drawn from a real-world database, we selected the 1984 Congressional Voting Records database from the UCI repository (Murphy & Aha, 1992). Twenty members of Congress (10 Republicans, 10 Democrats, each "pattern" being defined as a yes-no voting record on 16 issues together with a party affiliation) were chosen randomly from the database and were learned sequentially by the network (each pattern was learned to criterion before the next pattern was presented). The network was then tested with both standard measures of forgetting on each of the 20 patterns. Both BP and HPBP algorithms used a 16-3-1 feedforward backpropagation network with a learning rate of 0.01, momentum of 0.9, a Fahlman offset of 0.1, and a maximum weight step of 0.09. For the HPBP network, the weighting parameter associated with the approximation of the original error surface was set to 3, and each time a new pattern was to be learned, 25 pseudopatterns were generated to calculate the Hessian and thereby produce the approximation of the prior error surface. Specifically, the network learned the first pattern, $P_1$, until the difference between target and output for the pattern was below 0.2. Then 25 pseudopatterns were generated, and the associated error surface was produced. The second pattern, $P_2$, was then presented
Figure 2: Original learning of the 20 items. It is somewhat harder for the Hessian pseudopattern network to learn the first few patterns, given the "inertial" effect of $E(w)$. Standard error bars show the evolution of the variance for both algorithms.
to the network. The new error surface induced by $P_2$ was combined with the previously approximated error surface, and gradient descent was performed on this combined surface until the network had learned $P_2$. Then 25 new pseudopatterns were generated to produce an approximation of this error surface. $P_3$ was then presented to the network, and so on.

7.2.1 Results. First, we considered the extent to which the addition of the approximated error surface made the initial learning more difficult. Second, once the network had sequentially learned all 20 patterns, we measured the error for each of the previously learned patterns (the error measure of forgetting). Finally, we examined how difficult it was to relearn the original patterns (the savings measure of forgetting described above). All of the data reported were averaged over 100 runs of each algorithm. The order of presentation of the patterns and the initial weights of the networks were randomized at the beginning of each run.

7.2.2 Original Learning. Figure 2 shows that, on average, it is more difficult for HPBP to learn the first few patterns. Presumably, this is because before any learning of the patterns has occurred, $E(w)$ defines an error surface that is quite unlike the error surface associated with any of the 20 patterns to be learned (because the network weights are initialized randomly). However, the average number of epochs required for learning the items with HPBP soon converges to that of BP.

7.2.3 Error Measure of Forgetting. We computed average and median error scores for both HPBP and BP for each item after each network had sequentially learned all 20 items. The influence of the noise-computed error surface in reducing these errors can be seen in Figures 3 and 4. All results
Figure 3: Average error measures for the 20 sequentially learned patterns. The average overall error for HPBP is 0.05 better than for standard BP. The learning criterion is 0.2.
Figure 4: Median errors for BP and HPBP for all items after sequential learning of the original patterns. For HPBP, all 20 patterns remain at or below the 0.2 learning criterion; only 9 of the 20 items are at or below criterion for BP.
were averaged over 100 runs. Standard error bars are shown for the values produced by each algorithm.

7.2.4 Savings Measure of Forgetting. The most striking improvement with respect to standard BP can be seen in the average relearning time for each of the individual items (see Figures 5 and 6). First, all 20 patterns are learned to criterion one after the other by the network. After the twentieth pattern has been learned, the network's weights, $w_{final}$, are saved. Then each of the first 19 items is tested to see how many epochs the network requires to relearn it. Specifically, the first pattern in the list is given to the network, and it relearns that pattern to criterion, the number of epochs required for relearning being recorded. The network weights are then reset to $w_{final}$. The
Figure 5: Average relearning times for all 20 patterns after the network has learned them sequentially. Standard error bars are shown for the means for each pattern in the sequence.
Figure 6: The proportion of patterns requiring no relearning is significantly higher for HPBP than for BP.
second pattern in the list is then relearned to criterion by the network and the number of epochs required to do so is recorded, and so on. Overall, relearning is about 45% faster for the patterns learned by the HPBP network compared to BP (30.8 epochs for BP versus 16.9 epochs for HPBP). In short, with HPBP, there is still forgetting, but it is shallower forgetting (i.e., relearning of the previously learned patterns is significantly easier for HPBP than for BP).
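The relearning protocol just described can be sketched as follows. The `net` object, with its `error`, `train_step`, and `weights` members, is a hypothetical stand-in for the trained network, and the `ToyNet` stub exists only to make the sketch runnable:

```python
import copy

def savings_profile(net, patterns, criterion=0.2):
    """Relearning protocol of section 7.2.4: starting from the weights
    reached after sequential learning (w_final), relearn each pattern to
    criterion, record the epochs needed, then reset to w_final before
    moving on to the next pattern."""
    w_final = copy.deepcopy(net.weights)
    epochs_per_pattern = []
    for p in patterns:
        epochs = 0
        while net.error(p) > criterion:
            net.train_step(p)
            epochs += 1
        epochs_per_pattern.append(epochs)
        net.weights = copy.deepcopy(w_final)   # reset before next pattern
    return epochs_per_pattern

class ToyNet:
    """Minimal stand-in: the 'error' on pattern p halves per train step."""
    def __init__(self):
        self.weights = {"err": {0: 0.9, 1: 0.25, 2: 0.1}}
    def error(self, p):
        return self.weights["err"][p]
    def train_step(self, p):
        self.weights["err"][p] *= 0.5

profile = savings_profile(ToyNet(), [0, 1, 2])
print(profile)   # [3, 1, 0]: a pattern already below criterion needs 0 epochs
```

The fewer epochs a pattern needs, the more savings it shows; figure 6 plots the proportion of patterns whose entry in this profile is zero.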
8 Discussion
It is clear that HPBP shows improved forgetting performance compared to standard backpropagation. However, there are a number of issues concerning the generality and computational complexity of this algorithm that must be addressed. We will briefly discuss the quality of the approximation of $E(w)$ by $\hat{E}(w)$ and whether HPBP would scale to larger networks. The quality of the approximation of $E(w)$ by $\hat{E}(w)$ depends largely on three factors: (1) the form of the original error surface, (2) the choice of pseudopatterns, and (3) the divergence from $w_0$. If we assume that the error surface close to $w_0$ is approximately quadratic, then in this neighborhood we would need only as many points as there are degrees of freedom in the weights to determine the shape of the bowl. Since we assume a quadratic approximation, the number of pseudopatterns required to determine the approximation should scale with the number of weights as the network grows in size. Finally, the further we move from $w_0$, the less accurate the Taylor series expansion of $E$ will be. HPBP requires only first-order derivative information and thus has a complexity of $O(n^2)$, where $n$ is the number of weights. Further, there exists an $O(n)$ approximation of the Hessian (Le Cun, 1987; Becker & Le Cun, 1988) that could also be produced by pseudopatterns. This would ensure that scaling would be linear in the number of weights. This approximation of the Hessian has only diagonal elements, and as a result, weight changes in the HPBP algorithm would require only local information. Explorations of the scaling performance of the network with the diagonal approximation to the Hessian are needed.
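The diagonal, $O(n)$ variant mentioned above can be sketched as follows. This is an illustrative fragment, not code from the article: only the terms $H_{ii}$, each a sum of squared first derivatives of the output with respect to a single weight, are retained:

```python
import numpy as np

def diagonal_pseudo_hessian(output_grads):
    """O(n) diagonal approximation of the pseudopattern Hessian: keep
    only H_ii = sum over pseudopatterns of (dy/dw_i)^2 and drop the
    off-diagonal terms, so each weight update needs only local
    information (cf. Le Cun, 1987; Becker & Le Cun, 1988).

    output_grads: array of shape (n_pseudopatterns, n_weights) holding
    dy/dw for a single-output network."""
    g = np.asarray(output_grads)
    return np.sum(g * g, axis=0)   # vector of H_ii, length n_weights

# Two pseudopatterns, two weights: H_ii sums the squared gradients.
grads = np.array([[1.0, 2.0],
                  [3.0, -1.0]])
print(diagonal_pseudo_hessian(grads))   # [10.  5.]
```

Storage and computation are both linear in the number of weights, compared with quadratic for the full matrix of equation 6.2.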
9 Conclusion
We have presented a connectionist learning algorithm that significantly improves network forgetting performance by turning noise to its advantage. A number of authors have shown that a certain amount of noise can enhance the performance of various systems in a wide range of contexts. For example, Linsker (1988) has shown that elementary perceptual detectors can emerge from noise. Numerous authors (e.g., Collins, Chow, & Imhoff, 1995; Grossberg & Grunewald, 1997; Sougné, 1998) have shown how the addition of noise to neural networks can enhance weak signal detection. In a neurobiological setting, Douglass, Wilkens, Pantazelou, and Moss (1993), Bezrukov and Vodyanoy (1995), and others have shown that optimal noise intensity in biological neurons can enhance signal detection. In this article, we have shown how noise can be harnessed to improve memory performance in feedforward backpropagation networks. We believe that this work, and the work by others on related problems, represents
the tip of the iceberg in the exploration of how noise can be turned from a problem into a performance-enhancing advantage.

Appendix: Calculating the Precise Terms of the Pseudopattern-Induced Hessian Matrix

In order to approximate the error surface associated with the originally learned patterns by means of pseudopatterns, it is sufficient to calculate the terms of the Hessian matrix. The Hessian matrix evaluated at $w_0$ is defined as follows:

$$H|_{w_0} = \begin{bmatrix} \dfrac{\partial^2 E}{\partial w_1 \partial w_1} & \cdots & \dfrac{\partial^2 E}{\partial w_1 \partial w_N} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 E}{\partial w_N \partial w_1} & \cdots & \dfrac{\partial^2 E}{\partial w_N \partial w_N} \end{bmatrix}_{w_0},$$

where $w_0$ is the vector of weights $(w_1^0, w_2^0, w_3^0, \ldots, w_N^0)$ that was a solution for the originally learned set of patterns. We have shown in equation 6.2 that for any two weights, $w_i$ and $w_j$, in a network with $N_{outputs}$ output nodes and for $N_\Psi$ pseudopatterns, the $\langle i,j \rangle$th term of the Hessian matrix is

$$\frac{\partial^2 E}{\partial w_i \partial w_j} = \sum_{p=1}^{N_\Psi} \sum_{n=1}^{N_{outputs}} \frac{\partial y_n^p}{\partial w_i}\frac{\partial y_n^p}{\partial w_j}.$$

For each pseudopattern and each output node (with output $y$) and for all pairs of weights $w_i$ and $w_j$, we calculate

$$\frac{\partial y}{\partial w_i}\frac{\partial y}{\partial w_j}.$$
Each term of the Hessian matrix will be the sum over all output nodes and over all pseudopatterns. The notation conventions are as follows: $y_a$ is the output from node $a$; $y'_a$ is the first derivative of the squashing function, evaluated at $y_a$, and for the standard squashing function, $y = \frac{1}{1 + e^{-x}}$, we have $y'_a = y_a(1 - y_a)$; $N_{outputs}$ is the number of output nodes; and $w_{ab}$, $w_{cd}$ are the weights from node $b$ to node $a$ and from node $d$ to node $c$. There are three cases of pairs of weights to consider:

Case I: When $w_{ab}, w_{cd} \in$ hidden-output layer:

$$\frac{\partial^2 E}{\partial w_{ab}\,\partial w_{cd}} = \begin{cases} (y'_a y_b)(y'_c y_d) = (y'_a)^2 y_b y_d & \text{if } a = c \\ 0 & \text{if } a \neq c \end{cases}$$
Case II: When $w_{ab} \in$ input-hidden layer and $w_{cd} \in$ hidden-output layer:

$$\frac{\partial^2 E}{\partial w_{ab}\,\partial w_{cd}} = (y'_a y_b)(y'_c y_d)(y'_a w_{ac}) = (y'_a)^2 y'_c y_b y_d w_{ac}$$

Case III: When $w_{ab}, w_{cd} \in$ input-hidden layer:

$$\frac{\partial^2 E}{\partial w_{ab}\,\partial w_{cd}} = (y'_a y_b)(y'_c y_d) \sum_{i=1}^{N_{outputs}} (y'_i w_{ia})(y'_i w_{ic}) = y'_a y'_c y_b y_d \sum_{i=1}^{N_{outputs}} (y'_i)^2 w_{ia} w_{ic}$$
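As an illustrative check (not part of the original appendix), the Case I formula can be evaluated for a small sigmoid network; the network sizes and names below are arbitrary. The symmetry of the resulting Hessian terms follows directly from the form of the product:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny 2-3-2 sigmoid network and one pseudopattern's activations.
rng = np.random.default_rng(2)
W1 = rng.normal(size=(3, 2))     # input -> hidden
W2 = rng.normal(size=(2, 3))     # hidden -> output
x = rng.random(2)
hidden = sigmoid(W1 @ x)
out = sigmoid(W2 @ hidden)
d_out = out * (1 - out)          # y'_a = y_a(1 - y_a) for output nodes

def case1(a, b, c, d):
    """Case I: both weights in the hidden-output layer.  The term is
    (y'_a y_b)(y'_c y_d) when a == c and zero otherwise."""
    if a != c:
        return 0.0
    return (d_out[a] * hidden[b]) * (d_out[c] * hidden[d])

# Spot-check symmetry: H_{ab,cd} equals H_{ad,cb} when a == c.
print(case1(0, 1, 0, 2) == case1(0, 2, 0, 1))   # True
```

A full implementation would sum such terms over all output nodes and all pseudopatterns, as the text specifies.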
Acknowledgments
This work has been supported in part by a grant from the European Commission, HPRN-CT-1999-00065. We thank Anthony Robins and Dave Noelle for their discussions of the ideas of this work. Particular thanks go to Gary Cottrell and an anonymous reviewer whose insightful comments contributed significantly to the quality of this article.

References

Ans, B., & Rousset, S. (1997). Avoiding catastrophic forgetting by coupling two reverberating neural networks. Académie des Sciences, Sciences de la vie, 320, 989–997.
Ans, B., & Rousset, S. (2000). Neural networks with a self-refreshing memory: Knowledge transfer in sequential learning tasks without catastrophic forgetting. Connection Science, 12, 1–19.
Becker, S., & Le Cun, Y. (1988). Improving the convergence of back-propagation learning with second order methods. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 29–37). San Mateo, CA: Morgan Kaufmann.
Bezrukov, S., & Vodyanoy, I. (1995). Noise induced enhancement of signal transduction across voltage-dependent ion channels. Nature, 378, 362–364.
Bishop, C. (1991). A fast procedure for retraining the multilayer perceptron. International Journal of Neural Systems, 2(3), 229–236.
Collins, J., Chow, C., & Imhoff, T. (1995). Stochastic resonance without tuning. Nature, 376, 236–238.
Douglass, J., Wilkens, L., Pantazelou, E., & Moss, F. (1993). Noise enhancement of information transfer in crayfish mechanoreceptors by stochastic resonance. Nature, 365, 337–340.
French, R. M. (1997). Pseudo-recurrent connectionist networks: An approach to the "sensitivity-stability" dilemma. Connection Science, 9, 353–379.
French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 128–135.
Using Noise to Compute Error Surfaces
1769
French, R. M., Ans, B., & Rousset, S. (2001). Pseudopatterns and dual-network memory models: Advantages and shortcomings. In R. French & J. Sougné (Eds.), Connectionist models of learning, development and evolution. London: Springer-Verlag.
Grossberg, S., & Grunewald, A. (1997). Cortical synchronization and perceptual framing. Journal of Cognitive Neuroscience, 9, 117–132.
Hetherington, P., & Seidenberg, M. (1989). Is there "catastrophic interference" in connectionist networks? In Proceedings of the 11th Annual Conference of the Cognitive Science Society (pp. 26–33). Hillsdale, NJ: Erlbaum.
Le Cun, Y. (1987). Modèles connexionnistes de l'apprentissage. Unpublished doctoral dissertation, Université Pierre et Marie Curie, Paris, France.
Linsker, R. (1988, March). Self-organization in a perceptual network. Computer, 105–117.
Murphy, P., & Aha, D. (1992). UCI repository of machine learning databases. Irvine, CA: University of California.
Robins, A. (1995). Catastrophic forgetting, rehearsal, and pseudorehearsal. Connection Science, 7, 123–146.
Sougné, J. (1998). Period doubling as a means of representing multiply-instantiated entities. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 1007–1012). Hillsdale, NJ: Erlbaum.

Received June 4, 2001; accepted January 7, 2002.
ARTICLE
Communicated by Javier Movellan
Training Products of Experts by Minimizing Contrastive Divergence Geoffrey E. Hinton
[email protected] Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual "expert" models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called "contrastive divergence," whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
© 2002 Massachusetts Institute of Technology. Neural Computation 14, 1771–1800 (2002)

1 Introduction

One way of modeling a complicated, high-dimensional data distribution is to use a large number of relatively simple probabilistic models and somehow combine the distributions specified by each model. A well-known example of this approach is a mixture of gaussians, in which each simple model is a gaussian and the combination rule consists of taking a weighted arithmetic mean of the individual distributions. This is equivalent to assuming an overall generative model in which each data vector is generated by first choosing one of the individual generative models and then allowing that individual model to generate the data vector. Combining models by forming a mixture is attractive for several reasons. It is easy to fit mixtures of tractable models to data using expectation-maximization (EM) or gradient ascent, and mixtures are usually considerably more powerful than their individual components. Indeed, if sufficiently many models are included in
the mixture, it is possible to approximate complicated smooth distributions arbitrarily accurately. Unfortunately, mixture models are very inefficient in high-dimensional spaces. Consider, for example, the manifold of face images. It takes about 35 real numbers to specify the shape, pose, expression, and illumination of a face, and under good viewing conditions, our perceptual systems produce a sharp posterior distribution on this 35-dimensional manifold. This cannot be done using a mixture of models, each tuned in the 35-dimensional manifold, because the posterior distribution cannot be sharper than the individual models in the mixture, and the individual models must be broadly tuned to allow them to cover the 35-dimensional manifold. A very different way of combining distributions is to multiply them together and renormalize. High-dimensional distributions, for example, are often approximated as the product of one-dimensional distributions. If the individual distributions are uni- or multivariate gaussians, their product will also be a multivariate gaussian, so, unlike mixtures of gaussians, products of gaussians cannot approximate arbitrary smooth distributions. If, however, the individual models are a bit more complicated and each contains one or more latent (i.e., hidden) variables, multiplying their distributions together (and renormalizing) can be very powerful. Individual models of this kind will be called "experts." Products of experts (PoE) can produce much sharper distributions than the individual expert models. For example, each expert model can constrain a different subset of the dimensions in a high-dimensional space, and their product will then constrain all of the dimensions. For modeling handwritten digits, one low-resolution model can generate images that have the approximate overall shape of the digit, and other more local models can ensure that small image patches contain segments of stroke with the correct fine structure.
For modeling sentences, each expert can enforce a nugget of linguistic knowledge. For example, one expert could ensure that the tenses agree, one could ensure that there is number agreement between the subject and verb, and one could ensure that strings in which color adjectives follow size adjectives are more probable than the reverse. Fitting a PoE to data appears difficult because it appears to be necessary to compute the derivatives, with respect to the parameters, of the partition function that is used in the renormalization. As we shall see, however, these derivatives can be finessed by optimizing a less obvious objective function than the log likelihood of the data.
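The multiply-and-renormalize rule for a discrete space can be sketched as follows (an illustrative fragment, not from the article). Two individually vague experts produce a sharp product because each rules out different states:

```python
import numpy as np

def poe_distribution(expert_probs):
    """Multiply the experts' probabilities pointwise over all possible
    states, then renormalize over the whole (discrete) data space.
    expert_probs has shape (n_experts, n_states)."""
    f = np.prod(np.asarray(expert_probs), axis=0)
    return f / f.sum()

# Two experts over a 4-state space; each is broad on its own, but they
# disagree about which states to rule out, so the product is sharp.
e1 = [0.4, 0.4, 0.1, 0.1]
e2 = [0.1, 0.4, 0.4, 0.1]
p = poe_distribution([e1, e2])
print(p)   # state 1 dominates: p is approximately [0.16, 0.64, 0.16, 0.04]
```

Neither expert alone gives any state more than probability 0.4, yet the product concentrates 0.64 of its mass on the one state both experts favor, illustrating why a PoE can be much sharper than its components.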
2 Learning Products of Experts by Maximizing Likelihood

We consider individual expert models for which it is tractable to compute the derivative of the log probability of a data vector with respect to the
parameters of the expert. We combine n individual expert models as follows: P m fm (d | µ m ) p (d | µ1 , . . . , µ n ) D P , c P m fm (c | µ m )
(2.1)
where $\mathbf{d}$ is a data vector in a discrete space, $\theta_m$ is all the parameters of individual model $m$, $f_m(\mathbf{d} \mid \theta_m)$ is the probability of $\mathbf{d}$ under model $m$, and $\mathbf{c}$ indexes all possible vectors in the data space.¹ For continuous data spaces, the sum is replaced by the appropriate integral. For an individual expert to fit the data well, it must give high probability to the observed data, and it must waste as little probability as possible on the rest of the data space. A PoE, however, can fit the data well even if each expert wastes a lot of its probability on inappropriate regions of the data space, provided different experts waste probability in different regions. The obvious way to fit a PoE to a set of observed independently and identically distributed (i.i.d.) data vectors² is to follow the derivative of the log likelihood of each observed vector, $\mathbf{d}$, under the PoE. This is given by

$$\frac{\partial \log p(\mathbf{d} \mid \theta_1, \ldots, \theta_n)}{\partial \theta_m} = \frac{\partial \log f_m(\mathbf{d} \mid \theta_m)}{\partial \theta_m} - \sum_{\mathbf{c}} p(\mathbf{c} \mid \theta_1, \ldots, \theta_n) \frac{\partial \log f_m(\mathbf{c} \mid \theta_m)}{\partial \theta_m}. \tag{2.2}$$
The second term on the right-hand side of equation 2.2 is just the expected derivative of the log probability of an expert on fantasy data, c, that is generated from the PoE. So, assuming that each of the individual experts has a tractable derivative, the obvious difficulty in estimating the derivative of the log probability of the data under the PoE is generating correctly distributed fantasy data. This can be done in various ways. For discrete data, it is possible to use rejection sampling: each expert generates a data vector independently, and this process is repeated until all the experts happen to agree. Rejection sampling is a good way of understanding how a PoE specifies an overall probability distribution and how different it is from a causal model, but it is typically very inefficient. Gibbs sampling is typically much more efficient. In Gibbs sampling, each variable draws a sample from its posterior distribution given the current states of the other variables (Geman & Geman, 1984). Given the data, the hidden states of all the experts can always be updated in parallel because they are conditionally independent. This is an important consequence of the product formulation.3 If the individual experts also have the property that the components of the data vector are conditionally independent given the hidden state of the expert, the hidden and visible variables form a bipartite graph, and it is possible to update all of the components of the data vector in parallel given the hidden states of all the experts. So Gibbs sampling can alternate between parallel updates of the hidden and visible variables (see Figure 1).

To get an unbiased estimate of the gradient for the PoE, it is necessary for the Markov chain to converge to the equilibrium distribution. Unfortunately, even if it is computationally feasible to approach the equilibrium distribution before taking samples, there is a second, serious difficulty. Samples from the equilibrium distribution generally have high variance since they come from all over the model's distribution. This high variance swamps the estimate of the derivative. Worse still, the variance in the samples depends on the parameters of the model. This variation in the variance causes the parameters to be repelled from regions of high variance even if the gradient is zero. To understand this subtle effect, consider a horizontal sheet of tin that is resonating in such a way that some parts have strong vertical oscillations and other parts are motionless. Sand scattered on the tin will accumulate in the motionless areas even though the time-averaged gradient is zero everywhere.

1 So long as f_m(d | µ_m) is positive, it does not need to be a probability at all, though it will generally be a probability in this article.
2 For time-series models, d is a whole sequence.
3 The conditional independence is obvious in the undirected graphical model of a PoE because the only path between the hidden states of two experts is via the observed data.

1774
Geoffrey E. Hinton

Figure 1: A visualization of alternating Gibbs sampling. At time 0, the visible variables represent a data vector, and the hidden variables of all the experts are updated in parallel with samples from their posterior distribution given the visible variables. At time 1, the visible variables are all updated to produce a reconstruction of the original data vector from the hidden variables, and then the hidden variables are again updated in parallel. If this process is repeated sufficiently often, it is possible to get arbitrarily close to the equilibrium distribution. The correlations ⟨s_i s_j⟩ shown on the connections between visible and hidden variables are the statistics used for learning in RBMs, which are described in section 7.

3 Learning by Minimizing Contrastive Divergence

Maximizing the log likelihood of the data (averaged over the data distribution) is equivalent to minimizing the Kullback-Leibler divergence between the data distribution, P^0, and the equilibrium distribution over the visible variables, P^∞, that is produced by prolonged Gibbs sampling from the generative model,4

$$ P^0 \,\|\, P^\infty = \sum_{\mathbf{d}} P^0(\mathbf{d}) \log P^0(\mathbf{d}) - \sum_{\mathbf{d}} P^0(\mathbf{d}) \log P^\infty(\mathbf{d}) = -H(P^0) - \left\langle \log P^\infty \right\rangle_{P^0}, \tag{3.1} $$
where ‖ denotes a Kullback-Leibler divergence, the angle brackets denote expectations over the distribution specified as a subscript, and H(P^0) is the entropy of the data distribution. P^0 does not depend on the parameters of the model, so H(P^0) can be ignored during the optimization. Note that P^∞(d) is just another way of writing p(d | µ_1, …, µ_n). Equation 2.2, averaged over the data distribution, can be rewritten as

$$ \left\langle \frac{\partial \log P^\infty(\mathbf{D})}{\partial \mu_m} \right\rangle_{P^0} = \left\langle \frac{\partial \log \hat{f}_m}{\partial \mu_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log \hat{f}_m}{\partial \mu_m} \right\rangle_{P^\infty}, \tag{3.2} $$
where log f̂_m is a random variable that could be written as log f_m(D | µ_m), with D itself being a random variable corresponding to the data. There is a simple and effective alternative to maximum likelihood learning that eliminates almost all of the computation required to get samples from the equilibrium distribution and also eliminates much of the variance that masks the gradient signal. This alternative approach involves optimizing a different objective function. Instead of just minimizing P^0 ‖ P^∞, we minimize the difference between P^0 ‖ P^∞ and P^1 ‖ P^∞, where P^1 is the distribution over the "one-step" reconstructions of the data vectors generated by one full step of Gibbs sampling (see Figure 1). The intuitive motivation for using this "contrastive divergence" is that we would like the Markov chain that is implemented by Gibbs sampling to leave the initial distribution P^0 over the visible variables unaltered. Instead of running the chain to equilibrium and comparing the initial and final derivatives, we can simply run the chain for one full step and then update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step. Because P^1 is one step closer to the equilibrium distribution than P^0, we are guaranteed that P^0 ‖ P^∞ exceeds P^1 ‖ P^∞ unless P^0 equals P^1, so the contrastive divergence can never be negative. Also, for Markov chains in which all transitions have nonzero probability, P^0 = P^1 implies P^0 = P^∞, because if the distribution does not change at all on the first step, it must already be at equilibrium, so the contrastive divergence can be zero only if the model is perfect.5

4 P^0 is a natural way to denote the data distribution if we imagine starting a Markov chain at the data distribution at time 0.

Another way of understanding contrastive divergence learning is to view it as a method of eliminating all the ways in which the PoE model would like to distort the true data. This is done by ensuring that, on average, the reconstruction is no more probable under the PoE model than the original data vector. The mathematical motivation for the contrastive divergence is that the intractable expectation over P^∞ on the right-hand side of equation 3.2 cancels out:

$$ -\frac{\partial}{\partial \mu_m} \left( P^0 \,\|\, P^\infty - P^1 \,\|\, P^\infty \right) = \left\langle \frac{\partial \log \hat{f}_m}{\partial \mu_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log \hat{f}_m}{\partial \mu_m} \right\rangle_{P^1} + \frac{\partial P^1}{\partial \mu_m} \frac{\partial \left( P^1 \,\|\, P^\infty \right)}{\partial P^1}. \tag{3.3} $$
If each expert is chosen to be tractable, it is possible to compute the exact values of the derivative of log f_m(d | µ_m) for a data vector, d. It is also straightforward to sample from P^0 and P^1, so the first two terms on the right-hand side of equation 3.3 are tractable. By definition, the following procedure produces an unbiased sample from P^1:

1. Pick a data vector, d, from the distribution of the data P^0.

2. Compute, for each expert separately, the posterior probability distribution over its latent (i.e., hidden) variables given the data vector, d.

3. Pick a value for each latent variable from its posterior distribution.

4. Given the chosen values of all the latent variables, compute the conditional distribution over all the visible variables by multiplying together the conditional distributions specified by each expert and renormalizing.

5. Pick a value for each visible variable from the conditional distribution. These values constitute the reconstructed data vector, d̂.

The third term on the right-hand side of equation 3.3 represents the effect on P^1 ‖ P^∞ of the change of the distribution of the one-step reconstructions caused by a change in µ_m. It is problematic to compute, but extensive simulations (see section 10) show that it can safely be ignored because it is small and seldom opposes the resultant of the other two terms. The parameters of the experts can therefore be adjusted in proportion to the approximate derivative of the contrastive divergence:

$$ \Delta \mu_m \propto \left\langle \frac{\partial \log \hat{f}_m}{\partial \mu_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log \hat{f}_m}{\partial \mu_m} \right\rangle_{P^1}. \tag{3.4} $$

5 It is obviously possible to make the contrastive divergence small by using a Markov chain that mixes very slowly, even if the data distribution is very far from the eventual equilibrium distribution. It is therefore important to ensure mixing by using techniques such as weight decay that ensure that every possible visible vector has a nonzero probability given the states of the hidden variables.
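The update rule in equation 3.4 can be illustrated with a minimal sketch of what may be the simplest possible PoE: two gaussian experts with unit variance and learnable means, fitted to scalar data. These experts have no latent variables, so the reconstruction procedure reduces to sampling from the normalized product of the experts, which is itself a gaussian. All numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy data: scalar samples whose mean the product should capture.
data = rng.normal(loc=2.0, scale=0.7, size=2000)

mu = np.array([0.0, 0.0])   # learnable means of the two gaussian experts
lr = 0.05

for x in data:
    # With unit-variance experts, the normalized product of N(mu[0], 1)
    # and N(mu[1], 1) is N((mu[0] + mu[1]) / 2, 1/2). Since there are no
    # latent variables, the one-step reconstruction is a sample from it.
    x_hat = rng.normal(mu.mean(), np.sqrt(0.5))
    # d log f_m / d mu_m = (x - mu_m), so equation 3.4 gives the update
    # (data-term derivative) minus (reconstruction-term derivative):
    mu += lr * ((x - mu) - (x_hat - mu))   # simplifies to lr * (x - x_hat)

print(mu.mean())   # drifts toward the data mean of ~2.0
```

Because the µ_m terms cancel, both experts receive the same update and the mean of the product distribution performs a stochastic approximation to the data mean.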
This works very well in practice even when a single reconstruction of each data vector is used in place of the full probability distribution over reconstructions. The difference in the derivatives of the data vectors and their reconstructions has some variance because the reconstruction procedure is stochastic. But when the PoE is modeling the data moderately well, the one-step reconstructions will be very similar to the data, so the variance will be very small. The close match between a data vector and its reconstruction reduces sampling variance in much the same way as the use of matched pairs for experimental and control conditions in a clinical trial. The low variance makes it feasible to perform on-line learning after each data vector is presented, though the simulations described in this article use mini-batch learning in which the parameter updates are based on the summed gradients measured on a rotating subset of the complete training set.6 There is an alternative justification for the learning algorithm in equation 3.4. In high-dimensional data sets, the data nearly always lie on, or close to, a much lower-dimensional, smoothly curved manifold. The PoE needs to find parameters that make a sharp ridge of log probability along the low-dimensional manifold. By starting with a point on the manifold and ensuring that the typical reconstructions from the latent variables of all the experts do not have significantly higher probability, the PoE ensures that the probability distribution has the right local structure. It is possible that the PoE will accidentally assign high probability to other distant and unvisited parts of the data space, but this is unlikely if the log probability surface is smooth and both its height and its local curvature are constrained at the data points.
It is also possible to find and eliminate such points by performing prolonged Gibbs sampling without any data, but this is just a way of improving the learning and not, as in Boltzmann machine learning, an essential part of it.

4 A Simple Example

PoEs should work very well on data distributions that can be factorized into a product of lower-dimensional distributions. This is demonstrated in Figures 2 and 3. There are 15 "unigauss" experts, each of which is a mixture of a uniform distribution and a single axis-aligned gaussian. In the fitted model, each tight data cluster is represented by the intersection of two experts' gaussians, which are elongated along different axes. Using a conservative learning rate, the fitting required 2000 updates of the parameters. For each update of the parameters, the following computation is performed on every observed data vector:

1. Given the data, d, calculate the posterior probability of selecting the gaussian rather than the uniform in each expert and compute the first term on the right-hand side of equation 3.4.

2. For each expert, stochastically select the gaussian or the uniform according to the posterior. Compute the normalized product of the selected gaussians, which is itself a gaussian, and sample from it to get a "reconstructed" vector in the data space. To avoid problems, there is one special expert that is constrained to always pick its gaussian.

3. Compute the negative term in equation 3.4 using the reconstructed data vector.

6 Mini-batch learning makes better use of the ability of Matlab to vectorize across training examples.

Figure 2: Each dot is a data point. The data have been fitted with a product of 15 experts. The ellipses show the one standard deviation contours of the gaussians in each expert. The experts are initialized with randomly located, circular gaussians that have about the same variance as the data. The five unneeded experts remain vague, but the mixing proportions, which determine the prior probability with which each of these unigauss experts selects its gaussian rather than its uniform, remain high.

5 Learning a Population Code

A PoE can also be a very effective model when each expert is quite broadly tuned on every dimension and precision is obtained by the intersection of a large number of experts. Figure 4 shows what happens when the contrastive divergence learning algorithm is used to fit 40 unigauss experts to 100-dimensional synthetic images that each contain one edge. The edges varied in their orientation, position, and intensities on each side of the edge. The intensity profile across the edge was a sigmoid. Each expert also learned a variance for each pixel, and although these variances varied, individual experts did not specialize in a small subset of the dimensions. Given an image, about half of the experts have a high probability of picking their gaussian rather than their uniform. The products of the chosen gaussians are excellent reconstructions of the image. The experts at the top of Figure 4 look like edge detectors in various orientations, positions, and polarities. Many of the experts farther down have even symmetry and are used to locate one end of an edge. They each work for two different sets of edges that have opposite polarities and different positions.

Figure 3: Three hundred data points generated by prolonged Gibbs sampling from the 15 experts fitted in Figure 2. The Gibbs sampling started from a random point in the range of the data and used 25 parallel iterations with annealing. Notice that the fitted model generates data at the grid point that is missing in the real data.

6 Initializing the Experts

One way to initialize a PoE is to train each expert separately, forcing the experts to differ by giving them different or differently weighted training cases, by training them on different subsets of the data dimensions, or by using different model classes for the different experts. Once each expert has been initialized separately, the individual probability distributions need to be raised to a fractional power to create the initial PoE. Separate initialization of the experts seems like a sensible idea, but simulations indicate that the PoE is far more likely to become trapped in poor local optima if the experts are allowed to specialize separately. Better solutions are obtained by simply initializing the experts randomly with very vague distributions and using the learning rule in equation 3.4.
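The core computation in the unigauss examples above, the normalized product of the selected axis-aligned gaussians (step 2 of the fitting procedure in section 4), can be sketched as follows. The function name and numbers are invented for illustration; the key fact is that precisions add, so the product is tighter than any individual expert.

```python
import numpy as np

def product_of_gaussians(means, variances):
    """Normalized product of axis-aligned gaussians (one row per expert).

    The product of gaussian densities is itself gaussian: the precisions
    (inverse variances) add, and the product mean is the precision-weighted
    average of the expert means.
    """
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    prod_var = 1.0 / precisions.sum(axis=0)
    prod_mean = prod_var * (precisions * means).sum(axis=0)
    return prod_mean, prod_var

# Two invented 2-D experts, each tight along a different axis: their
# product is tight along both axes, like the clusters of Figure 2.
mean, var = product_of_gaussians(means=[[0.0, 3.0], [1.0, 3.0]],
                                 variances=[[0.01, 4.0], [4.0, 0.01]])
print(mean, var)
```

Sampling the "reconstructed" vector of step 2 is then just a draw from N(mean, var).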
Figure 4: (a) Some 10 × 10 images that each contain a single intensity edge. The location, orientation, and contrast of the edge all vary. (b) The means of all the 100-dimensional gaussians in a product of 40 experts, each of which is a mixture of a gaussian and a uniform. The PoE was fitted to 500 images of the type shown on the left. The experts have been ordered by hand so that qualitatively similar experts are adjacent.
7 PoEs and Boltzmann Machines

The Boltzmann machine learning algorithm (Hinton & Sejnowski, 1986) is theoretically elegant and easy to implement in hardware but very slow in networks with interconnected hidden units because of the variance problems described in section 2. Smolensky (1986) introduced a restricted type of Boltzmann machine with one visible layer, one hidden layer, and no intralayer connections. Freund and Haussler (1992) realized that in this restricted Boltzmann machine (RBM), the probability of generating a visible vector is proportional to the product of the probabilities that the visible vector would be generated by each of the hidden units acting alone. An RBM is therefore a PoE with one expert per hidden unit.7 When the hidden unit of an expert is off, it specifies a separable probability distribution in which each visible unit is equally likely to be on or off. When the hidden unit is on, it specifies a different factorial distribution by using the weight on its connection to each visible unit to specify the log odds that the visible unit is on. Multiplying together the distributions over the visible states specified by different experts is achieved by simply adding the log odds. Exact inference of the hidden states given the visible data is tractable in an RBM because the states of the hidden units are conditionally independent given the data.

7 Boltzmann machines and PoEs are very different classes of probabilistic generative model, and the intersection of the two classes is RBMs.

The learning algorithm given by equation 2.2 is exactly equivalent to the standard Boltzmann learning algorithm for an RBM. Consider the derivative of the log probability of the data with respect to the weight w_ij between a visible unit i and a hidden unit j. The first term on the right-hand side of equation 2.2 is

$$ \frac{\partial \log f_j(\mathbf{d} \mid \mathbf{w}_j)}{\partial w_{ij}} = \langle s_i s_j \rangle_{\mathbf{d}} - \langle s_i s_j \rangle_{P^\infty(j)}, \tag{7.1} $$
where w_j is the vector of weights connecting hidden unit j to the visible units, ⟨s_i s_j⟩_d is the expected value of s_i s_j when d is clamped on the visible units and s_j is sampled from its posterior distribution given d, and ⟨s_i s_j⟩_{P^∞(j)} is the expected value of s_i s_j when alternating Gibbs sampling of the hidden and visible units is iterated to get samples from the equilibrium distribution in a network whose only hidden unit is j. The second term on the right-hand side of equation 2.2 is

$$ \sum_{\mathbf{c}} p(\mathbf{c} \mid \mathbf{w})\, \frac{\partial \log f_j(\mathbf{c} \mid \mathbf{w}_j)}{\partial w_{ij}} = \langle s_i s_j \rangle_{P^\infty} - \langle s_i s_j \rangle_{P^\infty(j)}, \tag{7.2} $$
where w is all of the weights in the RBM and ⟨s_i s_j⟩_{P^∞} is the expected value of s_i s_j when alternating Gibbs sampling of all the hidden and all the visible units is iterated to get samples from the equilibrium distribution of the RBM. Subtracting equation 7.2 from equation 7.1 and taking expectations over the distribution of the data gives

$$ \left\langle \frac{\partial \log P^\infty}{\partial w_{ij}} \right\rangle_{P^0} = -\frac{\partial \left( P^0 \,\|\, P^\infty \right)}{\partial w_{ij}} = \langle s_i s_j \rangle_{P^0} - \langle s_i s_j \rangle_{P^\infty}. \tag{7.3} $$
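Estimating the ⟨s_i s_j⟩ statistics relies on the alternating Gibbs sampling described in section 2. A minimal sketch for an RBM, under the simplifying assumption that bias terms are omitted; the conditionals are the standard sigmoids obtained by adding the log odds contributed by each unit on the other layer:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, rng):
    """One full step of alternating Gibbs sampling in an RBM.

    The hidden units are conditionally independent given the visible
    vector, so they are sampled in parallel, and vice versa. Bias terms
    are omitted for brevity.
    """
    p_h = sigmoid(v @ W)                           # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T)                         # p(v_i = 1 | h)
    v_next = (rng.random(p_v.shape) < p_v).astype(float)
    return h, v_next

W = rng.normal(0.0, 0.1, size=(6, 4))              # 6 visible, 4 hidden units
v0 = rng.integers(0, 2, size=6).astype(float)      # an arbitrary visible vector
h0, v1 = gibbs_step(v0, W, rng)                    # states for <s_i s_j> statistics
```

Iterating `gibbs_step` many times approaches the equilibrium distribution P^∞; stopping after the first reconstruction gives the one-step distribution P^1 used below.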
The time required to approach equilibrium and the high sampling variance in ⟨s_i s_j⟩_{P^∞} make learning difficult. It is much more effective to use the approximate gradient of the contrastive divergence. For an RBM, this approximate gradient is particularly easy to compute:

$$ -\frac{\partial}{\partial w_{ij}} \left( P^0 \,\|\, P^\infty - P^1 \,\|\, P^\infty \right) \approx \langle s_i s_j \rangle_{P^0} - \langle s_i s_j \rangle_{P^1}, \tag{7.4} $$
where ⟨s_i s_j⟩_{P^1} is the expected value of s_i s_j when one-step reconstructions are clamped on the visible units and s_j is sampled from its posterior distribution given the reconstruction (see Figure 1).

8 Learning the Features of Handwritten Digits

When presented with real high-dimensional data, a restricted Boltzmann machine trained to minimize the contrastive divergence using equation 7.4 should learn a set of probabilistic binary features that model the data well. To test this conjecture, an RBM with 500 hidden units and 256 visible units was trained on 8000 16 × 16 real-valued images of handwritten digits from all 10 classes. The images, from the "br" training set on the USPS Cedar ROM1, were normalized in width and height, but they were highly variable in style. The pixel intensities were normalized to lie between 0 and 1 so that they could be treated as probabilities, and equation 7.4 was modified to use probabilities in place of stochastic binary values for both the data and the one-step reconstructions:
$$ -\frac{\partial}{\partial w_{ij}} \left( P^0 \,\|\, P^\infty - P^1 \,\|\, P^\infty \right) \approx \langle p_i p_j \rangle_{P^0} - \langle p_i p_j \rangle_{P^1}. \tag{8.1} $$
Stochastically chosen binary states of the hidden units were still used for computing the probabilities of the reconstructed pixels, but instead of picking binary states for the pixels from those probabilities, the probabilities themselves were used as the reconstructed data vector. It takes 10 hours in Matlab 5.3 on a 500 MHz Pentium II workstation to perform 658 epochs of learning. This is much faster than standard Boltzmann machine learning, comparable with the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995), and considerably slower than using EM to fit a mixture model with the same number of parameters. In each epoch, the weights were updated 80 times using the approximate gradient of the contrastive divergence computed on mini-batches of size 100 that contained 10 exemplars of each digit class. The learning rate was set empirically to be about one-quarter of the rate that caused divergent oscillations in the parameters. To improve the learning speed further, a momentum method was used. After the first 10 epochs, the parameter updates specified by equation 8.1 were supplemented by adding 0.9 times the previous update. The PoE learned localized features whose binary states yielded almost perfect reconstructions. For each image, about one-third of the features were turned on. Some of the learned features had on-center off-surround receptive fields or vice versa, some looked like pieces of stroke, and some looked like Gabor filters or wavelets. The weights of 100 of the hidden units, selected at random, are shown in Figure 5.
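The training step described above can be sketched as follows. This is a reconstruction under stated assumptions, not the original Matlab code: the sizes, rates, and data are invented, and bias terms are omitted. It uses hidden probabilities in the statistics of equation 8.1, stochastic binary hidden states only for computing the reconstruction, probabilities as the reconstructed data vector, and momentum after the first 10 epochs.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))

def cd1_gradient(batch, W, rng):
    """Approximate contrastive divergence gradient in the style of eq. 8.1.

    Stochastic binary hidden states produce the reconstruction, but the
    reconstruction keeps the pixel probabilities, and the statistics use
    probabilities for both factors. Bias terms are omitted for brevity.
    """
    p_h = sigmoid(batch @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    recon = sigmoid(h @ W.T)                 # probabilities, not sampled pixels
    p_h_recon = sigmoid(recon @ W)
    # <p_i p_j>_{P^0} - <p_i p_j>_{P^1}, averaged over the mini-batch
    return (batch.T @ p_h - recon.T @ p_h_recon) / len(batch)

def recon_error(v, W):
    """Mean-field reconstruction error, used only as a progress measure."""
    return float(np.mean((sigmoid(sigmoid(v @ W) @ W.T) - v) ** 2))

# Invented toy data: 100 examples drawn from two noiseless 6-pixel patterns.
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=100)]

W = rng.normal(0.0, 0.1, size=(6, 4))        # 6 visible, 4 hidden units
W0 = W.copy()
velocity = np.zeros_like(W)
lr = 0.1

for epoch in range(200):
    momentum = 0.9 if epoch >= 10 else 0.0   # momentum after the first 10 epochs
    for i in range(0, len(data), 10):        # mini-batches of 10
        velocity = momentum * velocity + lr * cd1_gradient(data[i:i + 10], W, rng)
        W += velocity

print(recon_error(data, W0), recon_error(data, W))
```

On this toy data, the reconstruction error after training should fall well below the error of the randomly initialized weights, for which the reconstructions are near 0.5 everywhere.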
Figure 5: The receptive fields of a randomly selected subset of the 500 hidden units in a PoE that was trained on 8000 images of digits with equal numbers from each class. Each block shows the 256 learned weights connecting a hidden unit to the pixels. The scale goes from +2 (white) to −2 (black).
9 Using Learned Models of Handwritten Digits for Discrimination

An attractive aspect of PoEs is that it is easy to compute the numerator in equation 2.1, so it is easy to compute the log probability of a data vector up to an additive constant, log Z, which is the log of the denominator in equation 2.1. Unfortunately, it is very hard to compute this additive constant. This does not matter if we want to compare only the probabilities of two different data vectors under the PoE, but it makes it difficult to evaluate the model learned by a PoE, because the obvious way to measure the success of learning is to sum the log probabilities that the PoE assigns to test data vectors drawn from the same distribution as the training data but not used during training. For a novelty detection task, it would be possible to train a PoE on "normal" data and then to learn a single scalar threshold value for the unnormal-